summaryrefslogtreecommitdiff
path: root/mm/migrate.c
AgeCommit message (Collapse)Author
2023-12-20mm: migrate high-order folios in swap cache correctlyCharan Teja Kalla
Large folios occupy N consecutive entries in the swap cache instead of using multi-index entries like the page cache. However, if a large folio is re-added to the LRU list, it can be migrated. The migration code was not aware of the difference between the swap cache and the page cache and assumed that a single xas_store() would be sufficient. This leaves potentially many stale pointers to the now-migrated folio in the swap cache, which can lead to almost arbitrary data corruption in the future. This can also manifest as infinite loops with the RCU read lock held. [willy@infradead.org: modifications to the changelog & tweaked the fix] Fixes: 3417013e0d18 ("mm/migrate: Add folio_migrate_mapping()") Link: https://lkml.kernel.org/r/20231214045841.961776-1-willy@infradead.org Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reported-by: Charan Teja Kalla <quic_charante@quicinc.com> Closes: https://lkml.kernel.org/r/1700569840-17327-1-git-send-email-quic_charante@quicinc.com Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-25mm: migrate: record the mlocked page status to remove unnecessary lru drainBaolin Wang
When doing compaction, I found the lru_add_drain() is an obvious hotspot when migrating pages. The distribution of this hotspot is as follows: - 18.75% compact_zone - 17.39% migrate_pages - 13.79% migrate_pages_batch - 11.66% migrate_folio_move - 7.02% lru_add_drain + 7.02% lru_add_drain_cpu + 3.00% move_to_new_folio 1.23% rmap_walk + 1.92% migrate_folio_unmap + 3.20% migrate_pages_sync + 0.90% isolate_migratepages The lru_add_drain() was added by commit c3096e6782b7 ("mm/migrate: __unmap_and_move() push good newpage to LRU") to drain the newpage to LRU immediately, to help to build up the correct newpage->mlock_count in remove_migration_ptes() for mlocked pages. However, if there are no mlocked pages are migrating, then we can avoid this lru drain operation, especailly for the heavy concurrent scenarios. So we can record the source pages' mlocked status in migrate_folio_unmap(), and only drain the lru list when the mlocked status is set in migrate_folio_move(). In addition, the page was already isolated from lru when migrating, so checking the mlocked status is stable by folio_test_mlocked() in migrate_folio_unmap(). After this patch, I can see the hotpot of the lru_add_drain() is gone: - 9.41% migrate_pages_batch - 6.15% migrate_folio_move - 3.64% move_to_new_folio + 1.80% migrate_folio_extra + 1.70% buffer_migrate_folio + 1.41% rmap_walk + 0.62% folio_add_lru + 3.07% migrate_folio_unmap Meanwhile, the compaction latency shows some improvements when running thpscale: base patched Amean fault-both-1 1131.22 ( 0.00%) 1112.55 * 1.65%* Amean fault-both-3 2489.75 ( 0.00%) 2324.15 * 6.65%* Amean fault-both-5 3257.37 ( 0.00%) 3183.18 * 2.28%* Amean fault-both-7 4257.99 ( 0.00%) 4079.04 * 4.20%* Amean fault-both-12 6614.02 ( 0.00%) 6075.60 * 8.14%* Amean fault-both-18 10607.78 ( 0.00%) 8978.86 * 15.36%* Amean fault-both-24 14911.65 ( 0.00%) 11619.55 * 22.08%* Amean fault-both-30 14954.67 ( 0.00%) 14925.66 * 0.19%* Amean fault-both-32 16654.87 ( 0.00%) 15580.31 * 6.45%* Link: https://lkml.kernel.org/r/06e9153a7a4850352ec36602df3a3a844de45698.1697859741.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yin Fengwei <fengwei.yin@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-25mm/migrate: add nr_split to trace_mm_migrate_pages stats.Zi Yan
Add nr_split to trace_mm_migrate_pages for large folio (including THP) split events. [akpm@linux-foundation.org: cleanup per Huang, Ying] Link: https://lkml.kernel.org/r/20231017163129.2025214-2-zi.yan@sent.com Signed-off-by: Zi Yan <ziy@nvidia.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-25mm/migrate: correct nr_failed in migrate_pages_sync()Zi Yan
nr_failed was missing the large folio splits from migrate_pages_batch() and can cause a mismatch between migrate_pages() return value and the number of not migrated pages, i.e., when the return value of migrate_pages() is 0, there are still pages left in the from page list. It will happen when a non-PMD THP large folio fails to migrate due to -ENOMEM and is split successfully but not all the split pages are not migrated, migrate_pages_batch() would return non-zero, but astats.nr_thp_split = 0. nr_failed would be 0 and returned to the caller of migrate_pages(), but the not migrated pages are left in the from page list without being added back to LRU lists. Fix it by adding a new nr_split counter for large folio splits and adding it to nr_failed in migrate_page_sync() after migrate_pages_batch() is done. Link: https://lkml.kernel.org/r/20231017163129.2025214-1-zi.yan@sent.com Fixes: 2ef7dbb26990 ("migrate_pages: try migrate in batch asynchronously firstly") Signed-off-by: Zi Yan <ziy@nvidia.com> Acked-by: Huang Ying <ying.huang@intel.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-25mm: migrate: use folio_xchg_last_cpupid() in folio_migrate_flags()Kefeng Wang
Convert to use folio_xchg_last_cpupid() in folio_migrate_flags(), also directly use folio_nid() instead of page_to_nid(&folio->page). Link: https://lkml.kernel.org/r/20231018140806.2783514-15-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-18hugetlb: memcg: account hugetlb-backed memory in memory controllerNhat Pham
Currently, hugetlb memory usage is not acounted for in the memory controller, which could lead to memory overprotection for cgroups with hugetlb-backed memory. This has been observed in our production system. For instance, here is one of our usecases: suppose there are two 32G containers. The machine is booted with hugetlb_cma=6G, and each container may or may not use up to 3 gigantic page, depending on the workload within it. The rest is anon, cache, slab, etc. We can set the hugetlb cgroup limit of each cgroup to 3G to enforce hugetlb fairness. But it is very difficult to configure memory.max to keep overall consumption, including anon, cache, slab etc. fair. What we have had to resort to is to constantly poll hugetlb usage and readjust memory.max. Similar procedure is done to other memory limits (memory.low for e.g). However, this is rather cumbersome and buggy. Furthermore, when there is a delay in memory limits correction, (for e.g when hugetlb usage changes within consecutive runs of the userspace agent), the system could be in an over/underprotected state. This patch rectifies this issue by charging the memcg when the hugetlb folio is utilized, and uncharging when the folio is freed (analogous to the hugetlb controller). Note that we do not charge when the folio is allocated to the hugetlb pool, because at this point it is not owned by any memcg. Some caveats to consider: * This feature is only available on cgroup v2. * There is no hugetlb pool management involved in the memory controller. As stated above, hugetlb folios are only charged towards the memory controller when it is used. Host overcommit management has to consider it when configuring hard limits. * Failure to charge towards the memcg results in SIGBUS. This could happen even if the hugetlb pool still has pages (but the cgroup limit is hit and reclaim attempt fails). * When this feature is enabled, hugetlb pages contribute to memory reclaim protection. low, min limits tuning must take into account hugetlb memory. * Hugetlb pages utilized while this option is not selected will not be tracked by the memory controller (even if cgroup v2 is remounted later on). Link: https://lkml.kernel.org/r/20231006184629.155543-4-nphamcs@gmail.com Signed-off-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Frank van der Linden <fvdl@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Rik van Riel <riel@surriel.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Tejun heo <tj@kernel.org> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-18mm/migrate: remove unused mm argument from do_move_pages_to_nodeGregory Price
This function does not actively use the mm_struct, it can be removed. Link: https://lkml.kernel.org/r/20231003144857.752952-2-gregory.price@memverge.com Signed-off-by: Gregory Price <gregory.price@memverge.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Gregory Price <gregory.price@memverge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-18Merge mm-hotfixes-stable into mm-stable to pick up depended-upon changes.Andrew Morton
2023-10-16mm/filemap: remove hugetlb special casing in filemap.cSidhartha Kumar
Remove special cased hugetlb handling code within the page cache by changing the granularity of ->index to the base page size rather than the huge page size. The motivation of this patch is to reduce complexity within the filemap code while also increasing performance by removing branches that are evaluated on every page cache lookup. To support the change in index, new wrappers for hugetlb page cache interactions are added. These wrappers perform the conversion to a linear index which is now expected by the page cache for huge pages. ========================= PERFORMANCE ====================================== Perf was used to check the performance differences after the patch. Overall the performance is similar to mainline with a very small larger overhead that occurs in __filemap_add_folio() and hugetlb_add_to_page_cache(). This is because of the larger overhead that occurs in xa_load() and xa_store() as the xarray is now using more entries to store hugetlb folios in the page cache. Timing aarch64 2MB Page Size 6.5-rc3 + this patch: [root@sidhakum-ol9-1 hugepages]# time fallocate -l 700GB test.txt real 1m49.568s user 0m0.000s sys 1m49.461s 6.5-rc3: [root]# time fallocate -l 700GB test.txt real 1m47.495s user 0m0.000s sys 1m47.370s 1GB Page Size 6.5-rc3 + this patch: [root@sidhakum-ol9-1 hugepages1G]# time fallocate -l 700GB test.txt real 1m47.024s user 0m0.000s sys 1m46.921s 6.5-rc3: [root@sidhakum-ol9-1 hugepages1G]# time fallocate -l 700GB test.txt real 1m44.551s user 0m0.000s sys 1m44.438s x86 2MB Page Size 6.5-rc3 + this patch: [root@sidhakum-ol9-2 hugepages]# time fallocate -l 100GB test.txt real 0m22.383s user 0m0.000s sys 0m22.255s 6.5-rc3: [opc@sidhakum-ol9-2 hugepages]$ time sudo fallocate -l 100GB /dev/hugepages/test.txt real 0m22.735s user 0m0.038s sys 0m22.567s 1GB Page Size 6.5-rc3 + this patch: [root@sidhakum-ol9-2 hugepages1GB]# time fallocate -l 100GB test.txt real 0m25.786s user 0m0.001s sys 0m25.589s 6.5-rc3: [root@sidhakum-ol9-2 hugepages1G]# time fallocate -l 100GB test.txt real 0m33.454s user 0m0.001s sys 0m33.193s aarch64: workload - fallocate a 700GB file backed by huge pages 6.5-rc3 + this patch: 2MB Page Size: --100.00%--__arm64_sys_fallocate ksys_fallocate vfs_fallocate hugetlbfs_fallocate | |--95.04%--__pi_clear_page | |--3.57%--clear_huge_page | | | |--2.63%--rcu_all_qs | | | --0.91%--__cond_resched | --0.67%--__cond_resched 0.17% 0.00% 0 fallocate [kernel.vmlinux] [k] hugetlb_add_to_page_cache 0.14% 0.10% 11 fallocate [kernel.vmlinux] [k] __filemap_add_folio 6.5-rc3 2MB Page Size: --100.00%--__arm64_sys_fallocate ksys_fallocate vfs_fallocate hugetlbfs_fallocate | |--94.91%--__pi_clear_page | |--4.11%--clear_huge_page | | | |--3.00%--rcu_all_qs | | | --1.10%--__cond_resched | --0.59%--__cond_resched 0.08% 0.01% 1 fallocate [kernel.kallsyms] [k] hugetlb_add_to_page_cache 0.05% 0.03% 3 fallocate [kernel.kallsyms] [k] __filemap_add_folio x86 workload - fallocate a 100GB file backed by huge pages 6.5-rc3 + this patch: 2MB Page Size: hugetlbfs_fallocate | --99.57%--clear_huge_page | --98.47%--clear_page_erms | --0.53%--asm_sysvec_apic_timer_interrupt 0.04% 0.04% 1 fallocate [kernel.kallsyms] [k] xa_load 0.04% 0.00% 0 fallocate [kernel.kallsyms] [k] hugetlb_add_to_page_cache 0.04% 0.00% 0 fallocate [kernel.kallsyms] [k] __filemap_add_folio 0.04% 0.00% 0 fallocate [kernel.kallsyms] [k] xas_store 6.5-rc3 2MB Page Size: --99.93%--__x64_sys_fallocate vfs_fallocate hugetlbfs_fallocate | --99.38%--clear_huge_page | |--98.40%--clear_page_erms | --0.59%--__cond_resched 0.03% 0.03% 1 fallocate [kernel.kallsyms] [k] __filemap_add_folio ========================= TESTING ====================================== This patch passes libhugetlbfs tests and LTP hugetlb tests ********** TEST SUMMARY * 2M * 32-bit 64-bit * Total testcases: 110 113 * Skipped: 0 0 * PASS: 107 113 * FAIL: 0 0 * Killed by signal: 3 0 * Bad configuration: 0 0 * Expected FAIL: 0 0 * Unexpected PASS: 0 0 * Test not present: 0 0 * Strange test result: 0 0 ********** Done executing testcases. LTP Version: 20220527-178-g2761a81c4 page migration was also tested using Mike Kravetz's test program.[8] [dan.carpenter@linaro.org: fix an NULL vs IS_ERR() bug] Link: https://lkml.kernel.org/r/1772c296-1417-486f-8eef-171af2192681@moroto.mountain Link: https://lkml.kernel.org/r/20230926192017.98183-1-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reported-and-tested-by: syzbot+c225dea486da4d5592bd@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=c225dea486da4d5592bd Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-06mm/migrate: fix do_pages_move for compat pointersGregory Price
do_pages_move does not handle compat pointers for the page list. correctly. Add in_compat_syscall check and appropriate get_user fetch when iterating the page list. It makes the syscall in compat mode (32-bit userspace, 64-bit kernel) work the same way as the native 32-bit syscall again, restoring the behavior before my broken commit 5b1b561ba73c ("mm: simplify compat_sys_move_pages"). More specifically, my patch moved the parsing of the 'pages' array from the main entry point into do_pages_stat(), which left the syscall working correctly for the 'stat' operation (nodes = NULL), while the 'move' operation (nodes != NULL) is now missing the conversion and interprets 'pages' as an array of 64-bit pointers instead of the intended 32-bit userspace pointers. It is possible that nobody noticed this bug because the few applications that actually call move_pages are unlikely to run in compat mode because of their large memory requirements, but this clearly fixes a user-visible regression and should have been caught by ltp. Link: https://lkml.kernel.org/r/20231003144857.752952-1-gregory.price@memverge.com Fixes: 5b1b561ba73c ("mm: simplify compat_sys_move_pages") Signed-off-by: Gregory Price <gregory.price@memverge.com> Reported-by: Arnd Bergmann <arnd@arndb.de> Co-developed-by: Arnd Bergmann <arnd@arndb.de> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-04mm: migrate: remove isolated variable in add_page_for_migration()Kefeng Wang
Directly check the return of isolate_hugetlb() and folio_isolate_lru() to remove isolated variable, also setup err = -EBUSY in advance before isolation, and update err only when successfully queued for migration, which could help us to unify and simplify code a bit. Link: https://lkml.kernel.org/r/20230913095131.2426871-9-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-04mm: migrate: remove PageHead() check for HugeTLB in add_page_for_migration()Kefeng Wang
There is some different between hugeTLB and THP behave when passed the address of a tail page, for THP, it will migrate the entire THP page, but for HugeTLB, it will return -EACCES, or -ENOENT before commit e66f17ff7177 ("mm/hugetlb: take page table lock in follow_huge_pmd()"), -EACCES The page is mapped by multiple processes and can be moved only if MPOL_MF_MOVE_ALL is specified. -ENOENT The page is not present. But when check manual[1], both of the two errnos are not suitable, it is better to keep the same behave between hugetlb and THP when passed the address of a tail page, so let's just remove the PageHead() check for HugeTLB. [1] https://man7.org/linux/man-pages/man2/move_pages.2.html Link: https://lkml.kernel.org/r/20230913095131.2426871-8-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Suggested-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Zi Yan <ziy@nvidia.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-04mm: migrate: use a folio in add_page_for_migration()Kefeng Wang
Use a folio in add_page_for_migration() to save compound_head() calls. Link: https://lkml.kernel.org/r/20230913095131.2426871-7-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-04mm: migrate: use __folio_test_movable()Kefeng Wang
Use __folio_test_movable(), no need to convert from folio to page again. Link: https://lkml.kernel.org/r/20230913095131.2426871-6-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-04mm: migrate: convert migrate_misplaced_page() to migrate_misplaced_folio()Kefeng Wang
At present, numa balance only support base page and PMD-mapped THP, but we will expand to support to migrate large folio/pte-mapped THP in the future, it is better to make migrate_misplaced_page() to take a folio instead of a page, and rename it to migrate_misplaced_folio(), it is a preparation, also this remove several compound_head() calls. Link: https://lkml.kernel.org/r/20230913095131.2426871-5-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-04mm: migrate: convert numamigrate_isolate_page() to numamigrate_isolate_folio()Kefeng Wang
Rename numamigrate_isolate_page() to numamigrate_isolate_folio(), then make it takes a folio and use folio API to save compound_head() calls. Link: https://lkml.kernel.org/r/20230913095131.2426871-4-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-04mm: migrate: remove THP mapcount check in numamigrate_isolate_page()Kefeng Wang
The check of THP mapped by multiple processes was introduced by commit 04fa5d6a6547 ("mm: migrate: check page_count of THP before migrating") and refactor by commit 340ef3902cf2 ("mm: numa: cleanup flow of transhuge page migration"), which is out of date, since migrate_misplaced_page() is now using the standard migrate_pages() for small pages and THPs, the reference count checking is in folio_migrate_mapping(), so let's remove the special check for THP. Link: https://lkml.kernel.org/r/20230913095131.2426871-3-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-04mm: migrate: remove PageTransHuge check in numamigrate_isolate_page()Kefeng Wang
Patch series "mm: migrate: more folio conversion and unification", v3. Convert more migrate functions to use a folio, it is also a preparation for large folio migration support when balancing numa. This patch (of 8): The assert VM_BUG_ON_PAGE(order && !PageTransHuge(page), page) is not very useful, 1) for a tail/base page, order = 0, for a head page, the order > 0 && PageTransHuge() is true 2) there is a PageCompound() check and only base page is handled in do_numa_page(), and do_huge_pmd_numa_page() only handle PMD-mapped THP 3) even though the page is a tail page, isolate_lru_page() will post a warning, and fail to isolate the page 4) if large folio/pte-mapped THP migration supported in the future, we could migrate the entire folio if numa fault on a tail page so just remove the check. Link: https://lkml.kernel.org/r/20230913095131.2426871-1-wangkefeng.wang@huawei.com Link: https://lkml.kernel.org/r/20230913095131.2426871-2-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-04mm/rmap: pass folio to hugepage_add_anon_rmap()David Hildenbrand
Let's pass a folio; we are always mapping the entire thing. Link: https://lkml.kernel.org/r/20230913125113.313322-7-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-09-29mm: hugetlb: add huge page size param to set_huge_pte_at()Ryan Roberts
Patch series "Fix set_huge_pte_at() panic on arm64", v2. This series fixes a bug in arm64's implementation of set_huge_pte_at(), which can result in an unprivileged user causing a kernel panic. The problem was triggered when running the new uffd poison mm selftest for HUGETLB memory. This test (and the uffd poison feature) was merged for v6.5-rc7. Ideally, I'd like to get this fix in for v6.6 and I've cc'ed stable (correctly this time) to get it backported to v6.5, where the issue first showed up. Description of Bug ================== arm64's huge pte implementation supports multiple huge page sizes, some of which are implemented in the page table with multiple contiguous entries. So set_huge_pte_at() needs to work out how big the logical pte is, so that it can also work out how many physical ptes (or pmds) need to be written. It previously did this by grabbing the folio out of the pte and querying its size. However, there are cases when the pte being set is actually a swap entry. But this also used to work fine, because for huge ptes, we only ever saw migration entries and hwpoison entries. And both of these types of swap entries have a PFN embedded, so the code would grab that and everything still worked out. But over time, more calls to set_huge_pte_at() have been added that set swap entry types that do not embed a PFN. And this causes the code to go bang. The triggering case is for the uffd poison test, commit 99aa77215ad0 ("selftests/mm: add uffd unit test for UFFDIO_POISON"), which causes a PTE_MARKER_POISONED swap entry to be set, coutesey of commit 8a13897fb0da ("mm: userfaultfd: support UFFDIO_POISON for hugetlbfs") - added in v6.5-rc7. Although review shows that there are other call sites that set PTE_MARKER_UFFD_WP (which also has no PFN), these don't trigger on arm64 because arm64 doesn't support UFFD WP. If CONFIG_DEBUG_VM is enabled, we do at least get a BUG(), but otherwise, it will dereference a bad pointer in page_folio(): static inline struct folio *hugetlb_swap_entry_to_folio(swp_entry_t entry) { VM_BUG_ON(!is_migration_entry(entry) && !is_hwpoison_entry(entry)); return page_folio(pfn_to_page(swp_offset_pfn(entry))); } Fix === The simplest fix would have been to revert the dodgy cleanup commit 18f3962953e4 ("mm: hugetlb: kill set_huge_swap_pte_at()"), but since things have moved on, this would have required an audit of all the new set_huge_pte_at() call sites to see if they should be converted to set_huge_swap_pte_at(). As per the original intent of the change, it would also leave us open to future bugs when people invariably get it wrong and call the wrong helper. So instead, I've added a huge page size parameter to set_huge_pte_at(). This means that the arm64 code has the size in all cases. It's a bigger change, due to needing to touch the arches that implement the function, but it is entirely mechanical, so in my view, low risk. I've compile-tested all touched arches; arm64, parisc, powerpc, riscv, s390, sparc (and additionally x86_64). I've additionally booted and run mm selftests against arm64, where I observe the uffd poison test is fixed, and there are no other regressions. This patch (of 2): In order to fix a bug, arm64 needs to be told the size of the huge page for which the pte is being set in set_huge_pte_at(). Provide for this by adding an `unsigned long sz` parameter to the function. This follows the same pattern as huge_pte_clear(). This commit makes the required interface modifications to the core mm as well as all arches that implement this function (arm64, parisc, powerpc, riscv, s390, sparc). The actual arm64 bug will be fixed in a separate commit. No behavioral changes intended. Link: https://lkml.kernel.org/r/20230922115804.2043771-1-ryan.roberts@arm.com Link: https://lkml.kernel.org/r/20230922115804.2043771-2-ryan.roberts@arm.com Fixes: 8a13897fb0da ("mm: userfaultfd: support UFFDIO_POISON for hugetlbfs") Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> [powerpc 8xx] Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> [vmalloc change] Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: David S. Miller <davem@davemloft.net> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Xu <peterx@redhat.com> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: SeongJae Park <sj@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Cc: <stable@vger.kernel.org> [6.5+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-31Merge tag 'x86_shstk_for_6.6-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 shadow stack support from Dave Hansen: "This is the long awaited x86 shadow stack support, part of Intel's Control-flow Enforcement Technology (CET). CET consists of two related security features: shadow stacks and indirect branch tracking. This series implements just the shadow stack part of this feature, and just for userspace. The main use case for shadow stack is providing protection against return oriented programming attacks. It works by maintaining a secondary (shadow) stack using a special memory type that has protections against modification. When executing a CALL instruction, the processor pushes the return address to both the normal stack and to the special permission shadow stack. Upon RET, the processor pops the shadow stack copy and compares it to the normal stack copy. For more information, refer to the links below for the earlier versions of this patch set" Link: https://lore.kernel.org/lkml/20220130211838.8382-1-rick.p.edgecombe@intel.com/ Link: https://lore.kernel.org/lkml/20230613001108.3040476-1-rick.p.edgecombe@intel.com/ * tag 'x86_shstk_for_6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (47 commits) x86/shstk: Change order of __user in type x86/ibt: Convert IBT selftest to asm x86/shstk: Don't retry vm_munmap() on -EINTR x86/kbuild: Fix Documentation/ reference x86/shstk: Move arch detail comment out of core mm x86/shstk: Add ARCH_SHSTK_STATUS x86/shstk: Add ARCH_SHSTK_UNLOCK x86: Add PTRACE interface for shadow stack selftests/x86: Add shadow stack test x86/cpufeatures: Enable CET CR4 bit for shadow stack x86/shstk: Wire in shadow stack interface x86: Expose thread features in /proc/$PID/status x86/shstk: Support WRSS for userspace x86/shstk: Introduce map_shadow_stack syscall x86/shstk: Check that signal frame is shadow stack mem x86/shstk: Check that SSP is aligned on sigreturn x86/shstk: Handle signals for shadow stack x86/shstk: Introduce routines modifying shstk x86/shstk: Handle thread shadow stack x86/shstk: Add user-mode shadow stack support ...
2023-08-29Merge tag 'for-6.6/block-2023-08-28' of git://git.kernel.dk/linuxLinus Torvalds
Pull block updates from Jens Axboe: "Pretty quiet round for this release. This contains: - Add support for zoned storage to ublk (Andreas, Ming) - Series improving performance for drivers that mark themselves as needing a blocking context for issue (Bart) - Cleanup the flush logic (Chengming) - sed opal keyring support (Greg) - Fixes and improvements to the integrity support (Jinyoung) - Add some exports for bcachefs that we can hopefully delete again in the future (Kent) - deadline throttling fix (Zhiguo) - Series allowing building the kernel without buffer_head support (Christoph) - Sanitize the bio page adding flow (Christoph) - Write back cache fixes (Christoph) - MD updates via Song: - Fix perf regression for raid0 large sequential writes (Jan) - Fix split bio iostat for raid0 (David) - Various raid1 fixes (Heinz, Xueshi) - raid6test build fixes (WANG) - Deprecate bitmap file support (Christoph) - Fix deadlock with md sync thread (Yu) - Refactor md io accounting (Yu) - Various non-urgent fixes (Li, Yu, Jack) - Various fixes and cleanups (Arnd, Azeem, Chengming, Damien, Li, Ming, Nitesh, Ruan, Tejun, Thomas, Xu)" * tag 'for-6.6/block-2023-08-28' of git://git.kernel.dk/linux: (113 commits) block: use strscpy() to instead of strncpy() block: sed-opal: keyring support for SED keys block: sed-opal: Implement IOC_OPAL_REVERT_LSP block: sed-opal: Implement IOC_OPAL_DISCOVERY blk-mq: prealloc tags when increase tagset nr_hw_queues blk-mq: delete redundant tagset map update when fallback blk-mq: fix tags leak when shrink nr_hw_queues ublk: zoned: support REQ_OP_ZONE_RESET_ALL md: raid0: account for split bio in iostat accounting md/raid0: Fix performance regression for large sequential writes md/raid0: Factor out helper for mapping and submitting a bio md raid1: allow writebehind to work on any leg device set WriteMostly md/raid1: hold the barrier until handle_read_error() finishes md/raid1: free the r1bio before waiting for blocked rdev md/raid1: call free_r1bio() before allow_barrier() in raid_end_bio_io() blk-cgroup: Fix NULL deref caused by blkg_policy_data being installed before init drivers/rnbd: restore sysfs interface to rnbd-client md/raid5-cache: fix null-ptr-deref for r5l_flush_stripe_to_raid() raid6: test: only check for Altivec if building on powerpc hosts raid6: test: make sure all intermediate and artifact files are .gitignored ...
2023-08-18migrate: use folio_set_bh() instead of set_bh_page()Matthew Wilcox (Oracle)
This function was converted before folio_set_bh() existed. Catch up to the new API. Link: https://lkml.kernel.org/r/20230713035512.4139457-5-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: David Sterba <dsterba@suse.com> Cc: Konstantin Komarov <almaz.alexandrovich@paragon-software.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Pankaj Raghav <p.raghav@samsung.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Tom Rix <trix@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18mm: merge folio_has_private()/filemap_release_folio() call pairsDavid Howells
Patch series "mm, netfs, fscache: Stop read optimisation when folio removed from pagecache", v7. This fixes an optimisation in fscache whereby we don't read from the cache for a particular file until we know that there's data there that we don't have in the pagecache. The problem is that I'm no longer using PG_fscache (aka PG_private_2) to indicate that the page is cached and so I don't get a notification when a cached page is dropped from the pagecache. The first patch merges some folio_has_private() and filemap_release_folio() pairs and introduces a helper, folio_needs_release(), to indicate if a release is required. The second patch is the actual fix. Following Willy's suggestions[1], it adds an AS_RELEASE_ALWAYS flag to an address_space that will make filemap_release_folio() always call ->release_folio(), even if PG_private/PG_private_2 aren't set. folio_needs_release() is altered to add a check for this. This patch (of 2): Make filemap_release_folio() check folio_has_private(). Then, in most cases, where a call to folio_has_private() is immediately followed by a call to filemap_release_folio(), we can get rid of the test in the pair. There are a couple of sites in mm/vscan.c that this can't so easily be done. In shrink_folio_list(), there are actually three cases (something different is done for incompletely invalidated buffers), but filemap_release_folio() elides two of them. In shrink_active_list(), we don't have have the folio lock yet, so the check allows us to avoid locking the page unnecessarily. A wrapper function to check if a folio needs release is provided for those places that still need to do it in the mm/ directory. This will acquire additional parts to the condition in a future patch. After this, the only remaining caller of folio_has_private() outside of mm/ is a check in fuse. Link: https://lkml.kernel.org/r/20230628104852.3391651-1-dhowells@redhat.com Link: https://lkml.kernel.org/r/20230628104852.3391651-2-dhowells@redhat.com Reported-by: Rohith Surabattula <rohiths.msft@gmail.com> Suggested-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: David Howells <dhowells@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Steve French <sfrench@samba.org> Cc: Shyam Prasad N <nspmangalore@gmail.com> Cc: Rohith Surabattula <rohiths.msft@gmail.com> Cc: Dave Wysochanski <dwysocha@redhat.com> Cc: Dominique Martinet <asmadeus@codewreck.org> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Xiubo Li <xiubli@redhat.com> Cc: Jingbo Xu <jefflexu@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-02fs: add CONFIG_BUFFER_HEADChristoph Hellwig
Add a new config option that controls building the buffer_head code, and select it from all file systems and stacking drivers that need it. For the block device nodes and alternative iomap based buffered I/O path is provided when buffer_head support is not enabled, and iomap needs a a small tweak to define the IOMAP_F_BUFFER_HEAD flag to 0 to not call into the buffer_head code when it doesn't exist. Otherwise this is just Kconfig and ifdef changes. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20230801172201.1923299-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-11mm: Make pte_mkwrite() take a VMARick Edgecombe
The x86 Shadow stack feature includes a new type of memory called shadow stack. This shadow stack memory has some unusual properties, which requires some core mm changes to function properly. One of these unusual properties is that shadow stack memory is writable, but only in limited ways. These limits are applied via a specific PTE bit combination. Nevertheless, the memory is writable, and core mm code will need to apply the writable permissions in the typical paths that call pte_mkwrite(). Future patches will make pte_mkwrite() take a VMA, so that the x86 implementation of it can know whether to create regular writable or shadow stack mappings. But there are a couple of challenges to this. Modifying the signatures of each arch pte_mkwrite() implementation would be error prone because some are generated with macros and would need to be re-implemented. Also, some pte_mkwrite() callers operate on kernel memory without a VMA. So this can be done in a three step process. First pte_mkwrite() can be renamed to pte_mkwrite_novma() in each arch, with a generic pte_mkwrite() added that just calls pte_mkwrite_novma(). Next callers without a VMA can be moved to pte_mkwrite_novma(). And lastly, pte_mkwrite() and all callers can be changed to take/pass a VMA. Previous work pte_mkwrite() renamed pte_mkwrite_novma() and converted callers that don't have a VMA were to use pte_mkwrite_novma(). So now change pte_mkwrite() to take a VMA and change the remaining callers to pass a VMA. Apply the same changes for pmd_mkwrite(). No functional change. Suggested-by: David Hildenbrand <david@redhat.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/all/20230613001108.3040476-4-rick.p.edgecombe%40intel.com
2023-06-23mm: remove unnecessary pagevec includesMatthew Wilcox (Oracle)
These files no longer need pagevec.h, mostly due to function declarations being moved out of it. Link: https://lkml.kernel.org/r/20230621164557.3510324-14-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-23mm: fix shmem THP counters on migrationJan Glauber
The per node numa_stat values for shmem don't change on page migration for THP: grep shmem /sys/fs/cgroup/machine.slice/.../memory.numa_stat: shmem N0=1092616192 N1=10485760 shmem_thp N0=1092616192 N1=10485760 migratepages 9181 0 1: shmem N0=0 N1=1103101952 shmem_thp N0=1092616192 N1=10485760 Fix that by updating shmem_thp counters likewise to shmem counters on page migration. [jglauber@digitalocean.com: use folio_test_pmd_mappable instead of folio_test_transhuge] Link: https://lkml.kernel.org/r/20230622094720.510540-1-jglauber@digitalocean.com Link: https://lkml.kernel.org/r/20230619103351.234837-1-jglauber@digitalocean.com Signed-off-by: Jan Glauber <jglauber@digitalocean.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: "Huang, Ying" <ying.huang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19mm: ptep_get() conversionRyan Roberts
Convert all instances of direct pte_t* dereferencing to instead use ptep_get() helper. This means that by default, the accesses change from a C dereference to a READ_ONCE(). This is technically the correct thing to do since where pgtables are modified by HW (for access/dirty) they are volatile and therefore we should always ensure READ_ONCE() semantics. But more importantly, by always using the helper, it can be overridden by the architecture to fully encapsulate the contents of the pte. Arch code is deliberately not converted, as the arch code knows best. It is intended that arch code (arm64) will override the default with its own implementation that can (e.g.) hide certain bits from the core code, or determine young/dirty status by mixing in state from another source. Conversion was done using Coccinelle: ---- // $ make coccicheck \ // COCCI=ptepget.cocci \ // SPFLAGS="--include-headers" \ // MODE=patch virtual patch @ depends on patch @ pte_t *v; @@ - *v + ptep_get(v) ---- Then reviewed and hand-edited to avoid multiple unnecessary calls to ptep_get(), instead opting to store the result of a single call in a variable, where it is correct to do so. This aims to negate any cost of READ_ONCE() and will benefit arch-overrides that may be more complex. Included is a fix for an issue in an earlier version of this patch that was pointed out by kernel test robot. The issue arose because config MMU=n elides definition of the ptep helper functions, including ptep_get(). HUGETLB_PAGE=n configs still define a simple huge_ptep_clear_flush() for linking purposes, which dereferences the ptep. So when both configs are disabled, this caused a build error because ptep_get() is not defined. Fix by continuing to do a direct dereference when MMU=n. This is safe because for this config the arch code cannot be trying to virtualize the ptes because none of the ptep helpers are defined. Link: https://lkml.kernel.org/r/20230612151545.3317766-4-ryan.roberts@arm.com Reported-by: kernel test robot <lkp@intel.com> Link: https://lore.kernel.org/oe-kbuild-all/202305120142.yXsNEo6H-lkp@intel.com/ Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Potapenko <glider@google.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Dave Airlie <airlied@gmail.com> Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Ian Rogers <irogers@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: SeongJae Park <sj@kernel.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19mm/various: give up if pte_offset_map[_lock]() failsHugh Dickins
Following the examples of nearby code, various functions can just give up if pte_offset_map() or pte_offset_map_lock() fails. And there's no need for a preliminary pmd_trans_unstable() or other such check, since such cases are now safely handled inside. Link: https://lkml.kernel.org/r/7b9bd85d-1652-cbf2-159d-f503b45e5b@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Christoph Hellwig <hch@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: SeongJae Park <sj@kernel.org> Cc: Song Liu <song@kernel.org> Cc: Steven Price <steven.price@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zack Rusin <zackr@vmware.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19mm/migrate: remove cruft from migration_entry_wait()sHugh Dickins
migration_entry_wait_on_locked() does not need to take a mapped pte pointer, its callers can do the unmap first. Annotate it with __releases(ptl) to reduce sparse warnings. Fold __migration_entry_wait_huge() into migration_entry_wait_huge(). Fold __migration_entry_wait() into migration_entry_wait(), preferring the tighter pte_offset_map_lock() to pte_offset_map() and pte_lockptr(). Link: https://lkml.kernel.org/r/b0e2a532-cdf2-561b-e999-f3b13b8d6d3@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Christoph Hellwig <hch@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: SeongJae Park <sj@kernel.org> Cc: Song Liu <song@kernel.org> Cc: Steven Price <steven.price@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zack Rusin <zackr@vmware.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09mm: convert migrate_pages() to work on foliosMatthew Wilcox (Oracle)
Almost all of the callers & implementors of migrate_pages() were already converted to use folios. compaction_alloc() & compaction_free() are trivial to convert a part of this patch and not worth splitting out. Link: https://lkml.kernel.org/r/20230513001101.276972-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09migrate_pages_batch: simplify retrying and failure counting of large foliosHuang Ying
After recent changes to the retrying and failure counting in migrate_pages_batch(), it was found that it's unnecessary to count retrying and failure for normal, large, and THP folios separately. Because we don't use retrying and failure number of large folios directly. So, in this patch, we simplified retrying and failure counting of large folios via counting retrying and failure of normal and large folios together. This results in the reduced line number. Previously, in migrate_pages_batch we need to track whether the source folio is large/THP before splitting. So is_large is used to cache folio_test_large() result. Now, we don't need that variable any more because we don't count retrying and failure of large folios (only counting that of THP folios). So, in this patch, is_large is removed to simplify the code. This is just code cleanup, no functionality changes are expected. Link: https://lkml.kernel.org/r/20230510031829.11513-1-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Xin Hao <xhao@linux.alibaba.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09migrate_pages: avoid blocking for IO in MIGRATE_SYNC_LIGHTDouglas Anderson
The MIGRATE_SYNC_LIGHT mode is intended to block for things that will finish quickly but not for things that will take a long time. Exactly how long is too long is not well defined, but waits of tens of milliseconds is likely non-ideal. When putting a Chromebook under memory pressure (opening over 90 tabs on a 4GB machine) it was fairly easy to see delays waiting for some locks in the kcompactd code path of > 100 ms. While the laptop wasn't amazingly usable in this state, it was still limping along and this state isn't something artificial. Sometimes we simply end up with a lot of memory pressure. Putting the same Chromebook under memory pressure while it was running Android apps (though not stressing them) showed a much worse result (NOTE: this was on a older kernel but the codepaths here are similar). Android apps on ChromeOS currently run from a 128K-block, zlib-compressed, loopback-mounted squashfs disk. If we get a page fault from something backed by the squashfs filesystem we could end up holding a folio lock while reading enough from disk to decompress 128K (and then decompressing it using the somewhat slow zlib algorithms). That reading goes through the ext4 subsystem (because it's a loopback mount) before eventually ending up in the block subsystem. This extra jaunt adds extra overhead. Without much work I could see cases where we ended up blocked on a folio lock for over a second. With more extreme memory pressure I could see up to 25 seconds. We considered adding a timeout in the case of MIGRATE_SYNC_LIGHT for the two locks that were seen to be slow [1] and that generated much discussion. After discussion, it was decided that we should avoid waiting for the two locks during MIGRATE_SYNC_LIGHT if they were being held for IO. We'll continue with the unbounded wait for the more full SYNC modes. With this change, I couldn't see any slow waits on these locks with my previous testcases. NOTE: The reason I stated digging into this originally isn't because some benchmark had gone awry, but because we've received in-the-field crash reports where we have a hung task waiting on the page lock (which is the equivalent code path on old kernels). While the root cause of those crashes is likely unrelated and won't be fixed by this patch, analyzing those crash reports did point out these very long waits seemed like something good to fix. With this patch we should no longer hang waiting on these locks, but presumably the system will still be in a bad shape and hang somewhere else. [1] https://lore.kernel.org/r/20230421151135.v2.1.I2b71e11264c5c214bc59744b9e13e4c353bc5714@changeid Link: https://lkml.kernel.org/r/20230428135414.v3.1.Ia86ccac02a303154a0b8bc60567e7a95d34c96d3@changeid Signed-off-by: Douglas Anderson <dianders@chromium.org> Suggested-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Hillf Danton <hdanton@sina.com> Cc: Gao Xiang <hsiangkao@linux.alibaba.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Gao Xiang <hsiangkao@linux.alibaba.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-28Merge tag 'x86_mm_for_6.4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 LAM (Linear Address Masking) support from Dave Hansen: "Add support for the new Linear Address Masking CPU feature. This is similar to ARM's Top Byte Ignore and allows userspace to store metadata in some bits of pointers without masking it out before use" * tag 'x86_mm_for_6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mm/iommu/sva: Do not allow to set FORCE_TAGGED_SVA bit from outside x86/mm/iommu/sva: Fix error code for LAM enabling failure due to SVA selftests/x86/lam: Add test cases for LAM vs thread creation selftests/x86/lam: Add ARCH_FORCE_TAGGED_SVA test cases for linear-address masking selftests/x86/lam: Add inherit test cases for linear-address masking selftests/x86/lam: Add io_uring test cases for linear-address masking selftests/x86/lam: Add mmap and SYSCALL test cases for linear-address masking selftests/x86/lam: Add malloc and tag-bits test cases for linear-address masking x86/mm/iommu/sva: Make LAM and SVA mutually exclusive iommu/sva: Replace pasid_valid() helper with mm_valid_pasid() mm: Expose untagging mask in /proc/$PID/status x86/mm: Provide arch_prctl() interface for LAM x86/mm: Reduce untagged_addr() overhead for systems without LAM x86/uaccess: Provide untagged_addr() and remove tags before address check mm: Introduce untagged_addr_remote() x86/mm: Handle LAM on context switch x86: CPUID and CR3/CR4 flags for Linear Address Masking x86: Allow atomic MM_CONTEXT flags setting x86/mm: Rework address range check in get_user() and put_user()
2023-04-21mm: don't check VMA write permissions if the PTE/PMD indicates write permissionsDavid Hildenbrand
Staring at the comment "Recheck VMA as permissions can change since migration started" in remove_migration_pte() can result in confusion, because if the source PTE/PMD indicates write permissions, then there should be no need to check VMA write permissions when restoring migration entries or PTE-mapping a PMD. Commit d3cb8bf6081b ("mm: migrate: Close race between migration completion and mprotect") introduced the maybe_mkwrite() handling in remove_migration_pte() in 2014, stating that a race between mprotect() and migration finishing would be possible, and that we could end up with a writable PTE that should be readable. However, mprotect() code first updates vma->vm_flags / vma->vm_page_prot and then walks the page tables to (a) set all present writable PTEs to read-only and (b) convert all writable migration entries to readable migration entries. While walking the page tables and modifying the entries, migration code has to grab the PT locks to synchronize against concurrent page table modifications. Assuming migration would find a writable migration entry (while holding the PT lock) and replace it with a writable present PTE, surely mprotect() code didn't stumble over the writable migration entry yet (converting it into a readable migration entry) and would instead wait for the PT lock to convert the now present writable PTE into a read-only PTE. As mprotect() didn't finish yet, the behavior is just like migration didn't happen: a writable PTE will be converted to a read-only PTE. So it's fine to rely on the writability information in the source PTE/PMD and not recheck against the VMA as long as we're holding the PT lock to synchronize with anyone who concurrently wants to downgrade write permissions (like mprotect()) by first adjusting vma->vm_flags / vma->vm_page_prot to then walk over the page tables to adjust the page table entries. Running test cases that should reveal such races -- mprotect(PROT_READ) racing with page migration or THP splitting -- for multiple hours did not reveal an issue with this cleanup. Link: https://lkml.kernel.org/r/20230418142113.439494-1-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-21migrate_pages_batch: fix statistics for longterm pin retryHuang Ying
In commit fd4a7ac32918 ("mm: migrate: try again if THP split is failed due to page refcnt"), if the THP splitting fails due to page reference count, we will retry to improve migration successful rate. But the failed splitting is counted as migration failure and migration retry, which will cause duplicated failure counting. So, in this patch, this is fixed via undoing the failure counting if we decide to retry. The patch is tested via failure injection. Link: https://lkml.kernel.org/r/20230416235929.1040194-1-ying.huang@intel.com Fixes: fd4a7ac32918 ("mm: migrate: try again if THP split is failed due to page refcnt") Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: David Hildenbrand <david@redhat.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-18mm/migrate: revert "mm/migrate: fix wrongly apply write bit after mkdirty on ↵David Hildenbrand
sparc64" This reverts commit 96a9c287e25d ("mm/migrate: fix wrongly apply write bit after mkdirty on sparc64"). Now that sparc64 mkdirty handling is fixed and no longer sets a PTE/PMD writable that shouldn't be writable, let's revert the temporary fix. The mkdirty mm selftest still passes with this change on sparc64. Note that loongarch handling was fixed in commit bf2f34a506e6 ("LoongArch: Set _PAGE_DIRTY only if _PAGE_WRITE is set in {pmd,pte}_mkdirty()"). Link: https://lkml.kernel.org/r/20230411142512.438404-5-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Hugh Dickins <hughd@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: Sam Ravnborg <sam@ravnborg.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-28mm/migrate: drop pte_mkhuge() in remove_migration_pte()Anshuman Khandual
Since the following commit, arch_make_huge_pte() should be used directly in generic memory subsystem as a platform provided page table helper, instead of pte_mkhuge(). This just drops pte_mkhuge() from remove_migration_pte(), which has now become redundant. 'commit 16785bd77431 ("mm: merge pte_mkhuge() call into arch_make_huge_pte()")' Link: https://lkml.kernel.org/r/20230302025349.358341-1-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Reported-by: Christophe Leroy <christophe.leroy@csgroup.eu> Link: https://lore.kernel.org/all/1ea45095-0926-a56a-a273-816709e9075e@csgroup.eu/ Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-16mm: Introduce untagged_addr_remote()Kirill A. Shutemov
untagged_addr() removes tags/metadata from the address and brings it to the canonical form. The helper is implemented on arm64 and sparc. Both of them do untagging based on global rules. However, Linear Address Masking (LAM) on x86 introduces per-process settings for untagging. As a result, untagged_addr() is now only suitable for untagging addresses for the current proccess. The new helper untagged_addr_remote() has to be used when the address targets remote process. It requires the mmap lock for target mm to be taken. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Alexander Potapenko <glider@google.com> Link: https://lore.kernel.org/all/20230312112612.31869-6-kirill.shutemov%40linux.intel.com
2023-03-07migrate_pages: try migrate in batch asynchronously firstlyHuang Ying
When we have locked more than one folios, we cannot wait the lock or bit (e.g., page lock, buffer head lock, writeback bit) synchronously. Otherwise deadlock may be triggered. This make it hard to batch the synchronous migration directly. This patch re-enables batching synchronous migration via trying to migrate in batch asynchronously firstly. And any folios that are failed to be migrated asynchronously will be migrated synchronously one by one. Test shows that this can restore the TLB flushing batching performance for synchronous migration effectively. Link: https://lkml.kernel.org/r/20230303030155.160983-4-ying.huang@intel.com Fixes: 5dfab109d519 ("migrate_pages: batch _unmap and _move") Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Tested-by: Hugh Dickins <hughd@google.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: "Xu, Pengfei" <pengfei.xu@intel.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Stefan Roesch <shr@devkernel.io> Cc: Tejun Heo <tj@kernel.org> Cc: Xin Hao <xhao@linux.alibaba.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-07migrate_pages: move split folios processing out of migrate_pages_batch()Huang Ying
To simplify the code logic and reduce the line number. Link: https://lkml.kernel.org/r/20230303030155.160983-3-ying.huang@intel.com Fixes: 5dfab109d519 ("migrate_pages: batch _unmap and _move") Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: "Xu, Pengfei" <pengfei.xu@intel.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Stefan Roesch <shr@devkernel.io> Cc: Tejun Heo <tj@kernel.org> Cc: Xin Hao <xhao@linux.alibaba.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-07migrate_pages: fix deadlock in batched migrationHuang Ying
Patch series "migrate_pages: fix deadlock in batched synchronous migration", v2. Two deadlock bugs were reported for the migrate_pages() batching series. Thanks Hugh and Pengfei. Analysis shows that if we have locked some other folios except the one we are migrating, it's not safe in general to wait synchronously, for example, to wait the writeback to complete or wait to lock the buffer head. So 1/3 fixes the deadlock in a simple way, where the batching support for the synchronous migration is disabled. The change is straightforward and easy to be understood. While 3/3 re-introduce the batching for synchronous migration via trying to migrate asynchronously in batch optimistically, then fall back to migrate synchronously one by one for fail-to-migrate folios. Test shows that this can restore the TLB flushing batching performance for synchronous migration effectively. This patch (of 3): Two deadlock bugs were reported for the migrate_pages() batching series. Thanks Hugh and Pengfei! For example, in the following deadlock trace snippet, INFO: task kworker/u4:0:9 blocked for more than 147 seconds. Not tainted 6.2.0-rc4-kvm+ #1314 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:kworker/u4:0 state:D stack:0 pid:9 ppid:2 flags:0x00004000 Workqueue: loop4 loop_rootcg_workfn Call Trace: <TASK> __schedule+0x43b/0xd00 schedule+0x6a/0xf0 io_schedule+0x4a/0x80 folio_wait_bit_common+0x1b5/0x4e0 ? __pfx_wake_page_function+0x10/0x10 __filemap_get_folio+0x73d/0x770 shmem_get_folio_gfp+0x1fd/0xc80 shmem_write_begin+0x91/0x220 generic_perform_write+0x10e/0x2e0 __generic_file_write_iter+0x17e/0x290 ? generic_write_checks+0x12b/0x1a0 generic_file_write_iter+0x97/0x180 ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20 do_iter_readv_writev+0x13c/0x210 ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20 do_iter_write+0xf6/0x330 vfs_iter_write+0x46/0x70 loop_process_work+0x723/0xfe0 loop_rootcg_workfn+0x28/0x40 process_one_work+0x3cc/0x8d0 worker_thread+0x66/0x630 ? __pfx_worker_thread+0x10/0x10 kthread+0x153/0x190 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x29/0x50 </TASK> INFO: task repro:1023 blocked for more than 147 seconds. Not tainted 6.2.0-rc4-kvm+ #1314 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:repro state:D stack:0 pid:1023 ppid:360 flags:0x00004004 Call Trace: <TASK> __schedule+0x43b/0xd00 schedule+0x6a/0xf0 io_schedule+0x4a/0x80 folio_wait_bit_common+0x1b5/0x4e0 ? compaction_alloc+0x77/0x1150 ? __pfx_wake_page_function+0x10/0x10 folio_wait_bit+0x30/0x40 folio_wait_writeback+0x2e/0x1e0 migrate_pages_batch+0x555/0x1ac0 ? __pfx_compaction_alloc+0x10/0x10 ? __pfx_compaction_free+0x10/0x10 ? __this_cpu_preempt_check+0x17/0x20 ? lock_is_held_type+0xe6/0x140 migrate_pages+0x100e/0x1180 ? __pfx_compaction_free+0x10/0x10 ? __pfx_compaction_alloc+0x10/0x10 compact_zone+0xe10/0x1b50 ? lock_is_held_type+0xe6/0x140 ? check_preemption_disabled+0x80/0xf0 compact_node+0xa3/0x100 ? __sanitizer_cov_trace_const_cmp8+0x1c/0x30 ? _find_first_bit+0x7b/0x90 sysctl_compaction_handler+0x5d/0xb0 proc_sys_call_handler+0x29d/0x420 proc_sys_write+0x2b/0x40 vfs_write+0x3a3/0x780 ksys_write+0xb7/0x180 __x64_sys_write+0x26/0x30 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x72/0xdc RIP: 0033:0x7f3a2471f59d RSP: 002b:00007ffe567f7288 EFLAGS: 00000217 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f3a2471f59d RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000005 RBP: 00007ffe567f72a0 R08: 0000000000000010 R09: 0000000000000010 R10: 0000000000000010 R11: 0000000000000217 R12: 00000000004012e0 R13: 00007ffe567f73e0 R14: 0000000000000000 R15: 0000000000000000 </TASK> The page migration task has held the lock of the shmem folio A, and is waiting the writeback of the folio B of the file system on the loop block device to complete. While the loop worker task which writes back the folio B is waiting to lock the shmem folio A, because the folio A backs the folio B in the loop device. Thus deadlock is triggered. In general, if we have locked some other folios except the one we are migrating, it's not safe to wait synchronously, for example, to wait the writeback to complete or wait to lock the buffer head. To fix the deadlock, in this patch, we avoid to batch the page migration except for MIGRATE_ASYNC mode. In MIGRATE_ASYNC mode, synchronous waiting is avoided. The fix can be improved further. We will do that as soon as possible. Link: https://lkml.kernel.org/r/20230303030155.160983-1-ying.huang@intel.com Link: https://lore.kernel.org/linux-mm/87a6c8c-c5c1-67dc-1e32-eb30831d6e3d@google.com/ Link: https://lore.kernel.org/linux-mm/874jrg7kke.fsf@yhuang6-desk2.ccr.corp.intel.com/ Link: https://lore.kernel.org/linux-mm/20230227110614.dngdub2j3exr6dfp@quack3/ Link: https://lkml.kernel.org/r/20230303030155.160983-2-ying.huang@intel.com Fixes: 5dfab109d519 ("migrate_pages: batch _unmap and _move") Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reported-by: Hugh Dickins <hughd@google.com> Reported-by: "Xu, Pengfei" <pengfei.xu@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Stefan Roesch <shr@devkernel.io> Cc: Tejun Heo <tj@kernel.org> Cc: Xin Hao <xhao@linux.alibaba.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-04mm: avoid gcc complaint about pointer castingLinus Torvalds
The migration code ends up temporarily stashing information of the wrong type in unused fields of the newly allocated destination folio. That all works fine, but gcc does complain about the pointer type mis-use: mm/migrate.c: In function ‘__migrate_folio_extract’: mm/migrate.c:1050:20: note: randstruct: casting between randomized structure pointer types (ssa): ‘struct anon_vma’ and ‘struct address_space’ 1050 | *anon_vmap = (void *)dst->mapping; | ~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~ and gcc is actually right to complain since it really doesn't understand that this is a very temporary special case where this is ok. This could be fixed in different ways by just obfuscating the assignment sufficiently that gcc doesn't see what is going on, but the truly "proper C" way to do this is by explicitly using a union. Using unions for type conversions like this is normally hugely ugly and syntactically nasty, but this really is one of the few cases where we want to make it clear that we're not doing type conversion, we're really re-using the value bit-for-bit just using another type. IOW, this should not become a common pattern, but in this one case using that odd union is probably the best way to document to the compiler what is conceptually going on here. [ Side note: there are valid cases where we convert pointers to other pointer types, notably the whole "folio vs page" situation, where the types actually have fundamental commonalities. The fact that the gcc note is limited to just randomized structures means that we don't see equivalent warnings for those cases, but it migth also mean that we miss other cases where we do play these kinds of dodgy games, and this kind of explicit conversion might be a good idea. ] I verified that at least for an allmodconfig build on x86-64, this generates the exact same code, apart from line numbers and assembler comment changes. Fixes: 64c8902ed441 ("migrate_pages: split unmap_and_move() to _unmap() and _move()") Cc: Huang, Ying <ying.huang@intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-02-23Merge tag 'mm-stable-2023-02-20-13-37' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Daniel Verkamp has contributed a memfd series ("mm/memfd: add F_SEAL_EXEC") which permits the setting of the memfd execute bit at memfd creation time, with the option of sealing the state of the X bit. - Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare") which addresses a rare race condition related to PMD unsharing. - Several folioification patch serieses from Matthew Wilcox, Vishal Moola, Sidhartha Kumar and Lorenzo Stoakes - Johannes Weiner has a series ("mm: push down lock_page_memcg()") which does perform some memcg maintenance and cleanup work. - SeongJae Park has added DAMOS filtering to DAMON, with the series "mm/damon/core: implement damos filter". These filters provide users with finer-grained control over DAMOS's actions. SeongJae has also done some DAMON cleanup work. - Kairui Song adds a series ("Clean up and fixes for swap"). - Vernon Yang contributed the series "Clean up and refinement for maple tree". - Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series. It adds to MGLRU an LRU of memcgs, to improve the scalability of global reclaim. - David Hildenbrand has added some userfaultfd cleanup work in the series "mm: uffd-wp + change_protection() cleanups". - Christoph Hellwig has removed the generic_writepages() library function in the series "remove generic_writepages". - Baolin Wang has performed some maintenance on the compaction code in his series "Some small improvements for compaction". - Sidhartha Kumar is doing some maintenance work on struct page in his series "Get rid of tail page fields". - David Hildenbrand contributed some cleanup, bugfixing and generalization of pte management and of pte debugging in his series "mm: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with swap PTEs". - Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation flag in the series "Discard __GFP_ATOMIC". - Sergey Senozhatsky has improved zsmalloc's memory utilization with his series "zsmalloc: make zspage chain size configurable". - Joey Gouly has added prctl() support for prohibiting the creation of writeable+executable mappings. The previous BPF-based approach had shortcomings. See "mm: In-kernel support for memory-deny-write-execute (MDWE)". - Waiman Long did some kmemleak cleanup and bugfixing in the series "mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF". - T.J. Alumbaugh has contributed some MGLRU cleanup work in his series "mm: multi-gen LRU: improve". - Jiaqi Yan has provided some enhancements to our memory error statistics reporting, mainly by presenting the statistics on a per-node basis. See the series "Introduce per NUMA node memory error statistics". - Mel Gorman has a second and hopefully final shot at fixing a CPU-hog regression in compaction via his series "Fix excessive CPU usage during compaction". - Christoph Hellwig does some vmalloc maintenance work in the series "cleanup vfree and vunmap". - Christoph Hellwig has removed block_device_operations.rw_page() in ths series "remove ->rw_page". - We get some maple_tree improvements and cleanups in Liam Howlett's series "VMA tree type safety and remove __vma_adjust()". - Suren Baghdasaryan has done some work on the maintainability of our vm_flags handling in the series "introduce vm_flags modifier functions". - Some pagemap cleanup and generalization work in Mike Rapoport's series "mm, arch: add generic implementation of pfn_valid() for FLATMEM" and "fixups for generic implementation of pfn_valid()" - Baoquan He has done some work to make /proc/vmallocinfo and /proc/kcore better represent the real state of things in his series "mm/vmalloc.c: allow vread() to read out vm_map_ram areas". - Jason Gunthorpe rationalized the GUP system's interface to the rest of the kernel in the series "Simplify the external interface for GUP". - SeongJae Park wishes to migrate people from DAMON's debugfs interface over to its sysfs interface. To support this, we'll temporarily be printing warnings when people use the debugfs interface. See the series "mm/damon: deprecate DAMON debugfs interface". - Andrey Konovalov provided the accurately named "lib/stackdepot: fixes and clean-ups" series. - Huang Ying has provided a dramatic reduction in migration's TLB flush IPI rates with the series "migrate_pages(): batch TLB flushing". - Arnd Bergmann has some objtool fixups in "objtool warning fixes". * tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (505 commits) include/linux/migrate.h: remove unneeded externs mm/memory_hotplug: cleanup return value handing in do_migrate_range() mm/uffd: fix comment in handling pte markers mm: change to return bool for isolate_movable_page() mm: hugetlb: change to return bool for isolate_hugetlb() mm: change to return bool for isolate_lru_page() mm: change to return bool for folio_isolate_lru() objtool: add UACCESS exceptions for __tsan_volatile_read/write kmsan: disable ftrace in kmsan core code kasan: mark addr_has_metadata __always_inline mm: memcontrol: rename memcg_kmem_enabled() sh: initialize max_mapnr m68k/nommu: add missing definition of ARCH_PFN_OFFSET mm: percpu: fix incorrect size in pcpu_obj_full_size() maple_tree: reduce stack usage with gcc-9 and earlier mm: page_alloc: call panic() when memoryless node allocation fails mm: multi-gen LRU: avoid futile retries migrate_pages: move THP/hugetlb migration support check to simplify code migrate_pages: batch flushing TLB migrate_pages: share more code between _unmap and _move ...
2023-02-20mm: change to return bool for isolate_movable_page()Baolin Wang
Now the isolate_movable_page() can only return 0 or -EBUSY, and no users will care about the negative return value, thus we can convert the isolate_movable_page() to return a boolean value to make the code more clear when checking the movable page isolation state. No functional changes intended. [akpm@linux-foundation.org: remove unneeded comment, per Matthew] Link: https://lkml.kernel.org/r/cb877f73f4fff8d309611082ec740a7065b1ade0.1676424378.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-20mm: hugetlb: change to return bool for isolate_hugetlb()Baolin Wang
Now the isolate_hugetlb() only returns 0 or -EBUSY, and most users did not care about the negative value, thus we can convert the isolate_hugetlb() to return a boolean value to make code more clear when checking the hugetlb isolation state. Moreover converts 2 users which will consider the negative value returned by isolate_hugetlb(). No functional changes intended. [akpm@linux-foundation.org: shorten locked section, per SeongJae Park] Link: https://lkml.kernel.org/r/12a287c5bebc13df304387087bbecc6421510849.1676424378.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-20mm: change to return bool for isolate_lru_page()Baolin Wang
The isolate_lru_page() can only return 0 or -EBUSY, and most users did not care about the negative error of isolate_lru_page(), except one user in add_page_for_migration(). So we can convert the isolate_lru_page() to return a boolean value, which can help to make the code more clear when checking the return value of isolate_lru_page(). Also convert all users' logic of checking the isolation state. No functional changes intended. Link: https://lkml.kernel.org/r/3074c1ab628d9dbf139b33f248a8bc253a3f95f0.1676424378.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-17mm/migrate: fix wrongly apply write bit after mkdirty on sparc64Peter Xu
Nick Bowler reported another sparc64 breakage after the young/dirty persistent work for page migration (per "Link:" below). That's after a similar report [2]. It turns out page migration was overlooked, and it wasn't failing before because page migration was not enabled in the initial report test environment. David proposed another way [2] to fix this from sparc64 side, but that patch didn't land somehow. Neither did I check whether there's any other arch that has similar issues. Let's fix it for now as simple as moving the write bit handling to be after dirty, like what we did before. Note: this is based on mm-unstable, because the breakage was since 6.1 and we're at a very late stage of 6.2 (-rc8), so I assume for this specific case we should target this at 6.3. [1] https://lore.kernel.org/all/20221021160603.GA23307@u164.east.ru/ [2] https://lore.kernel.org/all/20221212130213.136267-1-david@redhat.com/ Link: https://lkml.kernel.org/r/20230216153059.256739-1-peterx@redhat.com Fixes: 2e3468778dbe ("mm: remember young/dirty bit for page migrations") Link: https://lore.kernel.org/all/CADyTPExpEqaJiMGoV+Z6xVgL50ZoMJg49B10LcZ=8eg19u34BA@mail.gmail.com/ Signed-off-by: Peter Xu <peterx@redhat.com> Reported-by: Nick Bowler <nbowler@draconx.ca> Acked-by: David Hildenbrand <david@redhat.com> Tested-by: Nick Bowler <nbowler@draconx.ca> Cc: <regressions@lists.linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-16migrate_pages: move THP/hugetlb migration support check to simplify codeHuang Ying
This is a code cleanup patch, no functionality change is expected. After the change, the line number reduces especially in the long migrate_pages_batch(). Link: https://lkml.kernel.org/r/20230213123444.155149-10-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Suggested-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Bharata B Rao <bharata@amd.com> Cc: Xin Hao <xhao@linux.alibaba.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>