summaryrefslogtreecommitdiff
path: root/include/linux
AgeCommit message (Collapse)Author
2024-07-03kmsan: allow disabling KMSAN checks for the current taskIlya Leoshkevich
Like for KASAN, it's useful to temporarily disable KMSAN checks around, e.g., redzone accesses. Introduce kmsan_disable_current() and kmsan_enable_current(), which are similar to their KASAN counterparts. Make them reentrant in order to handle memory allocations in interrupt context. Repurpose the allow_reporting field for this. Link: https://lkml.kernel.org/r/20240621113706.315500-12-iii@linux.ibm.com Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com> Reviewed-by: Alexander Potapenko <glider@google.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: <kasan-dev@googlegroups.com> Cc: Marco Elver <elver@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03kmsan: expose kmsan_get_metadata()Ilya Leoshkevich
Each s390 CPU has lowcore pages associated with it. Each CPU sees its own lowcore at virtual address 0 through a hardware mechanism called prefixing. Additionally, all lowcores are mapped to non-0 virtual addresses stored in the lowcore_ptr[] array. When lowcore is accessed through virtual address 0, one needs to resolve metadata for lowcore_ptr[raw_smp_processor_id()]. Expose kmsan_get_metadata() to make it possible to do this from the arch code. Link: https://lkml.kernel.org/r/20240621113706.315500-10-iii@linux.ibm.com Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com> Reviewed-by: Alexander Potapenko <glider@google.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: <kasan-dev@googlegroups.com> Cc: Marco Elver <elver@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm: read page_type using READ_ONCEDavid Hildenbrand
KCSAN complains about possible data races: while we check for a page_type -- for example for sanity checks -- we might concurrently modify the mapcount that overlays page_type. Let's use READ_ONCE to avoid load tearing (shouldn't make a difference) and to make KCSAN happy. Likely, we might also want to use WRITE_ONCE for the writer side of page_type, if KCSAN ever complains about that. But we'll not mess with that for now. Note: nothing should really be broken besides wrong KCSAN complaints. The sanity check that triggers this was added in commit 68f0320824fa ("mm/rmap: convert folio_add_file_rmap_range() into folio_add_file_rmap_[pte|ptes|pmd]()"). Even before that similar races likely where possible, ever since we added page_type in commit 6e292b9be7f4 ("mm: split page_type out from _mapcount"). Link: https://lkml.kernel.org/r/20240531125616.2850153-1-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202405281431.c46a3be9-lkp@intel.com Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm: memory: convert clear_huge_page() to folio_zero_user()Kefeng Wang
Patch series "mm: improve clear and copy user folio", v2. Some folio conversions. An improvement is to move address alignment into the caller as it is only needed if we don't know which address will be accessed when clearing/copying user folios. This patch (of 4): Replace clear_huge_page() with folio_zero_user(), and take a folio instead of a page. Directly get number of pages by folio_nr_pages() to remove pages_per_huge_page argument, furthermore, move the address alignment from folio_zero_user() to the callers since the alignment is only needed when we don't know which address will be accessed. Link: https://lkml.kernel.org/r/20240618091242.2140164-1-wangkefeng.wang@huawei.com Link: https://lkml.kernel.org/r/20240618091242.2140164-2-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm: extend rmap flags arguments for folio_add_new_anon_rmapBarry Song
Patch series "mm: clarify folio_add_new_anon_rmap() and __folio_add_anon_rmap()", v2. This patchset is preparatory work for mTHP swapin. folio_add_new_anon_rmap() assumes that new anon rmaps are always exclusive. However, this assumption doesn’t hold true for cases like do_swap_page(), where a new anon might be added to the swapcache and is not necessarily exclusive. The patchset extends the rmap flags to allow folio_add_new_anon_rmap() to handle both exclusive and non-exclusive new anon folios. The do_swap_page() function is updated to use this extended API with rmap flags. Consequently, all new anon folios now consistently use folio_add_new_anon_rmap(). The special case for !folio_test_anon() in __folio_add_anon_rmap() can be safely removed. In conclusion, new anon folios always use folio_add_new_anon_rmap(), regardless of exclusivity. Old anon folios continue to use __folio_add_anon_rmap() via folio_add_anon_rmap_pmd() and folio_add_anon_rmap_ptes(). This patch (of 3): In the case of a swap-in, a new anonymous folio is not necessarily exclusive. This patch updates the rmap flags to allow a new anonymous folio to be treated as either exclusive or non-exclusive. To maintain the existing behavior, we always use EXCLUSIVE as the default setting. [akpm@linux-foundation.org: cleanup and constifications per David and akpm] [v-songbaohua@oppo.com: fix missing doc for flags of folio_add_new_anon_rmap()] Link: https://lkml.kernel.org/r/20240619210641.62542-1-21cnbao@gmail.com [v-songbaohua@oppo.com: enhance doc for extend rmap flags arguments for folio_add_new_anon_rmap] Link: https://lkml.kernel.org/r/20240622030256.43775-1-21cnbao@gmail.com Link: https://lkml.kernel.org/r/20240617231137.80726-1-21cnbao@gmail.com Link: https://lkml.kernel.org/r/20240617231137.80726-2-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Suggested-by: David Hildenbrand <david@redhat.com> Tested-by: Shuai Yuan <yuanshuai@oppo.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Chris Li <chrisl@kernel.org> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm/memory_hotplug: skip adjust_managed_page_count() for PageOffline() pages ↵David Hildenbrand
when offlining We currently have a hack for virtio-mem in place to handle memory offlining with PageOffline pages for which we already adjusted the managed page count. Let's enlighten memory offlining code so we can get rid of that hack, and document the situation. Link: https://lkml.kernel.org/r/20240607090939.89524-4-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Oscar Salvador <osalvador@suse.de> Cc: Alexander Potapenko <glider@google.com> Cc: Dexuan Cui <decui@microsoft.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Eugenio Pérez <eperezma@redhat.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Marco Elver <elver@google.com> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Wei Liu <wei.liu@kernel.org> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm/memory_hotplug: initialize memmap of !ZONE_DEVICE with PageOffline() ↵David Hildenbrand
instead of PageReserved() We currently initialize the memmap such that PG_reserved is set and the refcount of the page is 1. In virtio-mem code, we have to manually clear that PG_reserved flag to make memory offlining with partially hotplugged memory blocks possible: has_unmovable_pages() would otherwise bail out on such pages. We want to avoid PG_reserved where possible and move to typed pages instead. Further, we want to further enlighten memory offlining code about PG_offline: offline pages in an online memory section. One example is handling managed page count adjustments in a cleaner way during memory offlining. So let's initialize the pages with PG_offline instead of PG_reserved. generic_online_page()->__free_pages_core() will now clear that flag before handing that memory to the buddy. Note that the page refcount is still 1 and would forbid offlining of such memory except when special care is take during GOING_OFFLINE as currently only implemented by virtio-mem. With this change, we can now get non-PageReserved() pages in the XEN balloon list. From what I can tell, that can already happen via decrease_reservation(), so that should be fine. HV-balloon should not really observe a change: partial online memory blocks still cannot get surprise-offlined, because the refcount of these PageOffline() pages is 1. Update virtio-mem, HV-balloon and XEN-balloon code to be aware that hotplugged pages are now PageOffline() instead of PageReserved() before they are handed over to the buddy. We'll leave the ZONE_DEVICE case alone for now. Note that self-hosted vmemmap pages will no longer be marked as reserved. This matches ordinary vmemmap pages allocated from the buddy during memory hotplug. Now, really only vmemmap pages allocated from memblock during early boot will be marked reserved. Existing PageReserved() checks seem to be handling all relevant cases correctly even after this change. Link: https://lkml.kernel.org/r/20240607090939.89524-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Oscar Salvador <osalvador@suse.de> [generic memory-hotplug bits] Cc: Alexander Potapenko <glider@google.com> Cc: Dexuan Cui <decui@microsoft.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Eugenio Pérez <eperezma@redhat.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Marco Elver <elver@google.com> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Wei Liu <wei.liu@kernel.org> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm: remove page_mkclean()Kefeng Wang
There are no more users of page_mkclean(), remove it and update the document and comment. Link: https://lkml.kernel.org/r/20240604114822.2089819-5-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Helge Deller <deller@gmx.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm: remove page_maybe_dma_pinned()Kefeng Wang
After the last user of page_maybe_dma_pinned() is converted to folio_maybe_dma_pinned(), remove page_maybe_dma_pinned() and update the document and comment. [wangkefeng.wang@huawei.com: fix pin_user_pages.rst underlining] Link: https://lkml.kernel.org/r/61b256c7-4989-44ec-83db-f34a1bd4be2d@huawei.com Link: https://lkml.kernel.org/r/20240604114822.2089819-3-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Helge Deller <deller@gmx.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm/mm_init: initialize page->_mapcount directly in __init_single_page()David Hildenbrand
Let's simply reinitialize the page->_mapcount directly. We can now get rid of page_mapcount_reset(). Link: https://lkml.kernel.org/r/20240529111904.2069608-7-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Tested-by: Sergey Senozhatsky <senozhatsky@chromium.org> [zram/zsmalloc workloads] Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm/zsmalloc: use a proper page typeDavid Hildenbrand
Let's clean it up: use a proper page type and store our data (offset into a page) in the lower 16 bit as documented. We won't be able to support 256 KiB base pages, which is acceptable. Teach Kconfig to handle that cleanly using a new CONFIG_HAVE_ZSMALLOC. Based on this, we should do a proper "struct zsdesc" conversion, as proposed in [1]. This removes the last _mapcount/page_type offender. [1] https://lore.kernel.org/all/20231130101242.2590384-1-42.hyeyoo@gmail.com/ Link: https://lkml.kernel.org/r/20240529111904.2069608-4-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Tested-by: Sergey Senozhatsky <senozhatsky@chromium.org> [zram/zsmalloc workloads] Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm: allow reuse of the lower 16 bit of the page type with an actual typeDavid Hildenbrand
As long as the owner sets a page type first, we can allow reuse of the lower 16 bit: sufficient to store an offset into a 64 KiB page, which is the maximum base page size in *common* configurations (ignoring the 256 KiB variant). Restrict it to the head page. We'll use that for zsmalloc next, to set a proper type while still reusing that field to store information (offset into a base page) that cannot go elsewhere for now. Let's reserve the lower 16 bit for that purpose and for catching mapcount underflows, and let's reduce PAGE_TYPE_BASE to a single bit. Note that we will still have to overflow the mapcount quite a lot until we would actually indicate a valid page type. Start handing out the type bits from highest to lowest, to make it clearer how many bits for types we have left. Out of 15 bit we can use for types, we currently use 6. If we run out of bits before we have better typing (e.g., memdesc), we can always investigate storing a value instead [1]. [1] https://lore.kernel.org/all/00ba1dff-7c05-46e8-b0d9-a78ac1cfc198@redhat.com/ [akpm@linux-foundation.org: fix PG_hugetlb typo, per David] Link: https://lkml.kernel.org/r/20240529111904.2069608-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Tested-by: Sergey Senozhatsky <senozhatsky@chromium.org> [zram/zsmalloc workloads] Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm: update _mapcount and page_type documentationDavid Hildenbrand
Patch series "mm: page_type, zsmalloc and page_mapcount_reset()", v2. Wanting to remove the remaining abuser of _mapcount/page_type along with page_mapcount_reset(), I stumbled over zsmalloc, which is yet to be converted away from "struct page" [1]. Unfortunately, we cannot stop using the page_type field in zsmalloc code completely for its own purposes. All other fields in "struct page" are used one way or the other. Could we simply store a 2-byte offset value at the beginning of each page? Likely, but that will require a bit more work; and once we have memdesc we might want to move the offset in there (struct zsalloc?) again. ... but we can limit the abuse to 16 bit, glue it to a page type that must be set, and document it. page_has_type() will always successfully indicate such zsmalloc pages, and such zsmalloc pages only. We lose zsmalloc support for PAGE_SIZE > 64KB, which should be tolerable. We could use more bits from the page type, but 16 bit sounds like a good idea for now. So clarify the _mapcount/page_type documentation, use a proper page_type for zsmalloc, and remove page_mapcount_reset(). [1] https://lore.kernel.org/all/20231130101242.2590384-1-42.hyeyoo@gmail.com/ This patch (of 6): Let's make it clearer that _mapcount must no longer be used for own purposes, and how _mapcount and page_type behaves nowadays (also in the context of hugetlb folios, which are typed folios that will be mapped to user space). Move the documentation regarding "-1" over from page_mapcount_reset(), which we will remove next. Move "page_type" before "mapcount", to make it clearer what typed folios are. Link: https://lkml.kernel.org/r/20240529111904.2069608-1-david@redhat.com Link: https://lkml.kernel.org/r/20240529111904.2069608-2-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Tested-by: Sergey Senozhatsky <senozhatsky@chromium.org> [zram/zsmalloc workloads] Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm/damon/core: implement DAMON context commit functionSeongJae Park
Implement functions for supporting online DAMON context level parameters update. The function receives two DAMON context structs. One is the struct that currently being used by a kdamond and therefore to be updated. The other one contains the parameters to be applied to the first one. The function applies the new parameters to the destination struct while keeping/updating the internal status and operation results. The function should be called from DAMON context-update-safe place, like DAMON callbacks. Link: https://lkml.kernel.org/r/20240618181809.82078-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm/damon/core: implement DAMOS quota goals online commit functionSeongJae Park
Patch series "mm/damon: introduce DAMON parameters online commit function". DAMON context struct (damon_ctx) contains user requests (parameters), internal status, and operation results. For flexible usages, DAMON API users are encouraged to manually manipulate the struct. That works well for simple use cases. However, it has turned out that it is not that simple at least for online parameters udpate. It is easy to forget properly maintaining internal status and operation results. Also, such manual manipulation for online tuning is implemented multiple times on DAMON API users including DAMON sysfs interface, DAMON_RECLAIM and DAMON_LRU_SORT. As a result, we have multiple sources of bugs for same problem. Actually we found and fixed a few bugs from online parameter updating of DAMON API users. Implement a function for online DAMON parameters update in core layer, and replace DAMON API users' manual manipulation code for the use case. The core layer function could still have bugs, but this change reduces the source of bugs for the problem to one place. This patch (of 12): Implement functions for supporting online DAMOS quota goals parameters update. The function receives two DAMOS quota structs. One is the struct that currently being used by a kdamond and therefore to be updated. The other one contains the parameters to be applied to the first one. The function applies the new parameters to the destination struct while keeping/updating the internal status. The function should be called from parameters-update safe place, like DAMON callbacks. Link: https://lkml.kernel.org/r/20240618181809.82078-1-sj@kernel.org Link: https://lkml.kernel.org/r/20240618181809.82078-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm/damon/paddr: introduce DAMOS_MIGRATE_HOT action for promotionHyeongtak Ji
This patch introduces DAMOS_MIGRATE_HOT action, which is similar to DAMOS_MIGRATE_COLD, but proritizes hot pages. It migrates pages inside the given region to the 'target_nid' NUMA node in the sysfs. Here is one of the example usage of this 'migrate_hot' action. $ cd /sys/kernel/mm/damon/admin/kdamonds/<N> $ cat contexts/<N>/schemes/<N>/action migrate_hot $ echo 0 > contexts/<N>/schemes/<N>/target_nid $ echo commit > state $ numactl -p 2 ./hot_cold 500M 600M & $ numastat -c -p hot_cold Per-node process memory usage (in MBs) PID Node 0 Node 1 Node 2 Total -------------- ------ ------ ------ ----- 701 (hot_cold) 501 0 601 1101 Link: https://lkml.kernel.org/r/20240614030010.751-7-honggyu.kim@sk.com Signed-off-by: Hyeongtak Ji <hyeongtak.ji@sk.com> Signed-off-by: Honggyu Kim <honggyu.kim@sk.com> Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Gregory Price <gregory.price@memverge.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm/damon/paddr: introduce DAMOS_MIGRATE_COLD action for demotionHonggyu Kim
This patch introduces DAMOS_MIGRATE_COLD action, which is similar to DAMOS_PAGEOUT, but migrate folios to the given 'target_nid' in the sysfs instead of swapping them out. The 'target_nid' sysfs knob informs the migration target node ID. Here is one of the example usage of this 'migrate_cold' action. $ cd /sys/kernel/mm/damon/admin/kdamonds/<N> $ cat contexts/<N>/schemes/<N>/action migrate_cold $ echo 2 > contexts/<N>/schemes/<N>/target_nid $ echo commit > state $ numactl -p 0 ./hot_cold 500M 600M & $ numastat -c -p hot_cold Per-node process memory usage (in MBs) PID Node 0 Node 1 Node 2 Total -------------- ------ ------ ------ ----- 701 (hot_cold) 501 0 601 1101 Since there are some common routines with pageout, many functions have similar logics between pageout and migrate cold. damon_pa_migrate_folio_list() is a minimized version of shrink_folio_list(). Link: https://lkml.kernel.org/r/20240614030010.751-6-honggyu.kim@sk.com Signed-off-by: Honggyu Kim <honggyu.kim@sk.com> Signed-off-by: Hyeongtak Ji <hyeongtak.ji@sk.com> Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Gregory Price <gregory.price@memverge.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm/migrate: add MR_DAMON to migrate_reasonHonggyu Kim
The current patch series introduces DAMON based migration across NUMA nodes so it'd be better to have a new migrate_reason in trace events. Link: https://lkml.kernel.org/r/20240614030010.751-5-honggyu.kim@sk.com Signed-off-by: Honggyu Kim <honggyu.kim@sk.com> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Gregory Price <gregory.price@memverge.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Hyeongtak Ji <hyeongtak.ji@sk.com> Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm/damon/sysfs-schemes: add target_nid on sysfs-schemesHyeongtak Ji
This patch adds target_nid under /sys/kernel/mm/damon/admin/kdamonds/<N>/contexts/<N>/schemes/<N>/ The 'target_nid' can be used as the destination node for DAMOS actions such as DAMOS_MIGRATE_{HOT,COLD} in the follow up patches. [sj@kernel.org: document target_nid file] Link: https://lkml.kernel.org/r/20240618213630.84846-3-sj@kernel.org Link: https://lkml.kernel.org/r/20240614030010.751-4-honggyu.kim@sk.com Signed-off-by: Hyeongtak Ji <hyeongtak.ji@sk.com> Signed-off-by: Honggyu Kim <honggyu.kim@sk.com> Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Gregory Price <gregory.price@memverge.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm/memory-failure: move some function declarations into internal.hMiaohe Lin
There are some functions only used inside mm. Move them into internal.h. No functional change intended. Link: https://lkml.kernel.org/r/20240612071835.157004-11-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202405251049.hxjwX7zO-lkp@intel.com/ Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: David Hildenbrand <david@redhat.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm/memory-failure: remove MF_MSG_SLABMiaohe Lin
Since commit 46df8e73a4a3 ("mm: free up PG_slab"), MF_MSG_SLAB becomes unused. Remove it. No functional change intended. Link: https://lkml.kernel.org/r/20240612071835.157004-3-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: David Hildenbrand <david@redhat.com> Cc: kernel test robot <lkp@intel.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm/hugetlb_cgroup: switch to the new cftypesXiu Jianfeng
The previous patch has already reconstructed the cftype attributes based on the templates and saved them in dfl_cftypes and legacy_cftypes. then remove the old procedure and switch to the new cftypes. Link: https://lkml.kernel.org/r/20240612092409.2027592-4-xiujianfeng@huawei.com Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm/hugetlb_cgroup: prepare cftypes based on templateXiu Jianfeng
Unlike other cgroup subsystems, the hugetlb cgroup does not provide a static array of cftype that explicitly displays the properties, handling functions, etc., of each file. Instead, it dynamically creates the attribute of cftypes based on the hstate during the startup procedure. This reduces the readability of the code. To fix this issue, introduce two templates of cftypes, and rebuild the attributes according to the hstate to make it ready to be added to cgroup framework. Link: https://lkml.kernel.org/r/20240612092409.2027592-3-xiujianfeng@huawei.com Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: kernel test robot <oliver.sang@intel.com> From: Xiu Jianfeng <xiujianfeng@huawei.com> Subject: mm/hugetlb_cgroup: register lockdep key for cftype Date: Tue, 18 Jun 2024 07:19:22 +0000 When CONFIG_DEBUG_LOCK_ALLOC is enabled, the following commands can trigger a bug, mount -t cgroup2 none /sys/fs/cgroup cd /sys/fs/cgroup echo "+hugetlb" > cgroup.subtree_control The log is as below: BUG: key ffff8880046d88d8 has not been registered! ------------[ cut here ]------------ DEBUG_LOCKS_WARN_ON(1) WARNING: CPU: 3 PID: 226 at kernel/locking/lockdep.c:4945 lockdep_init_map_type+0x185/0x220 Modules linked in: CPU: 3 PID: 226 Comm: bash Not tainted 6.10.0-rc4-next-20240617-g76db4c64526c #544 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014 RIP: 0010:lockdep_init_map_type+0x185/0x220 Code: 00 85 c0 0f 84 6c ff ff ff 8b 3d 6a d1 85 01 85 ff 0f 85 5e ff ff ff 48 c7 c6 21 99 4a 82 48 c7 c7 60 29 49 82 e8 3b 2e f5 RSP: 0018:ffffc9000083fc30 EFLAGS: 00000282 RAX: 0000000000000000 RBX: ffffffff828dd820 RCX: 0000000000000027 RDX: ffff88803cd9cac8 RSI: 0000000000000001 RDI: ffff88803cd9cac0 RBP: ffff88800674fbb0 R08: ffffffff828ce248 R09: 00000000ffffefff R10: ffffffff8285e260 R11: ffffffff828b8eb8 R12: ffff8880046d88d8 R13: 0000000000000000 R14: 0000000000000000 R15: ffff8880067281c0 FS: 00007f68601ea740(0000) GS:ffff88803cd80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00005614f3ebc740 CR3: 000000000773a000 CR4: 00000000000006f0 Call Trace: <TASK> ? __warn+0x77/0xd0 ? lockdep_init_map_type+0x185/0x220 ? report_bug+0x189/0x1a0 ? handle_bug+0x3c/0x70 ? exc_invalid_op+0x18/0x70 ? asm_exc_invalid_op+0x1a/0x20 ? lockdep_init_map_type+0x185/0x220 __kernfs_create_file+0x79/0x100 cgroup_addrm_files+0x163/0x380 ? find_held_lock+0x2b/0x80 ? find_held_lock+0x2b/0x80 ? find_held_lock+0x2b/0x80 css_populate_dir+0x73/0x180 cgroup_apply_control_enable+0x12f/0x3a0 cgroup_subtree_control_write+0x30b/0x440 kernfs_fop_write_iter+0x13a/0x1f0 vfs_write+0x341/0x450 ksys_write+0x64/0xe0 do_syscall_64+0x4b/0x110 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7f68602d9833 Code: 8b 15 61 26 0e 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 64 8b 04 25 18 00 00 00 85 c0 75 14 b8 01 00 00 00 08 RSP: 002b:00007fff9bbdf8e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 0000000000000009 RCX: 00007f68602d9833 RDX: 0000000000000009 RSI: 00005614f3ebc740 RDI: 0000000000000001 RBP: 00005614f3ebc740 R08: 000000000000000a R09: 0000000000000008 R10: 00005614f3db6ba0 R11: 0000000000000246 R12: 0000000000000009 R13: 00007f68603bd6a0 R14: 0000000000000009 R15: 00007f68603b8880 For lockdep, there is a sanity check in lockdep_init_map_type(), the lock-class key must either have been allocated statically or must have been registered as a dynamic key. However the commit e18df2889ff9 ("mm/hugetlb_cgroup: prepare cftypes based on template") has changed the cftypes from static allocated objects to dynamic allocated objects, so the cft->lockdep_key must be registered proactively. [xiujianfeng@huawei.com: fix BUG()] Link: https://lkml.kernel.org/r/20240619015527.2212698-1-xiujianfeng@huawei.com Link: https://lkml.kernel.org/r/20240618071922.2127289-1-xiujianfeng@huawei.com Link: https://lore.kernel.org/all/602186b3-5ce3-41b3-90a3-134792cc2a48@samsung.com/ Fixes: e18df2889ff9 ("mm/hugetlb_cgroup: prepare cftypes based on template") Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com> Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202406181046.8d8b2492-oliver.sang@intel.com Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> Tested-by: SeongJae Park <sj@kernel.org> Closes: https://lore.kernel.org/20240618233608.400367-1-sj@kernel.org Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm: report per-page metadata informationSourav Panda
Today, we do not have any observability of per-page metadata and how much it takes away from the machine capacity. Thus, we want to describe the amount of memory that is going towards per-page metadata, which can vary depending on build configuration, machine architecture, and system use. This patch adds 2 fields to /proc/vmstat that can used as shown below: Accounting per-page metadata allocated by boot-allocator: /proc/vmstat:nr_memmap_boot * PAGE_SIZE Accounting per-page metadata allocated by buddy-allocator: /proc/vmstat:nr_memmap * PAGE_SIZE Accounting total Perpage metadata allocated on the machine: (/proc/vmstat:nr_memmap_boot + /proc/vmstat:nr_memmap) * PAGE_SIZE Utility for userspace: Observability: Describe the amount of memory overhead that is going to per-page metadata on the system at any given time since this overhead is not currently observable. Debugging: Tracking the changes or absolute value in struct pages can help detect anomalies as they can be correlated with other metrics in the machine (e.g., memtotal, number of huge pages, etc). page_ext overheads: Some kernel features such as page_owner page_table_check that use page_ext can be optionally enabled via kernel parameters. Having the total per-page metadata information helps users precisely measure impact. Furthermore, page-metadata metrics will reflect the amount of struct pages reliquished (or overhead reduced) when hugetlbfs pages are reserved which will vary depending on whether hugetlb vmemmap optimization is enabled or not. For background and results see: lore.kernel.org/all/20240220214558.3377482-1-souravpanda@google.com Link: https://lkml.kernel.org/r/20240605222751.1406125-1-souravpanda@google.com Signed-off-by: Sourav Panda <souravpanda@google.com> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Chen Linxuan <chenlinxuan@uniontech.com> Cc: David Hildenbrand <david@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Ivan Babrou <ivan@cloudflare.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tomas Mudrunka <tomas.mudrunka@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Xu <weixugc@google.com> Cc: Yang Yang <yang.yang29@zte.com.cn> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm: zswap: add zswap_never_enabled()Yosry Ahmed
Add zswap_never_enabled() to skip the xarray lookup in zswap_load() if zswap was never enabled on the system. It is implemented using static branches for efficiency, as enabling zswap should be a rare event. This could shave some cycles off zswap_load() when CONFIG_ZSWAP is used but zswap is never enabled. However, the real motivation behind this patch is two-fold: - Incoming large folio swapin work will need to fallback to order-0 folios if zswap was ever enabled, because any part of the folio could be in zswap, until proper handling of large folios with zswap is added. - A warning and recovery attempt will be added in a following change in case the above was not done incorrectly. Zswap will fail the read if the folio is large and it was ever enabled. Expose zswap_never_enabled() in the header for the swapin work to use it later. [yosryahmed@google.com: expose zswap_never_enabled() in the header] Link: https://lkml.kernel.org/r/Zmjf0Dr8s9xSW41X@google.com Link: https://lkml.kernel.org/r/20240611024516.1375191-2-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Cc: Barry Song <baohua@kernel.org> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm: zswap: rename is_zswap_enabled() to zswap_is_enabled()Yosry Ahmed
In preparation for introducing a similar function, rename is_zswap_enabled() to use zswap_* prefix like other zswap functions. Link: https://lkml.kernel.org/r/20240611024516.1375191-1-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Barry Song <baohua@kernel.org> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm/vmscan: avoid split lazyfree THP during shrink_folio_list()Lance Yang
When the user no longer requires the pages, they would use madvise(MADV_FREE) to mark the pages as lazy free. Subsequently, they typically would not re-write to that memory again. During memory reclaim, if we detect that the large folio and its PMD are both still marked as clean and there are no unexpected references (such as GUP), so we can just discard the memory lazily, improving the efficiency of memory reclamation in this case. On an Intel i5 CPU, reclaiming 1GiB of lazyfree THPs using mem_cgroup_force_empty() results in the following runtimes in seconds (shorter is better): -------------------------------------------- | Old | New | Change | -------------------------------------------- | 0.683426 | 0.049197 | -92.80% | -------------------------------------------- [ioworker0@gmail.com: minor changes per David] Link: https://lkml.kernel.org/r/20240622100057.3352-1-ioworker0@gmail.com Link: https://lkml.kernel.org/r/20240614015138.31461-4-ioworker0@gmail.com Signed-off-by: Lance Yang <ioworker0@gmail.com> Suggested-by: Zi Yan <ziy@nvidia.com> Suggested-by: David Hildenbrand <david@redhat.com> Cc: Bang Li <libang.li@antgroup.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Fangrui Song <maskray@google.com> Cc: Jeff Xie <xiehuan09@gmail.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: SeongJae Park <sj@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Yin Fengwei <fengwei.yin@intel.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm/rmap: integrate PMD-mapped folio splitting into pagewalk loopLance Yang
In preparation for supporting try_to_unmap_one() to unmap PMD-mapped folios, start the pagewalk first, then call split_huge_pmd_address() to split the folio. Link: https://lkml.kernel.org/r/20240614015138.31461-3-ioworker0@gmail.com Signed-off-by: Lance Yang <ioworker0@gmail.com> Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Suggested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: Zi Yan <ziy@nvidia.com> Cc: Bang Li <libang.li@antgroup.com> Cc: Barry Song <baohua@kernel.org> Cc: Fangrui Song <maskray@google.com> Cc: Jeff Xie <xiehuan09@gmail.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: SeongJae Park <sj@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Yin Fengwei <fengwei.yin@intel.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm/highmem: make nr_free_highpages() return "unsigned long"David Hildenbrand
It looks rather weird that totalhigh_pages() returns an "unsigned long" but nr_free_highpages() returns an "unsigned int". Let's return an "unsigned long" from nr_free_highpages() to be consistent. While at it, use a plain "0" instead of a "0UL" in the !CONFIG_HIGHMEM totalhigh_pages() implementation, to make these look alike as well. Link: https://lkml.kernel.org/r/20240607083711.62833-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm/highmem: reimplement totalhigh_pages() by walking zonesDavid Hildenbrand
Patch series "mm/highmem: don't track highmem pages manually". Let's remove highmem special-casing from adjust_managed_page_count(), to result in less confusion why memblock manually adjusts totalram_pages, and __free_pages_core() only adjusts the zone's managed pages -- what about the highmem pages that adjust_managed_page_count() updates? Now, we only maintain totalram_pages and a zone's managed pages independent of highmem support. We can derive the number of highmem pages simply by looking at the relevant zone's managed pages. I don't think there is any particular fast path that needs a maximum-efficient totalhigh_pages() implementation. Note that highmem memory is currently initialized using free_highmem_page()->free_reserved_page(), not __free_pages_core(). In the future we might want to also use __free_pages_core() to initialize highmem memory, to make that less special, and consider moving totalram_pages updates into __free_pages_core() [1], so we can just use adjust_managed_page_count() in there as well. Booting a simple kernel in QEMU reveals no highmem accounting change: Before: Memory: 3095448K/3145208K available (14802K kernel code, 2073K rwdata, 5000K rodata, 740K init, 556K bss, 49760K reserved, 0K cma-reserved, 2244488K highmem) After: Memory: 3095276K/3145208K available (14802K kernel code, 2073K rwdata, 5000K rodata, 740K init, 556K bss, 49932K reserved, 0K cma-reserved, 2244488K highmem) [1] https://lkml.kernel.org/r/20240601133402.2675-1-richard.weiyang@gmail.com This patch (of 2): Can we get rid of the highmem ifdef in adjust_managed_page_count()? Likely yes: we don't have that many totalhigh_pages() users, and they all don't seem to be very performance critical. So let's implement totalhigh_pages() like nr_free_highpages(), collecting information from all zones. This is now similar to what we do in si_meminfo_node() to collect the per-node highmem page count. In the common case (single node, 3-4 zones), we really shouldn't care. We could optimize a bit further (only walk ZONE_HIGHMEM and ZONE_MOVABLE if required), but there doesn't seem a real need for that. [david@redhat.com: fix build bot complaint] Link: https://lkml.kernel.org/r/b57e5bc4-eb72-40e3-add4-57dfa6e03df6@redhat.com Link: https://lkml.kernel.org/r/20240607083711.62833-1-david@redhat.com Link: https://lkml.kernel.org/r/20240607083711.62833-2-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03fs/proc: move page_mapcount() to fs/proc/internal.hDavid Hildenbrand
... and rename it to folio_precise_page_mapcount(). fs/proc is the last remaining user, and that should stay that way. While at it, cleanup kpagecount_read() a bit: there are still some legacy leftovers -- when the interface was introduced it returned the page refcount, but was changed briefly afterwards to return the page mapcount. Further, some simple folio conversion. Once we stop using the per-page mapcounts of large folios, all folio_precise_page_mapcount() users will have to implement an alternative way to achieve what they are trying to achieve, possibly in a less precise way. [dan.carpenter@linaro.org: fix uninitialized variable in pagemap_pmd_range()] Link: https://lkml.kernel.org/r/9d6eaba7-92f8-4a70-8765-38a519680a87@moroto.mountain Link: https://lkml.kernel.org/r/20240607122357.115423-6-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Oscar Salvador <osalvador@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm: shmem: add mTHP counters for anonymous shmemBaolin Wang
Add mTHP counters for anonymous shmem. [baolin.wang@linux.alibaba.com: update Documentation/admin-guide/mm/transhuge.rst] Link: https://lkml.kernel.org/r/d86e2e7f-4141-432b-b2ba-c6691f36ef0b@linux.alibaba.com Link: https://lkml.kernel.org/r/4fd9e467d49ae4a747e428bcd821c7d13125ae67.1718090413.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Lance Yang <ioworker0@gmail.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Daniel Gomez <da.gomez@samsung.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Pankaj Raghav <p.raghav@samsung.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm: shmem: add mTHP support for anonymous shmemBaolin Wang
Commit 19eaf44954df adds multi-size THP (mTHP) for anonymous pages, that can allow THP to be configured through the sysfs interface located at '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'. However, the anonymous shmem will ignore the anonymous mTHP rule configured through the sysfs interface, and can only use the PMD-mapped THP, that is not reasonable. Users expect to apply the mTHP rule for all anonymous pages, including the anonymous shmem, in order to enjoy the benefits of mTHP. For example, lower latency than PMD-mapped THP, smaller memory bloat than PMD-mapped THP, contiguous PTEs on ARM architecture to reduce TLB miss etc. In addition, the mTHP interfaces can be extended to support all shmem/tmpfs scenarios in the future, especially for the shmem mmap() case. The primary strategy is similar to supporting anonymous mTHP. Introduce a new interface '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled', which can have almost the same values as the top-level '/sys/kernel/mm/transparent_hugepage/shmem_enabled', with adding a new additional "inherit" option and dropping the testing options 'force' and 'deny'. By default all sizes will be set to "never" except PMD size, which is set to "inherit". This ensures backward compatibility with the anonymous shmem enabled of the top level, meanwhile also allows independent control of anonymous shmem enabled for each mTHP. Link: https://lkml.kernel.org/r/65796c1e72e51e15f3410195b5c2d5b6c160d411.1718090413.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Daniel Gomez <da.gomez@samsung.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Pankaj Raghav <p.raghav@samsung.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm: shmem: add multi-size THP sysfs interface for anonymous shmemBaolin Wang
To support the use of mTHP with anonymous shmem, add a new sysfs interface 'shmem_enabled' in the '/sys/kernel/mm/transparent_hugepage/hugepages-kB/' directory for each mTHP to control whether shmem is enabled for that mTHP, with a value similar to the top level 'shmem_enabled', which can be set to: "always", "inherit (to inherit the top level setting)", "within_size", "advise", "never". An 'inherit' option is added to ensure compatibility with these global settings, and the options 'force' and 'deny' are dropped, which are rather testing artifacts from the old ages. By default, PMD-sized hugepages have enabled="inherit" and all other hugepage sizes have enabled="never" for '/sys/kernel/mm/transparent_hugepage/hugepages-xxkB/shmem_enabled'. In addition, if top level value is 'force', then only PMD-sized hugepages have enabled="inherit", otherwise configuration will be failed and vice versa. That means now we will avoid using non-PMD sized THP to override the global huge allocation. [baolin.wang@linux.alibaba.com: fix transhuge.rst indentation] Link: https://lkml.kernel.org/r/b189d815-998b-4dfd-ba89-218ff51313f8@linux.alibaba.com [akpm@linux-foundation.org: reflow transhuge.rst addition to 80 cols] [baolin.wang@linux.alibaba.com: move huge_shmem_orders_lock under CONFIG_SYSFS] Link: https://lkml.kernel.org/r/eb34da66-7f12-44f3-a39e-2bcc90c33354@linux.alibaba.com [akpm@linux-foundation.org: huge_memory.c needs mm_types.h] Link: https://lkml.kernel.org/r/ffddfa8b3cb4266ff963099ab78cfd7184c57ac7.1718090413.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Daniel Gomez <da.gomez@samsung.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Pankaj Raghav <p.raghav@samsung.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03percpu: add __this_cpu_try_cmpxchg()Uros Bizjak
Add __this_cpu_try_cmpxchg() version of the percpu op. Link: https://lkml.kernel.org/r/20240528144345.5980-1-ubizjak@gmail.com Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Acked-by: Dennis Zhou <dennis@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03memcg: rearrange fields of mem_cgroup_per_nodeShakeel Butt
Kernel test robot reported [1] performance regression for will-it-scale test suite's page_fault2 test case for the commit 70a64b7919cb ("memcg: dynamically allocate lruvec_stats"). After inspection it seems like the commit has unintentionally introduced false cache sharing. After the commit the fields of mem_cgroup_per_node which get read on the performance critical path share the cacheline with the fields which get updated often on LRU page allocations or deallocations. This has caused contention on that cacheline and the workloads which manipulates a lot of LRU pages are regressed as reported by the test report. The solution is to rearrange the fields of mem_cgroup_per_node such that the false sharing is eliminated. Let's move all the read only pointers at the start of the struct, followed by memcg-v1 only fields and at the end fields which get updated often. Experiment setup: Ran fallocate1, fallocate2, page_fault1, page_fault2 and page_fault3 from the will-it-scale test suite inside a three level memcg with /tmp mounted as tmpfs on two different machines, one a single numa node and the other one, two node machine. $ ./[testcase]_processes -t $NR_CPUS -s 50 Results for single node, 52 CPU machine: Testcase base with-patch fallocate1 1031081 1431291 (38.80 %) fallocate2 1029993 1421421 (38.00 %) page_fault1 2269440 3405788 (50.07 %) page_fault2 2375799 3572868 (50.30 %) page_fault3 28641143 28673950 ( 0.11 %) Results for dual node, 80 CPU machine: Testcase base with-patch fallocate1 2976288 3641185 (22.33 %) fallocate2 2979366 3638181 (22.11 %) page_fault1 6221790 7748245 (24.53 %) page_fault2 6482854 7847698 (21.05 %) page_fault3 28804324 28991870 ( 0.65 %) Link: https://lkml.kernel.org/r/20240528164050.2625718-1-shakeel.butt@linux.dev Fixes: 70a64b7919cb ("memcg: dynamically allocate lruvec_stats") Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Reported-by: kernel test robot <oliver.sang@intel.com> Reviewed-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Feng Tang <feng.tang@intel.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Yin Fengwei <fengwei.yin@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm/hugetlb: mm/memory_hotplug: use a folio in scan_movable_pages()Sidhartha Kumar
By using a folio in scan_movable_pages() we convert the last user of the page-based hugetlb information macro functions to the folio version. After this conversion, we can safely remove the page-based definitions from include/linux/hugetlb.h. Link: https://lkml.kernel.org/r/20240530171427.242018-1-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm: introduce arch_do_swap_page_nr() which allows restore metadata for nr pagesBarry Song
Should do_swap_page() have the capability to directly map a large folio, metadata restoration becomes necessary for a specified number of pages denoted as nr. It's important to highlight that metadata restoration is solely required by the SPARC platform, which, however, does not enable THP_SWAP. Consequently, in the present kernel configuration, there exists no practical scenario where users necessitate the restoration of nr metadata. Platforms implementing THP_SWAP might invoke this function with nr values exceeding 1, subsequent to do_swap_page() successfully mapping an entire large folio. Nonetheless, their arch_do_swap_page_nr() functions remain empty. Link: https://lkml.kernel.org/r/20240529082824.150954-5-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Chris Li <chrisl@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Chuanhua Han <hanchuanhua@oppo.com> Cc: David Hildenbrand <david@redhat.com> Cc: Gao Xiang <xiang@kernel.org> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <kasong@tencent.com> Cc: Len Brown <len.brown@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm: remove the implementation of swap_free() and always use swap_free_nr()Barry Song
To streamline maintenance efforts, we propose removing the implementation of swap_free(). Instead, we can simply invoke swap_free_nr() with nr set to 1. swap_free_nr() is designed with a bitmap consisting of only one long, resulting in overhead that can be ignored for cases where nr equals 1. A prime candidate for leveraging swap_free_nr() lies within kernel/power/swap.c. Implementing this change facilitates the adoption of batch processing for hibernation. Link: https://lkml.kernel.org/r/20240529082824.150954-3-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Suggested-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Acked-by: Chris Li <chrisl@kernel.org> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Pavel Machek <pavel@ucw.cz> Cc: Len Brown <len.brown@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Chuanhua Han <hanchuanhua@oppo.com> Cc: David Hildenbrand <david@redhat.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Gao Xiang <xiang@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <kasong@tencent.com> Cc: Khalid Aziz <khalid.aziz@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm: swap: introduce swap_free_nr() for batched swap_free()Chuanhua Han
Patch series "large folios swap-in: handle refault cases first", v5. This patchset is extracted from the large folio swapin series[1], primarily addressing the handling of scenarios involving large folios in the swap cache. Currently, it is particularly focused on addressing the refaulting of mTHP, which is still undergoing reclamation. This approach aims to streamline code review and expedite the integration of this segment into the MM tree. It relies on Ryan's swap-out series[2], leveraging the helper function swap_pte_batch() introduced by that series. Presently, do_swap_page only encounters a large folio in the swap cache before the large folio is released by vmscan. However, the code should remain equally useful once we support large folio swap-in via swapin_readahead(). This approach can effectively reduce page faults and eliminate most redundant checks and early exits for MTE restoration in recent MTE patchset[3]. The large folio swap-in for SWP_SYNCHRONOUS_IO and swapin_readahead() will be split into separate patch sets and sent at a later time. [1] https://lore.kernel.org/linux-mm/20240304081348.197341-1-21cnbao@gmail.com/ [2] https://lore.kernel.org/linux-mm/20240408183946.2991168-1-ryan.roberts@arm.com/ [3] https://lore.kernel.org/linux-mm/20240322114136.61386-1-21cnbao@gmail.com/ This patch (of 6): While swapping in a large folio, we need to free swaps related to the whole folio. To avoid frequently acquiring and releasing swap locks, it is better to introduce an API for batched free. Furthermore, this new function, swap_free_nr(), is designed to efficiently handle various scenarios for releasing a specified number, nr, of swap entries. Link: https://lkml.kernel.org/r/20240529082824.150954-1-21cnbao@gmail.com Link: https://lkml.kernel.org/r/20240529082824.150954-2-21cnbao@gmail.com Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com> Co-developed-by: Barry Song <v-songbaohua@oppo.com> Signed-off-by: Barry Song <v-songbaohua@oppo.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: Chris Li <chrisl@kernel.org> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Gao Xiang <xiang@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <kasong@tencent.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Khalid Aziz <khalid.aziz@oracle.com> Cc: Len Brown <len.brown@intel.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm: remove MIGRATE_SYNC_NO_COPY modeKefeng Wang
Commit 2916ecc0f9d4 ("mm/migrate: new migrate mode MIGRATE_SYNC_NO_COPY") introduce a new MIGRATE_SYNC_NO_COPY mode to allow to offload the copy to a device DMA engine, which is only used __migrate_device_pages() to decide whether or not copy the old page, and the MIGRATE_SYNC_NO_COPY mode only set in hmm, as the MIGRATE_SYNC_NO_COPY set is removed by previous cleanup, it seems that we could remove the unnecessary MIGRATE_SYNC_NO_COPY. Link: https://lkml.kernel.org/r/20240524052843.182275-6-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Jane Chu <jane.chu@oracle.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Jiaqi Yan <jiaqiyan@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm: migrate: remove migrate_folio_extra()Kefeng Wang
migrate_folio_extra() is only called in migrate.c now, convert it a static function and take a new src_private argument which could be shared by migrate_folio() and filemap_migrate_folio() to simplify code a bit. Link: https://lkml.kernel.org/r/20240524052843.182275-5-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Jane Chu <jane.chu@oracle.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Jiaqi Yan <jiaqiyan@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03rmap: remove DEFINE_PAGE_VMA_WALK()Kefeng Wang
This are no users since commit 40d707f33db5 ("mm/ksm: use folio in write_protect_page"), so remove DEFINE_PAGE_VMA_WALK(). Link: https://lkml.kernel.org/r/20240524053618.208895-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Alex Shi <alexs@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm: remove page_mapping()Matthew Wilcox (Oracle)
All callers are now converted, delete this compatibility wrapper. Also fix up some comments which referred to page_mapping. Link: https://lkml.kernel.org/r/20240423225552.4113447-7-willy@infradead.org Link: https://lkml.kernel.org/r/20240524181813.698813-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm: memcontrol: remove page_memcg()Kefeng Wang
The page_memcg() only called by mod_memcg_page_state(), so squash it to cleanup page_memcg(). Link: https://lkml.kernel.org/r/20240524014950.187805-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm/mm_init: use node's number of cpus in deferred_page_init_max_threadsEric Chanudet
x86_64 is already using the node's cpu as maximum threads. Make that the default for all archs setting DEFERRED_STRUCT_PAGE_INIT. This returns to the behavior prior making the function arch-specific with commit ecd096506922 ("mm: make deferred init's max threads arch-specific"). Setting DEFERRED_STRUCT_PAGE_INIT and testing on a few arm64 platforms shows faster deferred_init_memmap completions: | | x13s | SA8775p-ride | Ampere R137-P31 | Ampere HR330 | | | Metal, 32GB | VM, 36GB | VM, 58GB | Metal, 128GB | | | 8cpus | 8cpus | 8cpus | 32cpus | |---------|-------------|--------------|-----------------|--------------| | threads | ms (%) | ms (%) | ms (%) | ms (%) | |---------|-------------|--------------|-----------------|--------------| | 1 | 108 (0%) | 72 (0%) | 224 (0%) | 324 (0%) | | cpus | 24 (-77%) | 36 (-50%) | 40 (-82%) | 56 (-82%) | Michael Ellerman reported: : On a machine here (1TB, 40 cores, 4KB pages) the existing code gives: : : [ 0.500124] node 2 deferred pages initialised in 210ms : [ 0.515790] node 3 deferred pages initialised in 230ms : [ 0.516061] node 0 deferred pages initialised in 230ms : [ 0.516522] node 7 deferred pages initialised in 230ms : [ 0.516672] node 4 deferred pages initialised in 230ms : [ 0.516798] node 6 deferred pages initialised in 230ms : [ 0.517051] node 5 deferred pages initialised in 230ms : [ 0.523887] node 1 deferred pages initialised in 240ms : : vs with the patch: : : [ 0.379613] node 0 deferred pages initialised in 90ms : [ 0.380388] node 1 deferred pages initialised in 90ms : [ 0.380540] node 4 deferred pages initialised in 100ms : [ 0.390239] node 6 deferred pages initialised in 100ms : [ 0.390249] node 2 deferred pages initialised in 100ms : [ 0.390786] node 3 deferred pages initialised in 110ms : [ 0.396721] node 5 deferred pages initialised in 110ms : [ 0.397095] node 7 deferred pages initialised in 110ms : : Which is a nice speedup. [echanude@redhat.com: v3] Link: https://lkml.kernel.org/r/20240528185455.643227-4-echanude@redhat.com Link: https://lkml.kernel.org/r/20240522203758.626932-4-echanude@redhat.com Signed-off-by: Eric Chanudet <echanude@redhat.com> Tested-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Reviewed-by: Baoquan He <bhe@redhat.com> Acked-by: Alexander Gordeev <agordeev@linux.ibm.com> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm/memory-failure: improve memory failure action_result messagesJane Chu
Added two explicit MF_MSG messages describing failure in get_hwpoison_page. Attemped to document the definition of various action names, and made a few adjustment to the action_result() calls. Link: https://lkml.kernel.org/r/20240524215306.2705454-4-jane.chu@oracle.com Signed-off-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Oscar Salvador <oalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm: implement update_mmu_tlb() using update_mmu_tlb_range()Bang Li
Let's make update_mmu_tlb() simply a generic wrapper around update_mmu_tlb_range(). Only the latter can now be overridden by the architecture. We can now remove __HAVE_ARCH_UPDATE_MMU_TLB as well. Link: https://lkml.kernel.org/r/20240522061204.117421-3-libang.li@antgroup.com Signed-off-by: Bang Li <libang.li@antgroup.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Chris Zankel <chris@zankel.net> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Lance Yang <ioworker0@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm: add update_mmu_tlb_range()Bang Li
Patch series "Add update_mmu_tlb_range() to simplify code", v4. This series of commits mainly adds the update_mmu_tlb_range() to batch update tlb in an address range and implement update_mmu_tlb() using update_mmu_tlb_range(). After commit 19eaf44954df ("mm: thp: support allocation of anonymous multi-size THP"), We may need to batch update tlb of a certain address range by calling update_mmu_tlb() in a loop. Using the update_mmu_tlb_range(), we can simplify the code and possibly reduce the execution of some unnecessary code in some architectures. This patch (of 3): Add update_mmu_tlb_range(), we can batch update tlb of an address range. Link: https://lkml.kernel.org/r/20240522061204.117421-1-libang.li@antgroup.com Link: https://lkml.kernel.org/r/20240522061204.117421-2-libang.li@antgroup.com Signed-off-by: Bang Li <libang.li@antgroup.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Chris Zankel <chris@zankel.net> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Lance Yang <ioworker0@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm/rmap: sanity check that zeropages are not passed to RMAPDavid Hildenbrand
Using insert_page() we might have previously ended up passing the zeropage into rmap code. Make sure that won't happen again. Note that we won't check the huge zeropage for now, which might still end up in RMAP code. Link: https://lkml.kernel.org/r/20240522125713.775114-4-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Vincent Donnefort <vdonnefort@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>