path: root/mm
Age         Commit message                                              Author
2024-07-03  mm: remove MIGRATE_SYNC_NO_COPY mode  [Kefeng Wang]
Commit 2916ecc0f9d4 ("mm/migrate: new migrate mode MIGRATE_SYNC_NO_COPY") introduced the MIGRATE_SYNC_NO_COPY mode to allow offloading the copy to a device DMA engine. The mode is only consulted by __migrate_device_pages() to decide whether or not to copy the old page, and it was only ever set by hmm. Since the previous cleanups removed the last place that sets MIGRATE_SYNC_NO_COPY, the now-unused mode can be removed. Link: https://lkml.kernel.org/r/20240524052843.182275-6-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Jane Chu <jane.chu@oracle.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Jiaqi Yan <jiaqiyan@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03  mm: migrate: remove migrate_folio_extra()  [Kefeng Wang]
migrate_folio_extra() is now only called within migrate.c, so convert it to a static function and give it a new src_private argument, which lets migrate_folio() and filemap_migrate_folio() share the same implementation and simplifies the code a bit. Link: https://lkml.kernel.org/r/20240524052843.182275-5-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Jane Chu <jane.chu@oracle.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Jiaqi Yan <jiaqiyan@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
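For context, a rough sketch of the shape this describes; the static helper name and the exact handling of src_private below are illustrative, not necessarily the upstream code:

  static int __migrate_folio(struct address_space *mapping, struct folio *dst,
                             struct folio *src, void *src_private,
                             enum migrate_mode mode)
  {
          int rc = folio_migrate_mapping(mapping, dst, src, 0);

          if (rc != MIGRATEPAGE_SUCCESS)
                  return rc;

          if (src_private)
                  folio_attach_private(dst, folio_detach_private(src));

          folio_migrate_copy(dst, src);
          return MIGRATEPAGE_SUCCESS;
  }

  /* migrate_folio() and filemap_migrate_folio() become thin wrappers */
  int migrate_folio(struct address_space *mapping, struct folio *dst,
                    struct folio *src, enum migrate_mode mode)
  {
          return __migrate_folio(mapping, dst, src, NULL, mode);
  }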
2024-07-03  mm: migrate_device: unify migrate folio for MIGRATE_SYNC_NO_COPY  [Kefeng Wang]
__migrate_device_pages() does not copy the page itself, which is why MIGRATE_SYNC_NO_COPY is passed into migrate_folio()/migrate_folio_extra(). A simpler approach is to call folio_migrate_mapping()/folio_migrate_flags() directly; convert to that to unify and simplify the device page migration path, which also removes the only caller passing MIGRATE_SYNC_NO_COPY. Link: https://lkml.kernel.org/r/20240524052843.182275-4-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Jane Chu <jane.chu@oracle.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Jiaqi Yan <jiaqiyan@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03  mm: migrate_device: use a newfolio in __migrate_device_pages()  [Kefeng Wang]
Use a newfolio instead of newpage and convert to more folio api in __migrate_device_pages(). Link: https://lkml.kernel.org/r/20240524052843.182275-3-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Jiaqi Yan <jiaqiyan@google.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03  mm: migrate: simplify __buffer_migrate_folio()  [Kefeng Wang]
Patch series "mm: cleanup MIGRATE_SYNC_NO_COPY mode". Commit 2916ecc0f9d4 ("mm/migrate: new migrate mode MIGRATE_SYNC_NO_COPY") introduce a new MIGRATE_SYNC_NO_COPY mode to allow to offload the copy to a device DMA engine, which is only used __migrate_device_pages() to decide whether or not copy the old page, and the MIGRATE_SYNC_NO_COPY mode only used in hmm, a easy way is just to call the folio_migrate_mapping() and folio_migrate_flags(), which help to remove the MIGRATE_SYNC_NO_COPY mode. This patch (of 5): Use filemap_migrate_folio() helper to simplify __buffer_migrate_folio(). Link: https://lkml.kernel.org/r/20240524052843.182275-1-wangkefeng.wang@huawei.com Link: https://lkml.kernel.org/r/20240524052843.182275-2-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Jiaqi Yan <jiaqiyan@google.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03  mm: remove page_mapping()  [Matthew Wilcox (Oracle)]
All callers are now converted, delete this compatibility wrapper. Also fix up some comments which referred to page_mapping. Link: https://lkml.kernel.org/r/20240423225552.4113447-7-willy@infradead.org Link: https://lkml.kernel.org/r/20240524181813.698813-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03  mm: memcontrol: remove page_memcg()  [Kefeng Wang]
page_memcg() is now only called by mod_memcg_page_state(), so squash it into its sole caller and remove page_memcg(). Link: https://lkml.kernel.org/r/20240524014950.187805-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03  mm/memory-failure: use helper llist_for_each_entry()  [Yifei Li]
Change the llist_for_each_entry_safe() call to llist_for_each_entry() and delete the now-unneeded next variable. Because the linked list is not modified, the _safe variant of the iterator is not required. No functional changes are intended. Link: https://lkml.kernel.org/r/20240513075830.2611-1-liyifei28@huawei.com Signed-off-by: Yifei Li <liyifei28@huawei.com> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Jiaqi Yan <jiaqiyan@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shiyang Ruan <ruansy.fnst@fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
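The change boils down to the following pattern; the structure and member names here are illustrative of mm/memory-failure.c's raw hwpoison list, not a verbatim diff:

  /* before: the _safe variant carries an extra cursor, which is only
   * required when entries may be deleted while iterating */
  struct raw_hwp_page *p, *next;

  llist_for_each_entry_safe(p, next, head, node)
          count++;

  /* after: the list is only read here, so the plain iterator is enough
   * and the "next" variable can be dropped */
  llist_for_each_entry(p, head, node)
          count++;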
2024-07-03  mm/zsmalloc: add MODULE_DESCRIPTION()  [Jeff Johnson]
Fix the 'make W=1' warning: WARNING: modpost: missing MODULE_DESCRIPTION() in mm/zsmalloc.o Link: https://lkml.kernel.org/r/20240513-mm-md-v1-4-8c20e7d26842@quicinc.com Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Marco Elver <elver@google.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03  mm/kfence: add MODULE_DESCRIPTION()  [Jeff Johnson]
Fix the 'make W=1' warning: WARNING: modpost: missing MODULE_DESCRIPTION() in mm/kfence/kfence_test.o Link: https://lkml.kernel.org/r/20240513-mm-md-v1-3-8c20e7d26842@quicinc.com Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com> Reviewed-by: Marco Elver <elver@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03  mm/dmapool: add MODULE_DESCRIPTION()  [Jeff Johnson]
Fix the 'make W=1' warning: WARNING: modpost: missing MODULE_DESCRIPTION() in mm/dmapool_test.o Link: https://lkml.kernel.org/r/20240513-mm-md-v1-2-8c20e7d26842@quicinc.com Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Marco Elver <elver@google.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03  mm/hwpoison: add MODULE_DESCRIPTION()  [Jeff Johnson]
Patch series "mm: add missing MODULE_DESCRIPTION() macros". This fixes the instances of "WARNING: modpost: missing MODULE_DESCRIPTION()" that I'm seeing in mm/. This patch (of 4): Fix the 'make W=1' warning: WARNING: modpost: missing MODULE_DESCRIPTION() in mm/hwpoison-inject.o Link: https://lkml.kernel.org/r/20240513-mm-md-v1-0-8c20e7d26842@quicinc.com Link: https://lkml.kernel.org/r/20240513-mm-md-v1-1-8c20e7d26842@quicinc.com Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Marco Elver <elver@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03  mm/mm_init: use node's number of cpus in deferred_page_init_max_threads  [Eric Chanudet]
x86_64 is already using the node's CPU count as the maximum number of threads. Make that the default for all archs setting DEFERRED_STRUCT_PAGE_INIT. This returns to the behavior prior to making the function arch-specific with commit ecd096506922 ("mm: make deferred init's max threads arch-specific"). Setting DEFERRED_STRUCT_PAGE_INIT and testing on a few arm64 platforms shows faster deferred_init_memmap completions:
  |         | x13s        | SA8775p-ride | Ampere R137-P31 | Ampere HR330 |
  |         | Metal, 32GB | VM, 36GB     | VM, 58GB        | Metal, 128GB |
  |         | 8cpus       | 8cpus        | 8cpus           | 32cpus       |
  |---------|-------------|--------------|-----------------|--------------|
  | threads | ms (%)      | ms (%)       | ms (%)          | ms (%)       |
  |---------|-------------|--------------|-----------------|--------------|
  | 1       | 108 (0%)    | 72 (0%)      | 224 (0%)        | 324 (0%)     |
  | cpus    | 24 (-77%)   | 36 (-50%)    | 40 (-82%)       | 56 (-82%)    |
Michael Ellerman reported:
  : On a machine here (1TB, 40 cores, 4KB pages) the existing code gives:
  :
  : [ 0.500124] node 2 deferred pages initialised in 210ms
  : [ 0.515790] node 3 deferred pages initialised in 230ms
  : [ 0.516061] node 0 deferred pages initialised in 230ms
  : [ 0.516522] node 7 deferred pages initialised in 230ms
  : [ 0.516672] node 4 deferred pages initialised in 230ms
  : [ 0.516798] node 6 deferred pages initialised in 230ms
  : [ 0.517051] node 5 deferred pages initialised in 230ms
  : [ 0.523887] node 1 deferred pages initialised in 240ms
  :
  : vs with the patch:
  :
  : [ 0.379613] node 0 deferred pages initialised in 90ms
  : [ 0.380388] node 1 deferred pages initialised in 90ms
  : [ 0.380540] node 4 deferred pages initialised in 100ms
  : [ 0.390239] node 6 deferred pages initialised in 100ms
  : [ 0.390249] node 2 deferred pages initialised in 100ms
  : [ 0.390786] node 3 deferred pages initialised in 110ms
  : [ 0.396721] node 5 deferred pages initialised in 110ms
  : [ 0.397095] node 7 deferred pages initialised in 110ms
  :
  : Which is a nice speedup.
[echanude@redhat.com: v3] Link: https://lkml.kernel.org/r/20240528185455.643227-4-echanude@redhat.com Link: https://lkml.kernel.org/r/20240522203758.626932-4-echanude@redhat.com Signed-off-by: Eric Chanudet <echanude@redhat.com> Tested-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Reviewed-by: Baoquan He <bhe@redhat.com> Acked-by: Alexander Gordeev <agordeev@linux.ibm.com> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
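A sketch of the generic helper this implies; the clamp to at least one thread is an assumption about how an empty node cpumask would be handled:

  /* mm/mm_init.c: default for every arch with DEFERRED_STRUCT_PAGE_INIT */
  static unsigned int __init
  deferred_page_init_max_threads(const struct cpumask *node_cpumask)
  {
          /* use one thread per CPU local to the node being initialised */
          return max_t(unsigned int, cpumask_weight(node_cpumask), 1);
  }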
2024-07-03  mm: batch unlink_file_vma calls in free_pgd_range  [Mateusz Guzik]
Execs of dynamically linked binaries at 20-ish cores are bottlenecked on the i_mmap_rwsem semaphore, while the biggest singular contributor is free_pgd_range inducing the lock acquire back-to-back for all consecutive mappings of a given file. Tracing the count of said acquires while building the kernel shows: [1, 2) 799579 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| [2, 3) 0 | | [3, 4) 3009 | | [4, 5) 3009 | | [5, 6) 326442 |@@@@@@@@@@@@@@@@@@@@@ | So in particular there were 326442 opportunities to coalesce 5 acquires into 1. Doing so increases execs per second by 4% (~50k to ~52k) when running the benchmark linked below. The lock remains the main bottleneck, I have not looked at other spots yet. Bench can be found here: http://apollo.backplane.com/DFlyMisc/doexec.c $ cc -O2 -o shared-doexec doexec.c $ ./shared-doexec $(nproc) Note this particular test makes sure binaries are separate, but the loader is shared. Stats collected on the patched kernel (+ "noinline") with: bpftrace -e 'kprobe:unlink_file_vma_batch_process { @ = lhist(((struct unlink_vma_file_batch *)arg0)->count, 0, 8, 1); }' Link: https://lkml.kernel.org/r/20240521234321.359501-1-mjguzik@gmail.com Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
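The batching structure referenced by the bpftrace probe above might look roughly like this; the batch size and the __remove_shared_vm_struct() call are assumptions for illustration, not the upstream hunk:

  struct unlink_vma_file_batch {
          int count;
          struct vm_area_struct *vmas[8];
  };

  static void unlink_file_vma_batch_process(struct unlink_vma_file_batch *vb)
  {
          struct address_space *mapping;
          int i;

          if (vb->count == 0)
                  return;
          /* one i_mmap_rwsem acquire covers every queued VMA of this file */
          mapping = vb->vmas[0]->vm_file->f_mapping;
          i_mmap_lock_write(mapping);
          for (i = 0; i < vb->count; i++)
                  __remove_shared_vm_struct(vb->vmas[i], mapping);
          i_mmap_unlock_write(mapping);
          vb->count = 0;
  }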
2024-07-03  mm/memory-failure: send SIGBUS in the event of thp split fail  [Jane Chu]
While handling hwpoison in a THP page, it is possible that try_to_split_thp_page() fails, for example when the THP page has been RDMA pinned. At that point the kernel cannot isolate the poisoned THP page; all it can do is send a SIGBUS to the user process with a meaningful payload to give user-level recovery a chance. Link: https://lkml.kernel.org/r/20240524215306.2705454-6-jane.chu@oracle.com Signed-off-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Oscar Salvador <oalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03  mm/memory-failure: move hwpoison_filter() higher up  [Jane Chu]
Move hwpoison_filter() higher up, as there is no need to spend a lot of cycles only to find out later that the page is supposed to be skipped from hwpoison handling. Link: https://lkml.kernel.org/r/20240524215306.2705454-5-jane.chu@oracle.com Signed-off-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Oscar Salvador <oalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03  mm/memory-failure: improve memory failure action_result messages  [Jane Chu]
Added two explicit MF_MSG messages describing failure in get_hwpoison_page(). Attempted to document the definition of the various action names, and made a few adjustments to the action_result() calls. Link: https://lkml.kernel.org/r/20240524215306.2705454-4-jane.chu@oracle.com Signed-off-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Oscar Salvador <oalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03  mm/madvise: add MF_ACTION_REQUIRED to madvise(MADV_HWPOISON)  [Jane Chu]
The soft hwpoison injector via madvise(MADV_HWPOISON) operates in a synchronous way, in the sense that the injector is also a process under test: should it have the poisoned page mapped in its address space, it should get killed just as in a real UE situation. Doing so aligns with what the madvise(2) man page says: "This operation may result in the calling process receiving a SIGBUS and the page being unmapped." Link: https://lkml.kernel.org/r/20240524215306.2705454-3-jane.chu@oracle.com Signed-off-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Oscar Salvador <oalvador@suse.de> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03  mm/memory-failure: try to send SIGBUS even if unmap failed  [Jane Chu]
Patch series "Enhance soft hwpoison handling and injection", v4. This series is aimed at the following enhancements: - Let one hwpoison injector, that is, madvise(MADV_HWPOISON) to behave more like as if a real UE occurred. Because the other two injectors such as hwpoison-inject and the 'einj' on x86 can't, and it seems to me we need a better simulation to real UE scenario. - For years, if the kernel is unable to unmap a hwpoisoned page, it send a SIGKILL instead of SIGBUS to prevent user process from potentially accessing the page again. But in doing so, the user process also lose important information: vaddr, for recovery. Fortunately, the kernel already has code to kill process re-accessing a hwpoisoned page, so remove the '!unmap_success' check. - Right now, if a thp page under GUP longterm pin is hwpoisoned, and kernel cannot split the thp page, memory-failure simply ignores the UE and returns. That's not ideal, it could deliver a SIGBUS with useful information for userspace recovery. This patch (of 5): For years when it comes down to kill a process due to hwpoison, a SIGBUS is delivered only if unmap has been successful. Otherwise, a SIGKILL is delivered. And the reason for that is to prevent the involved process from accessing the hwpoisoned page again. Since then a lot has changed, a hwpoisoned page is marked and upon being re-accessed, the memory-failure handler invokes kill_accessing_process() to kill the process immediately. So let's take out the '!unmap_success' factor and try to deliver SIGBUS if possible. Link: https://lkml.kernel.org/r/20240524215306.2705454-1-jane.chu@oracle.com Link: https://lkml.kernel.org/r/20240524215306.2705454-2-jane.chu@oracle.com Signed-off-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Oscar Salvador <oalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03  mm: use update_mmu_tlb_range() to simplify code  [Bang Li]
Let us simplify the code by using update_mmu_tlb_range(). Link: https://lkml.kernel.org/r/20240522061204.117421-4-libang.li@antgroup.com Signed-off-by: Bang Li <libang.li@antgroup.com> Reviewed-by: Lance Yang <ioworker0@gmail.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Chris Zankel <chris@zankel.net> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
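The simplification follows this pattern; the surrounding loop is illustrative of the converted call sites, not a verbatim diff:

  /* before: one call per subpage of the folio */
  for (i = 0; i < nr_pages; i++)
          update_mmu_tlb(vma, addr + i * PAGE_SIZE, ptep + i);

  /* after: a single ranged call */
  update_mmu_tlb_range(vma, addr, ptep, nr_pages);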
2024-07-03  mm/memory: cleanly support zeropage in vm_insert_page*(), vm_map_pages*() and vmf_insert_mixed()  [David Hildenbrand]
For now we only get the (small) zeropage mapped to user space in four cases (excluding VM_PFNMAP mappings, such as /proc/vmstat):
(1) Read page faults in anonymous VMAs (MAP_PRIVATE|MAP_ANON): do_anonymous_page() will not refcount it and map it pte_mkspecial().
(2) UFFDIO_ZEROPAGE on anonymous VMA or COW mapping of shmem (MAP_PRIVATE). mfill_atomic_pte_zeropage() will not refcount it and map it pte_mkspecial().
(3) KSM in mergeable VMA (anonymous VMA or COW mapping). cmp_and_merge_page() will not refcount it and map it pte_mkspecial().
(4) FSDAX as an optimization for holes. vmf_insert_mixed()->__vm_insert_mixed() might end up calling insert_page() without CONFIG_ARCH_HAS_PTE_SPECIAL, refcounting the zeropage and not mapping it pte_mkspecial(). With CONFIG_ARCH_HAS_PTE_SPECIAL, we'll call insert_pfn() where we will not refcount it and map it pte_mkspecial().
In case (4), we might not have VM_MIXEDMAP set: while fs/fuse/dax.c sets VM_MIXEDMAP, we removed it for ext4 fsdax in commit e1fb4a086495 ("dax: remove VM_MIXEDMAP for fsdax and device dax") and for XFS in commit e1fb4a086495 ("dax: remove VM_MIXEDMAP for fsdax and device dax").
Without CONFIG_ARCH_HAS_PTE_SPECIAL and with VM_MIXEDMAP, vm_normal_page() would currently return the zeropage. We'll refcount the zeropage when mapping and when unmapping. Without CONFIG_ARCH_HAS_PTE_SPECIAL and without VM_MIXEDMAP, vm_normal_page() would currently refuse to return the zeropage. So we'd refcount it when mapping but not when unmapping it ... do we have fsdax without CONFIG_ARCH_HAS_PTE_SPECIAL in practice? Hard to tell. Independent of that, we should never refcount the zeropage when we might be holding that reference for a long time, because even without an accounting imbalance we might overflow the refcount. As there is interest in using the zeropage also in other VM_MIXEDMAP mappings, let's add clean support for that in the cases where it makes sense:
(A) Never refcount the zeropage when mapping it: In insert_page(), special-case the zeropage, do not refcount it, and use pte_mkspecial(). Don't involve insert_pfn(); adjusting insert_page() looks cleaner than branching off to insert_pfn().
(B) Never refcount the zeropage when unmapping it: In vm_normal_page(), also don't return the zeropage in a VM_MIXEDMAP mapping without CONFIG_ARCH_HAS_PTE_SPECIAL. Add a VM_WARN_ON_ONCE() sanity check if we'd ever return the zeropage, which could happen if someone forgets to set pte_mkspecial() when mapping the zeropage. Document that.
(C) Allow the zeropage only where reasonable: s390x never wants the zeropage in some processes running legacy KVM guests that make use of storage keys, so disallow that. Further, using the zeropage in COW mappings is unproblematic (just what we do for other COW mappings), because FAULT_FLAG_UNSHARE can just unshare it and GUP with FOLL_LONGTERM would work as expected. Similarly, mappings that can never have writable PTEs (implying no write faults) are also not problematic, because nothing could end up mapping the PTE writable by mistake later. But in case we could have writable PTEs, we'll only allow the zeropage in FSDAX VMAs, that are incompatible with GUP and are blocked there completely.
We'll always require the zeropage to be mapped with pte_special(). GUP-fast will reject the zeropage that way, but GUP-slow will allow it. (Note that GUP does not refcount the zeropage with FOLL_PIN, because there were issues with overflowing the refcount in the past). 
Add sanity checks to can_change_pte_writable() and wp_page_reuse(), to catch early during testing if we'd ever find a zeropage unexpectedly in code that wants to upgrade write permissions. Convert the BUG_ON in vm_mixed_ok() to an ordinary check and simply fail with VM_FAULT_SIGBUS, like we do for other sanity checks. Drop the stale comment regarding reserved pages from insert_page(). Note that: * we won't mess with VM_PFNMAP mappings for now. remap_pfn_range() and vmf_insert_pfn() would allow the zeropage in some cases and not refcount it. * vmf_insert_pfn*() will reject the zeropage in VM_MIXEDMAP mappings and we'll leave that alone for now. People can simply use one of the other interfaces. * we won't bother with the huge zeropage for now. It's never PTE-mapped and also GUP does not special-case it yet. Link: https://lkml.kernel.org/r/20240522125713.775114-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Vincent Donnefort <vdonnefort@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
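Point (A) above, special-casing the zeropage in insert_page(), could look roughly like this; the exact variable names and the placement inside insert_page() are assumptions:

  if (unlikely(is_zero_pfn(page_to_pfn(page)))) {
          /* never take a reference on the shared zeropage ... */
          pte_t zero_pte = pte_mkspecial(pfn_pte(page_to_pfn(page), prot));

          /* ... and always map it pte_special() so GUP-fast rejects it */
          set_pte_at(vma->vm_mm, addr, pte, zero_pte);
          return 0;
  }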
2024-07-03  mm/memory: move page_count() check into validate_page_before_insert()  [David Hildenbrand]
Patch series "mm/memory: cleanly support zeropage in vm_insert_page*(), vm_map_pages*() and vmf_insert_mixed()", v2. There is interest in mapping zeropages via vm_insert_pages() [1] into MAP_SHARED mappings. For now, we only get zeropages in MAP_SHARED mappings via vmf_insert_mixed() from FSDAX code, and I think it's a bit shaky in some cases because we refcount the zeropage when mapping it but not necessarily always when unmapping it ... and we should actually never refcount it. It's all a bit tricky, especially how zeropages in MAP_SHARED mappings interact with GUP (FOLL_LONGTERM), mprotect(), write-faults and s390x forbidding the shared zeropage (rewrite [2] s now upstream). This series tries to take the careful approach of only allowing the zeropage where it is likely safe to use (which should cover the existing FSDAX use case and [1]), preventing that it could accidentally get mapped writable during a write fault, mprotect() etc, and preventing issues with FOLL_LONGTERM in the future with other users. Tested with a patch from Vincent that uses the zeropage in context of [1]. [1] https://lkml.kernel.org/r/20240430111354.637356-1-vdonnefort@google.com [2] https://lkml.kernel.org/r/20240411161441.910170-1-david@redhat.com This patch (of 3): We'll now also cover the case where insert_page() is called from __vm_insert_mixed(), which sounds like the right thing to do. Link: https://lkml.kernel.org/r/20240522125713.775114-2-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Vincent Donnefort <vdonnefort@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03  mm/swap: reduce swap cache search space  [Kairui Song]
Currently we use one swap_address_space for every 64M chunk to reduce lock contention, this is like having a set of smaller swap files inside one swap device. But when doing swap cache look up or insert, we are still using the offset of the whole large swap device. This is OK for correctness, as the offset (key) is unique. But Xarray is specially optimized for small indexes, it creates the radix tree levels lazily to be just enough to fit the largest key stored in one Xarray. So we are wasting tree nodes unnecessarily. For 64M chunk it should only take at most 3 levels to contain everything. But if we are using the offset from the whole swap device, the offset (key) value will be way beyond 64M, and so will the tree level. Optimize this by using a new helper swap_cache_index to get a swap entry's unique offset in its own 64M swap_address_space. I see a ~1% performance gain in benchmark and actual workload with high memory pressure. Test with `time memhog 128G` inside a 8G memcg using 128G swap (ramdisk with SWP_SYNCHRONOUS_IO dropped, tested 3 times, results are stable. The test result is similar but the improvement is smaller if SWP_SYNCHRONOUS_IO is enabled, as swap out path can never skip swap cache): Before: 6.07user 250.74system 4:17.26elapsed 99%CPU (0avgtext+0avgdata 8373376maxresident)k 0inputs+0outputs (55major+33555018minor)pagefaults 0swaps After (1.8% faster): 6.08user 246.09system 4:12.58elapsed 99%CPU (0avgtext+0avgdata 8373248maxresident)k 0inputs+0outputs (54major+33555027minor)pagefaults 0swaps Similar result with MySQL and sysbench using swap: Before: 94055.61 qps After (0.8% faster): 94834.91 qps Radix tree slab usage is also very slightly lower. Link: https://lkml.kernel.org/r/20240521175854.96038-12-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Anna Schumaker <anna@kernel.org> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chao Yu <chao@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jeff Layton <jlayton@kernel.org> Cc: Marc Dionne <marc.dionne@auristor.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: NeilBrown <neilb@suse.de> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Xiubo Li <xiubli@redhat.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
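The helper described is small; a sketch follows, assuming the 64M chunking constants already used for swap_address_space (the mask macro name is an assumption):

  /* one struct address_space per 64M of swap space */
  #define SWAP_ADDRESS_SPACE_SHIFT        14
  #define SWAP_ADDRESS_SPACE_PAGES        (1 << SWAP_ADDRESS_SPACE_SHIFT)
  #define SWAP_ADDRESS_SPACE_MASK         (SWAP_ADDRESS_SPACE_PAGES - 1)

  /* key an entry by its offset within its own 64M swap_address_space,
   * keeping the Xarray indexes (and thus the tree depth) small */
  static inline pgoff_t swap_cache_index(swp_entry_t entry)
  {
          return swp_offset(entry) & SWAP_ADDRESS_SPACE_MASK;
  }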
2024-07-03  mm: drop page_index and simplify folio_index  [Kairui Song]
There are two helpers for retrieving the index within address space for mixed usage of swap cache and page cache: - page_index - folio_index This commit drops page_index, as we have eliminated all users, and converts folio_index's helper __page_file_index to use folio to avoid the page conversion. Link: https://lkml.kernel.org/r/20240521175854.96038-11-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Anna Schumaker <anna@kernel.org> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chao Yu <chao@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jeff Layton <jlayton@kernel.org> Cc: Marc Dionne <marc.dionne@auristor.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: NeilBrown <neilb@suse.de> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Xiubo Li <xiubli@redhat.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03  mm/swap: get the swap device offset directly  [Kairui Song]
folio_file_pos and page_file_offset are for mixed usage of swap cache and page cache, it can't be page cache here, so introduce a new helper to get the swap offset in swap device directly. Need to include swapops.h in mm/swap.h to ensure swp_offset is always defined before use. Link: https://lkml.kernel.org/r/20240521175854.96038-9-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Anna Schumaker <anna@kernel.org> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chao Yu <chao@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jeff Layton <jlayton@kernel.org> Cc: Marc Dionne <marc.dionne@auristor.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: NeilBrown <neilb@suse.de> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Xiubo Li <xiubli@redhat.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03  writeback: factor out balance_wb_limits to remove repeated code  [Kemeng Shi]
Factor out balance_wb_limits to remove repeated code [shikemeng@huaweicloud.com: add comment] Link: https://lkml.kernel.org/r/20240606033547.344376-1-shikemeng@huaweicloud.com [akpm@linux-foundation.org: s/fileds/fields/ in comment] Link: https://lkml.kernel.org/r/20240514125254.142203-9-shikemeng@huaweicloud.com Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03  writeback: factor out wb_dirty_exceeded to remove repeated code  [Kemeng Shi]
Factor out wb_dirty_exceeded to remove repeated code Link: https://lkml.kernel.org/r/20240514125254.142203-8-shikemeng@huaweicloud.com Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03  writeback: factor out balance_domain_limits to remove repeated code  [Kemeng Shi]
Factor out balance_domain_limits to remove repeated code. Link: https://lkml.kernel.org/r/20240514125254.142203-7-shikemeng@huaweicloud.com Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03  writeback: factor out wb_dirty_freerun to remove more repeated freerun code  [Kemeng Shi]
Factor out wb_dirty_freerun to remove more repeated freerun code. Link: https://lkml.kernel.org/r/20240514125254.142203-6-shikemeng@huaweicloud.com Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03  writeback: factor out code of freerun to remove repeated code  [Kemeng Shi]
Factor out code of freerun into new helper functions domain_poll_intv and domain_dirty_freerun to remove repeated code. Link: https://lkml.kernel.org/r/20240514125254.142203-5-shikemeng@huaweicloud.com Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03  writeback: factor out domain_over_bg_thresh to remove repeated code  [Kemeng Shi]
Factor out domain_over_bg_thresh from wb_over_bg_thresh to remove repeated code. Link: https://lkml.kernel.org/r/20240514125254.142203-4-shikemeng@huaweicloud.com Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03  writeback: add general function domain_dirty_avail to calculate dirty and avail of domain  [Kemeng Shi]
Add the general function domain_dirty_avail to calculate dirty and avail for either the dirty limit or background writeback, in either the global domain or a wb domain. Link: https://lkml.kernel.org/r/20240514125254.142203-3-shikemeng@huaweicloud.com Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03  writeback: factor out wb_bg_dirty_limits to remove repeated code  [Kemeng Shi]
Patch series "Add helper functions to remove repeated code and improve readability of cgroup writeback", v2. This series adds a lot of helpers to remove repeated code between domain and wb; dirty limit and dirty background; global domain and wb domain. The helpers also improve readability. More details can be found in the respective patches. A simple domain hierarchy is tested: global domain (> 20G) | cgroup domain1(10G) | wb1 | fio Test steps: /* make it easy to observe */ echo 300000 > /proc/sys/vm/dirty_expire_centisecs echo 3000 > /proc/sys/vm/dirty_writeback_centisecs /* create cgroup domain */ cd /sys/fs/cgroup echo "+memory +io" > cgroup.subtree_control mkdir group1 cd group1 echo 10G > memory.high echo 10G > memory.max echo $$ > cgroup.procs mkfs.ext4 -F /dev/vdb mount /dev/vdb /bdi1/ /* run fio to generate dirty pages */ fio -name test -filename=/bdi1/file -size=xxx -ioengine=libaio -bs=4K \ -iodepth=1 -rw=write -direct=0 --time_based -runtime=600 -invalidate=0 When fio size is 1G, the wb is in freerun state and dirty pages are only written back when dirty inode is expired after 30 seconds. When fio size is 2G, the dirty pages keep being written back and bandwidth of fio is limited. This patch (of 8): Similar to wb_dirty_limits which calculates dirty and thresh of wb, wb_bg_dirty_limits calculates background dirty and background thresh of wb. With wb_bg_dirty_limits, we could remove repeated code in wb_over_bg_thresh. Link: https://lkml.kernel.org/r/20240514125254.142203-1-shikemeng@huaweicloud.com Link: https://lkml.kernel.org/r/20240514125254.142203-2-shikemeng@huaweicloud.com Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03  mm: vmscan: reset sc->priority on retry  [Shakeel Butt]
The commit 6be5e186fd65 ("mm: vmscan: restore incremental cgroup iteration") added a retry reclaim heuristic to iterate all the cgroups before returning an unsuccessful reclaim but missed to reset the sc->priority. Let's fix it. Link: https://lkml.kernel.org/r/20240529154911.3008025-1-shakeel.butt@linux.dev Fixes: 6be5e186fd65 ("mm: vmscan: restore incremental cgroup iteration") Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Reported-by: syzbot+17416257cb95200cba44@syzkaller.appspotmail.com Tested-by: syzbot+17416257cb95200cba44@syzkaller.appspotmail.com Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03  mm: vmscan: restore incremental cgroup iteration  [Johannes Weiner]
Currently, reclaim always walks the entire cgroup tree in order to ensure fairness between groups. While overreclaim is limited in shrink_lruvec(), many of our systems have a sizable number of active groups, and an even bigger number of idle cgroups with cache left behind by previous jobs; the mere act of walking all these cgroups can impose significant latency on direct reclaimers. In the past, we've used a save-and-restore iterator that enabled incremental tree walks over multiple reclaim invocations. This ensured fairness, while keeping the work of individual reclaimers small. However, in edge cases with a lot of reclaim concurrency, individual reclaimers would sometimes not see enough of the cgroup tree to make forward progress and (prematurely) declare OOM. Consequently we switched to comprehensive walks in 1ba6fc9af35b ("mm: vmscan: do not share cgroup iteration between reclaimers"). To address the latency problem without bringing back the premature OOM issue, reinstate the shared iteration, but with a restart condition to do the full walk in the OOM case - similar to what we do for memory.low enforcement and active page protection. In the worst case, we do one more full tree walk before declaring OOM. But the vast majority of direct reclaim scans can then finish much quicker, while fairness across the tree is maintained: - Before this patch, we observed that direct reclaim always takes more than 100us and most direct reclaim time is spent in reclaim cycles lasting between 1ms and 1 second. Almost 40% of direct reclaim time was spent on reclaim cycles exceeding 100ms. - With this patch, almost all page reclaim cycles last less than 10ms, and a good amount of direct page reclaim finishes in under 100us. No page reclaim cycles lasting over 100ms were observed anymore. The shared iterator state is maintained inside the target cgroup, so fair and incremental walks are performed during both global reclaim and cgroup limit reclaim of complex subtrees. Link: https://lkml.kernel.org/r/20240514202641.2821494-1-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Rik van Riel <riel@surriel.com> Reported-by: Rik van Riel <riel@surriel.com> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Facebook Kernel Team <kernel-team@fb.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03  mm: shmem: use folio_alloc_mpol() in shmem_alloc_folio()  [Kefeng Wang]
Let's change shmem_alloc_folio() to take an order and use the folio_alloc_mpol() helper, then use it directly for normal or large folios to clean up the code. Link: https://lkml.kernel.org/r/20240515070709.78529-5-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03  mm: mempolicy: use folio_alloc_mpol() in alloc_migration_target_by_mpol()  [Kefeng Wang]
Convert to use folio_alloc_mpol() to make vma_alloc_folio_noprof() use folio throughout. Link: https://lkml.kernel.org/r/20240515070709.78529-4-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03  mm: mempolicy: use folio_alloc_mpol_noprof() in vma_alloc_folio_noprof()  [Kefeng Wang]
Convert to use folio_alloc_mpol_noprof() to make vma_alloc_folio_noprof() use folio throughout. Link: https://lkml.kernel.org/r/20240515070709.78529-3-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03  mm: add folio_alloc_mpol()  [Kefeng Wang]
Patch series "mm: convert to folio_alloc_mpol()". This patch (of 4): This adds a new folio_alloc_mpol() like folio_alloc() but allocate folio according to NUMA mempolicy. Link: https://lkml.kernel.org/r/20240515070709.78529-1-wangkefeng.wang@huawei.com Link: https://lkml.kernel.org/r/20240515070709.78529-2-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03  mm/hugetlb: drop node_alloc_noretry from alloc_fresh_hugetlb_folio  [Oscar Salvador]
Since commit d67e32f26713 ("hugetlb: restructure pool allocations"), the parameter node_alloc_noretry from alloc_fresh_hugetlb_folio() is not used, so drop it. Link: https://lkml.kernel.org/r/20240516081035.5651-1-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03  mm/vmscan: update stale references to shrink_page_list  [Illia Ostapyshyn]
Commit 49fd9b6df54e ("mm/vmscan: fix a lot of comments") renamed shrink_page_list() to shrink_folio_list(). Fix up the remaining references to the old name in comments and documentation. Link: https://lkml.kernel.org/r/20240517091348.1185566-1-illia@yshyn.com Signed-off-by: Illia Ostapyshyn <illia@yshyn.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03  mm/hugetlb: constify ctl_table arguments of utility functions  [Thomas Weißschuh]
The sysctl core is preparing to only expose instances of struct ctl_table as "const". This will also affect the ctl_table argument of sysctl handlers. As the function prototype of all sysctl handlers throughout the tree needs to stay consistent that change will be done in one commit. To reduce the size of that final commit, switch utility functions which are not bound by "typedef proc_handler" to "const struct ctl_table". No functional change. Link: https://lkml.kernel.org/r/20240518-sysctl-const-handler-hugetlb-v1-1-47e34e2871b2@weissschuh.net Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Reviewed-by: Muchun Song <muchun.song@linux.dev> Cc: Joel Granados <j.granados@samsung.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03  mm: avoid overflows in dirty throttling logic  [Jan Kara]
The dirty throttling logic is interspersed with assumptions that dirty limits in PAGE_SIZE units fit into 32 bits (so that various multiplications fit into 64 bits). If limits end up being larger, we will hit overflows, possible divisions by 0 etc. Fix these problems by never allowing such large dirty limits, as they have dubious practical value anyway. For the dirty_bytes / dirty_background_bytes interfaces we can just refuse to set such large limits. For dirty_ratio / dirty_background_ratio it isn't so simple, as the dirty limit is computed from the amount of available memory, which can change due to memory hotplug etc. So when converting dirty limits from ratios to numbers of pages, we just don't allow the result to exceed UINT_MAX. This is a root-only triggerable problem which occurs when the operator sets dirty limits to >16 TB. Link: https://lkml.kernel.org/r/20240621144246.11148-2-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reported-by: Zach O'Keefe <zokeefe@google.com> Reviewed-By: Zach O'Keefe <zokeefe@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
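The ratio-to-pages clamp described amounts to something like this in domain_dirty_limits(); variable names follow the existing code, the exact placement is an assumption:

  /* dirty limits in pages must fit in 32 bits so that the later
   * limit * ratio products still fit in 64 bits */
  if (thresh > UINT_MAX)
          thresh = UINT_MAX;
  if (bg_thresh > UINT_MAX)
          bg_thresh = UINT_MAX;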
2024-07-03Revert "mm/writeback: fix possible divide-by-zero in wb_dirty_limits(), again"Jan Kara
Patch series "mm: Avoid possible overflows in dirty throttling". Dirty throttling logic assumes dirty limits in page units fit into 32-bits. This patch series makes sure this is true (see patch 2/2 for more details). This patch (of 2): This reverts commit 9319b647902cbd5cc884ac08a8a6d54ce111fc78. The commit is broken in several ways. Firstly, the removed (u64) cast from the multiplication will introduce a multiplication overflow on 32-bit archs if wb_thresh * bg_thresh >= 1<<32 (which is actually common - the default settings with 4GB of RAM will trigger this). Secondly, the div64_u64() is unnecessarily expensive on 32-bit archs. We have div64_ul() in case we want to be safe & cheap. Thirdly, if dirty thresholds are larger than 1<<32 pages, then dirty balancing is going to blow up in many other spectacular ways anyway so trying to fix one possible overflow is just moot. Link: https://lkml.kernel.org/r/20240621144017.30993-1-jack@suse.cz Link: https://lkml.kernel.org/r/20240621144246.11148-1-jack@suse.cz Fixes: 9319b647902c ("mm/writeback: fix possible divide-by-zero in wb_dirty_limits(), again") Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-By: Zach O'Keefe <zokeefe@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03  mm/util: Use dedicated slab buckets for memdup_user()  [Kees Cook]
Both memdup_user() and vmemdup_user() handle allocations that are regularly used for exploiting use-after-free type confusion flaws in the kernel (e.g. prctl() PR_SET_VMA_ANON_NAME[1] and setxattr[2][3][4] respectively). Since both are designed for contents coming from userspace, they allow for userspace-controlled allocation sizes. Use a dedicated set of kmalloc buckets so these allocations do not share caches with the global kmalloc buckets. After a fresh boot under Ubuntu 23.10, we can see the caches are already in active use:
  # grep ^memdup /proc/slabinfo
  memdup_user-8k         4      4   8192    4    8 : ...
  memdup_user-4k         8      8   4096    8    8 : ...
  memdup_user-2k        16     16   2048   16    8 : ...
  memdup_user-1k         0      0   1024   16    4 : ...
  memdup_user-512        0      0    512   16    2 : ...
  memdup_user-256        0      0    256   16    1 : ...
  memdup_user-128        0      0    128   32    1 : ...
  memdup_user-64       256    256     64   64    1 : ...
  memdup_user-32       512    512     32  128    1 : ...
  memdup_user-16      1024   1024     16  256    1 : ...
  memdup_user-8       2048   2048      8  512    1 : ...
  memdup_user-192        0      0    192   21    1 : ...
  memdup_user-96       168    168     96   42    1 : ...
Link: https://starlabs.sg/blog/2023/07-prctl-anon_vma_name-an-amusing-heap-spray/ [1] Link: https://duasynt.com/blog/linux-kernel-heap-spray [2] Link: https://etenal.me/archives/1336 [3] Link: https://github.com/a13xp0p0v/kernel-hack-drill/blob/master/drill_exploit_uaf.c [4] Signed-off-by: Kees Cook <kees@kernel.org> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
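In terms of the API introduced later in this series, the wiring might look like this; the init hook, flags, and the exact kmem_buckets_create() argument order are assumptions, while the cache-set name matches the slabinfo output above:

  static kmem_buckets *user_buckets __ro_after_init;

  static int __init init_user_buckets(void)
  {
          /* a dedicated set of size buckets, never merged with kmalloc-* */
          user_buckets = kmem_buckets_create("memdup_user", 0, 0, INT_MAX, NULL);
          return 0;
  }
  subsys_initcall(init_user_buckets);

  void *memdup_user(const void __user *src, size_t len)
  {
          void *p = kmem_buckets_alloc_track_caller(user_buckets, len,
                                                    GFP_USER | __GFP_NOWARN);

          if (!p)
                  return ERR_PTR(-ENOMEM);
          if (copy_from_user(p, src, len)) {
                  kfree(p);
                  return ERR_PTR(-EFAULT);
          }
          return p;
  }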
2024-07-03  mm/slab: Introduce kmem_buckets_create() and family  [Kees Cook]
Dedicated caches are available for fixed size allocations via kmem_cache_alloc(), but for dynamically sized allocations there is only the global kmalloc API's set of buckets available. This means it isn't possible to separate specific sets of dynamically sized allocations into a separate collection of caches. This leads to a use-after-free exploitation weakness in the Linux kernel since many heap memory spraying/grooming attacks depend on using userspace-controllable dynamically sized allocations to collide with fixed size allocations that end up in same cache. While CONFIG_RANDOM_KMALLOC_CACHES provides a probabilistic defense against these kinds of "type confusion" attacks, including for fixed same-size heap objects, we can create a complementary deterministic defense for dynamically sized allocations that are directly user controlled. Addressing these cases is limited in scope, so isolating these kinds of interfaces will not become an unbounded game of whack-a-mole. For example, many pass through memdup_user(), making isolation there very effective. In order to isolate user-controllable dynamically-sized allocations from the common system kmalloc allocations, introduce kmem_buckets_create(), which behaves like kmem_cache_create(). Introduce kmem_buckets_alloc(), which behaves like kmem_cache_alloc(). Introduce kmem_buckets_alloc_track_caller() for where caller tracking is needed. Introduce kmem_buckets_valloc() for cases where vmalloc fallback is needed. Note that these caches are specifically flagged with SLAB_NO_MERGE, since merging would defeat the entire purpose of the mitigation. This can also be used in the future to extend allocation profiling's use of code tagging to implement per-caller allocation cache isolation[1] even for dynamic allocations. Memory allocation pinning[2] is still needed to plug the Use-After-Free cross-allocator weakness (where attackers can arrange to free an entire slab page and have it reallocated to a different cache), but that is an existing and separate issue which is complementary to this improvement. Development continues for that feature via the SLAB_VIRTUAL[3] series (which could also provide guard pages -- another complementary improvement). Link: https://lore.kernel.org/lkml/202402211449.401382D2AF@keescook [1] Link: https://googleprojectzero.blogspot.com/2021/10/how-simple-linux-kernel-memory.html [2] Link: https://lore.kernel.org/lkml/20230915105933.495735-1-matteorizzo@google.com/ [3] Signed-off-by: Kees Cook <kees@kernel.org> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
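Usage, as described, mirrors the kmem_cache API; a hypothetical consumer might look like this (the consumer name and the argument order of kmem_buckets_create() are assumptions based on the text):

  static kmem_buckets *attr_buckets __ro_after_init;

  static int __init attr_buckets_init(void)
  {
          /* like kmem_cache_create(), but creates a whole set of size
           * buckets, each flagged SLAB_NO_MERGE */
          attr_buckets = kmem_buckets_create("xattr_value", 0, 0, 0, NULL);
          return attr_buckets ? 0 : -ENOMEM;
  }

  static void *copy_attr(const void *src, size_t len)
  {
          /* like kmalloc(), but allocations can only land in attr_buckets */
          void *buf = kmem_buckets_alloc(attr_buckets, len, GFP_KERNEL);

          if (buf)
                  memcpy(buf, src, len);
          return buf;
  }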
2024-07-03  mm/slab: Introduce kvmalloc_buckets_node() that can take kmem_buckets argument  [Kees Cook]
Plumb kmem_buckets arguments through kvmalloc_node_noprof() so it is possible to provide an API to perform kvmalloc-style allocations with a particular set of buckets. Introduce kvmalloc_buckets_node() that takes a kmem_buckets argument. Signed-off-by: Kees Cook <kees@kernel.org> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2024-07-03  mm/slab: Plumb kmem_buckets into __do_kmalloc_node()  [Kees Cook]
Introduce CONFIG_SLAB_BUCKETS which provides the infrastructure to support separated kmalloc buckets (in the following kmem_buckets_create() patches and future codetag-based separation). Since this will provide a mitigation for a very common case of exploits, it is recommended to enable this feature for general purpose distros. By default, the new Kconfig will be enabled if CONFIG_SLAB_FREELIST_HARDENED is enabled (and it is added to the hardening.config Kconfig fragment). To be able to choose which buckets to allocate from, make the buckets available to the internal kmalloc interfaces by adding them as the second argument, rather than depending on the buckets being chosen from the fixed set of global buckets. Where the bucket is not available, pass NULL, which means "use the default system kmalloc bucket set" (the prior existing behavior), as implemented in kmalloc_slab(). To avoid adding the extra argument when !CONFIG_SLAB_BUCKETS, only the top-level macros and static inlines use the buckets argument (where they are stripped out and compiled out respectively). The actual extern functions can then be built without the argument, and the internals fall back to the global kmalloc buckets unconditionally. Co-developed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Kees Cook <kees@kernel.org> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2024-07-03  mm/slab: Introduce kmem_buckets typedef  [Kees Cook]
Encapsulate the concept of a single set of kmem_caches that are used for the kmalloc size buckets. Redefine kmalloc_caches as an array of these buckets (for the different global cache buckets). Signed-off-by: Kees Cook <kees@kernel.org> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2024-07-03  slab, rust: extend kmalloc() alignment guarantees to remove Rust padding  [Vlastimil Babka]
Slab allocators have been guaranteeing natural alignment for power-of-two sizes since commit 59bb47985c1d ("mm, sl[aou]b: guarantee natural alignment for kmalloc(power-of-two)"), while any other sizes are guaranteed to be aligned only to ARCH_KMALLOC_MINALIGN bytes (although in practice are aligned more than that in non-debug scenarios). Rust's allocator API specifies size and alignment per allocation, which have to satisfy the following rules, per Alice Ryhl [1]: 1. The alignment is a power of two. 2. The size is non-zero. 3. When you round up the size to the next multiple of the alignment, then it must not overflow the signed type isize / ssize_t. In order to map this to kmalloc()'s guarantees, some requested allocation sizes have to be padded to the next power-of-two size [2]. For example, an allocation of size 96 and alignment of 32 will be padded to an allocation of size 128, because the existing kmalloc-96 bucket doesn't guarantee alignment above ARCH_KMALLOC_MINALIGN. Without slab debugging active, the layout of the kmalloc-96 slabs however naturally aligns the objects to 32 bytes, so extending the size to 128 bytes is wasteful. To improve the situation we can extend the kmalloc() alignment guarantees in a way that 1) doesn't change the current slab layout (and thus does not increase internal fragmentation) when slab debugging is not active 2) reduces waste in the Rust allocator use case 3) is a superset of the current guarantee for power-of-two sizes. The extended guarantee is that alignment is at least the largest power-of-two divisor of the requested size. For power-of-two sizes the largest divisor is the size itself, but let's keep this case documented separately for clarity. For current kmalloc size buckets, it means kmalloc-96 will guarantee alignment of 32 bytes and kmalloc-192 will guarantee 64 bytes. This covers the rules 1 and 2 above of Rust's API as long as the size is a multiple of the alignment. The Rust layer should now only need to round up the size to the next multiple if it isn't, while enforcing the rule 3. Implementation-wise, this changes the alignment calculation in create_boot_cache(). While at it also do the calculation only for caches with the SLAB_KMALLOC flag, because the function is also used to create the initial kmem_cache and kmem_cache_node caches, where no alignment guarantee is necessary. In the Rust allocator's krealloc_aligned(), remove the code that padded sizes to the next power of two (suggested by Alice Ryhl) as it's no longer necessary with the new guarantees. Reported-by: Alice Ryhl <aliceryhl@google.com> Reported-by: Boqun Feng <boqun.feng@gmail.com> Link: https://lore.kernel.org/all/CAH5fLggjrbdUuT-H-5vbQfMazjRDpp2%2Bk3%3DYhPyS17ezEqxwcw@mail.gmail.com/ [1] Link: https://lore.kernel.org/all/CAH5fLghsZRemYUwVvhk77o6y1foqnCeDzW4WZv6ScEWna2+_jw@mail.gmail.com/ [2] Reviewed-by: Boqun Feng <boqun.feng@gmail.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Reviewed-by: Alice Ryhl <aliceryhl@google.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>