summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)Author
2025-01-13mm: remove the non-useful else after a break in a if statementKeren Sun
Remove the else block since there is already a break in the statement of if (iter->oom_lock), just set iter->oom_lock true after the if block ends. Link: https://lkml.kernel.org/r/20241115235744.1419580-4-kerensun@google.com Signed-off-by: Keren Sun <kerensun@google.com> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-13mm: remove unnecessary whitespace before a quoted newlineKeren Sun
Remove whitespaces before newlines for strings in pr_warn_once() Link: https://lkml.kernel.org/r/20241115235744.1419580-3-kerensun@google.com Signed-off-by: Keren Sun <kerensun@google.com> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-13mm: prefer 'unsigned int' to bare use of 'unsigned'Keren Sun
Patch series "mm: fix format issues and param types" Change the param 'mode' from type 'unsigned' to 'unsigned int' in memcg_event_wake() and memcg_oom_wake_function(), and for the param 'nid' in VM_BUG_ON(). Link: https://lkml.kernel.org/r/20241115235744.1419580-2-kerensun@google.com Signed-off-by: Keren Sun <kerensun@google.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-13filemap: remove unused folio_add_wait_queueDr. David Alan Gilbert
folio_add_wait_queue() has been unused since 2021's commit 850cba069c26 ("cachefiles: Delete the cachefiles driver pending rewrite") Remove it. Link: https://lkml.kernel.org/r/20241116151446.95555-1-linux@treblig.org Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-13mm/rodata_test: verify test data is unchanged, rather than non-zeroPetr Tesarik
Verify that the test variable holds the initialization value, rather than any non-zero value. Link: https://lkml.kernel.org/r/386ffda192eb4a26f68c526c496afd48a5cd87ce.1732016064.git.ptesarik@suse.com Signed-off-by: Petr Tesarik <ptesarik@suse.com> Reviewed-by: Kees Cook <kees@kernel.org> Cc: Jinbum Park <jinb.park7@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-13mm/rodata_test: use READ_ONCE() to read const variablePetr Tesarik
Patch series "Fix mm/rodata_test", v2. Make sure that the test actually reads the read-only memory location. Verify that the variable contains the expected value rather than any non-zero value. This patch (of 2): The C compiler may optimize away the memory read of a const variable if its value is known at compile time. In particular, GCC14 with -O2 generates no code at all for test 1, and it generates the following x86_64 instructions for test 3: cmpl $195, 4(%rsp) je .L14 That is, it replaces the read of rodata_test_data with an immediate value and compares it to the value of the local variable "zero". Use READ_ONCE() to undo any such compiler optimizations and enforce a memory read. Link: https://lkml.kernel.org/r/cover.1732016064.git.ptesarik@suse.com Link: https://lkml.kernel.org/r/2a66dee010151b25cb143efb39091ef7530aa00a.1732016064.git.ptesarik@suse.com Fixes: 2959a5f726f6 ("mm: add arch-independent testcases for RODATA") Signed-off-by: Petr Tesarik <ptesarik@suse.com> Reviewed-by: Kees Cook <kees@kernel.org> Cc: Jinbum Park <jinb.park7@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-13mm: shmem: add a kernel command line to change the default huge policy for tmpfsBaolin Wang
Now the tmpfs can allow to allocate any sized large folios, and the default huge policy is still preferred to be 'never'. Due to tmpfs not behaving like other file systems in some cases as previously explained by David[1]: : I think I raised this in the past, but tmpfs/shmem is just like any : other file system .. except it sometimes really isn't and behaves much : more like (swappable) anonymous memory. (or mlocked files) : : There are many systems out there that run without swap enabled, or with : extremely minimal swap (IIRC until recently kubernetes was completely : incompatible with swapping). Swap can even be disabled today for shmem : using a mount option. : : That's a big difference to all other file systems where you are : guaranteed to have backend storage where you can simply evict under : memory pressure (might temporarily fail, of course). : : I *think* that's the reason why we have the "huge=" parameter that also : controls the THP allocations during page faults (IOW possible memory : over-allocation). Maybe also because it was a new feature, and we only : had a single THP size. Thus adding a new command line to change the default huge policy will be helpful to use the large folios for tmpfs, which is similar to the 'transparent_hugepage_shmem' cmdline for shmem. [1] https://lore.kernel.org/all/cbadd5fe-69d5-4c21-8eb8-3344ed36c721@redhat.com/ Link: https://lkml.kernel.org/r/ff390b2656f0d39649547f8f2cbb30fcb7e7be2d.1732779148.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Daniel Gomez <da.gomez@samsung.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-13mm: shmem: add large folio support for tmpfsBaolin Wang
Add large folio support for tmpfs write and fallocate paths matching the same high order preference mechanism used in the iomap buffered IO path as used in __filemap_get_folio(). Add shmem_mapping_size_orders() to get a hint for the orders of the folio based on the file size which takes care of the mapping requirements. Traditionally, tmpfs only supported PMD-sized large folios. However nowadays with other file systems supporting any sized large folios, and extending anonymous to support mTHP, we should not restrict tmpfs to allocating only PMD-sized large folios, making it more special. Instead, we should allow tmpfs can allocate any sized large folios. Considering that tmpfs already has the 'huge=' option to control the PMD-sized large folios allocation, we can extend the 'huge=' option to allow any sized large folios. The semantics of the 'huge=' mount option are: huge=never: no any sized large folios huge=always: any sized large folios huge=within_size: like 'always' but respect the i_size huge=advise: like 'always' if requested with madvise() Note: for tmpfs mmap() faults, due to the lack of a write size hint, still allocate the PMD-sized huge folios if huge=always/within_size/advise is set. Moreover, the 'deny' and 'force' testing options controlled by '/sys/kernel/mm/transparent_hugepage/shmem_enabled', still retain the same semantics. The 'deny' can disable any sized large folios for tmpfs, while the 'force' can enable PMD sized large folios for tmpfs. Link: https://lkml.kernel.org/r/035bf55fbdebeff65f5cb2cdb9907b7d632c3228.1732779148.git.baolin.wang@linux.alibaba.com Co-developed-by: Daniel Gomez <da.gomez@samsung.com> Signed-off-by: Daniel Gomez <da.gomez@samsung.com> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-13mm: shmem: change shmem_huge_global_enabled() to return huge order bitmapBaolin Wang
Change the shmem_huge_global_enabled() to return the suitable huge order bitmap, and return 0 if huge pages are not allowed. This is a preparation for supporting various huge orders allocation of tmpfs in the following patches. No functional changes. Link: https://lkml.kernel.org/r/9dce1cfad3e9c1587cf1a0ea782ddbebd0e92984.1732779148.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Daniel Gomez <da.gomez@samsung.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-13kasan: make kasan_record_aux_stack_noalloc() the default behaviourPeter Zijlstra
kasan_record_aux_stack_noalloc() was introduced to record a stack trace without allocating memory in the process. It has been added to callers which were invoked while a raw_spinlock_t was held. More and more callers were identified and changed over time. Is it a good thing to have this while functions try their best to do a locklessly setup? The only downside of having kasan_record_aux_stack() not allocate any memory is that we end up without a stacktrace if stackdepot runs out of memory and at the same stacktrace was not recorded before To quote Marco Elver from https://lore.kernel.org/all/CANpmjNPmQYJ7pv1N3cuU8cP18u7PP_uoZD8YxwZd4jtbof9nVQ@mail.gmail.com/ | I'd be in favor, it simplifies things. And stack depot should be | able to replenish its pool sufficiently in the "non-aux" cases | i.e. regular allocations. Worst case we fail to record some | aux stacks, but I think that's only really bad if there's a bug | around one of these allocations. In general the probabilities | of this being a regression are extremely small [...] Make the kasan_record_aux_stack_noalloc() behaviour default as kasan_record_aux_stack(). [bigeasy@linutronix.de: dressed the diff as patch] Link: https://lkml.kernel.org/r/20241122155451.Mb2pmeyJ@linutronix.de Fixes: 7cb3007ce2da ("kasan: generic: introduce kasan_record_aux_stack_noalloc()") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reported-by: syzbot+39f85d612b7c20d8db48@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/67275485.050a0220.3c8d68.0a37.GAE@google.com Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Reviewed-by: Marco Elver <elver@google.com> Reviewed-by: Waiman Long <longman@redhat.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Ben Segall <bsegall@google.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: <kasan-dev@googlegroups.com> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Neeraj Upadhyay <neeraj.upadhyay@kernel.org> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: syzkaller-bugs@googlegroups.com Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zqiang <qiang.zhang1211@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-13mm/memory: fix a comment typo in lock_mm_and_find_vma()Chin Yik Ming
s/equivalend/equivalent/ Link: https://lkml.kernel.org/r/20241120105041.2394283-1-yikming2222@gmail.com Signed-off-by: Chin Yik Ming <yikming2222@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-13mm: change type of cma_area_count to unsigned intJiale Yang
Prefer 'unsigned int' over plain 'unsigned'. Also make it consistent with mm/cma.c Link: https://lkml.kernel.org/r/tencent_1E5E3AA25C261196D8C1F7097F130E382008@qq.com Signed-off-by: Jiale Yang <295107659@qq.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-13mm/hugetlb_cgroup: avoid useless return in void functionPintu Kumar
The return statement at the end of void function is unnecessary. Just remove it as part of cleanup. Link: https://lkml.kernel.org/r/20241122173558.20670-1-quic_pintu@quicinc.com Signed-off-by: Pintu Kumar <quic_pintu@quicinc.com> Cc: Pintu Agarwal <pintu.ping@gmail.com> Cc: Shuah Khan <skhan@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-13mm: mmap_lock: optimize mmap_lock tracepointsShakeel Butt
We are starting to deploy mmap_lock tracepoint monitoring across our fleet and the early results showed that these tracepoints are consuming significant amount of CPUs in kernfs_path_from_node when enabled. It seems like the kernel is trying to resolve the cgroup path in the fast path of the locking code path when the tracepoints are enabled. In addition for some application their metrics are regressing when monitoring is enabled. The cgroup path resolution can be slow and should not be done in the fast path. Most userspace tools, like bpftrace, provides functionality to get the cgroup path from cgroup id, so let's just trace the cgroup id and the users can use better tools to get the path in the slow path. Link: https://lkml.kernel.org/r/20241125171617.113892-1-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Reviewed-by: Axel Rasmussen <axelrasmussen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-13mm/damon/core: remove duplicate list_empty quota->goals checkHonggyu Kim
damos_set_effective_quota() checks quota contidions but there are some duplicate checks for quota->goals inside. This patch reduces one of if statement to simplify the esz calculation logic by setting esz as ULONG_MAX by default. Link: https://lkml.kernel.org/r/20241125184307.41746-1-sj@kernel.org Signed-off-by: Honggyu Kim <honggyu.kim@sk.com> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-13slab: allocate frozen pagesMatthew Wilcox (Oracle)
Since slab does not use the page refcount, it can allocate and free frozen pages, saving one atomic operation per free. Link: https://lkml.kernel.org/r/20241125210149.2976098-16-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-13mm/mempolicy: add alloc_frozen_pages()Matthew Wilcox (Oracle)
Provide an interface to allocate pages from the page allocator without incrementing their refcount. This saves an atomic operation on free, which may be beneficial to some users (eg slab). Link: https://lkml.kernel.org/r/20241125210149.2976098-15-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-13mm/page_alloc: add __alloc_frozen_pages()Matthew Wilcox (Oracle)
Defer the initialisation of the page refcount to the new __alloc_pages() wrapper and turn the old __alloc_pages() into __alloc_frozen_pages(). Link: https://lkml.kernel.org/r/20241125210149.2976098-14-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-13mm/page_alloc: move set_page_refcounted() to end of __alloc_pages()Matthew Wilcox (Oracle)
Remove some code duplication by calling set_page_refcounted() at the end of __alloc_pages() instead of after each call that can allocate a page. That means that we free a frozen page if we've exceeded the allowed memcg memory. Link: https://lkml.kernel.org/r/20241125210149.2976098-13-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-13mm/page_alloc: move set_page_refcounted() to callers of __alloc_pages_slowpath()Matthew Wilcox (Oracle)
In preparation for allocating frozen pages, stop initialising the page refcount in __alloc_pages_slowpath(). Link: https://lkml.kernel.org/r/20241125210149.2976098-12-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-13mm/page_alloc: move set_page_refcounted() to callers of ↵Matthew Wilcox (Oracle)
__alloc_pages_direct_reclaim() In preparation for allocating frozen pages, stop initialising the page refcount in __alloc_pages_direct_reclaim(). Link: https://lkml.kernel.org/r/20241125210149.2976098-11-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-13mm/page_alloc: move set_page_refcounted() to callers of ↵Matthew Wilcox (Oracle)
__alloc_pages_direct_compact() In preparation for allocating frozen pages, stop initialising the page refcount in __alloc_pages_direct_compact(). Link: https://lkml.kernel.org/r/20241125210149.2976098-10-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-13mm/page_alloc: move set_page_refcounted() to callers of __alloc_pages_may_oom()Matthew Wilcox (Oracle)
In preparation for allocating frozen pages, stop initialising the page refcount in __alloc_pages_may_oom(). Link: https://lkml.kernel.org/r/20241125210149.2976098-9-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-13mm/page_alloc: move set_page_refcounted() to callers of ↵Matthew Wilcox (Oracle)
__alloc_pages_cpuset_fallback() In preparation for allocating frozen pages, stop initialising the page refcount in __alloc_pages_cpuset_fallback(). Link: https://lkml.kernel.org/r/20241125210149.2976098-8-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-13mm/page_alloc: move set_page_refcounted() to callers of get_page_from_freelist()Matthew Wilcox (Oracle)
In preparation for allocating frozen pages, stop initialising the page refcount in get_page_from_freelist(). Link: https://lkml.kernel.org/r/20241125210149.2976098-7-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-13mm/page_alloc: move set_page_refcounted() to callers of prep_new_page()Matthew Wilcox (Oracle)
In preparation for allocating frozen pages, stop initialising the page refcount in prep_new_page(). Link: https://lkml.kernel.org/r/20241125210149.2976098-6-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-13mm/page_alloc: move set_page_refcounted() to callers of post_alloc_hook()Matthew Wilcox (Oracle)
In preparation for allocating frozen pages, stop initialising the page refcount in post_alloc_hook(). Link: https://lkml.kernel.org/r/20241125210149.2976098-5-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Muchun Song <songmuchun@bytedance.com> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-13mm/page_alloc: export free_frozen_pages() instead of free_unref_page()Matthew Wilcox (Oracle)
We already have the concept of "frozen pages" (eg page_ref_freeze()), so let's not complicate things by also having the concept of "unref pages". Link: https://lkml.kernel.org/r/20241125210149.2976098-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-13mm: make alloc_pages_mpol() staticMatthew Wilcox (Oracle)
All callers outside mempolicy.c now use folio_alloc_mpol() thanks to Kefeng's cleanups, so we can remove this as a visible symbol. And also remove the alloc_hooks for alloc_pages_mpol(), since all users in mempolicy.c are using the nonprof version. Link: https://lkml.kernel.org/r/20241125210149.2976098-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-13mm/page_alloc: cache page_zone() result in free_unref_page()Matthew Wilcox (Oracle)
Patch series "Allocate and free frozen pages", v3. Slab does not need to use the page refcount at all, and it can avoid an atomic operation on page free. Hugetlb wants to delay setting the refcount until it has assembled a complete gigantic page. We already have the ability to freeze a page (safely reduce its reference count to 0), so this patchset adds APIs to allocate and free pages which are in a frozen state. This patchset is also a step towards the Glorious Future in which struct page doesn't have a refcount; the users which need a refcount will have one in their per-allocation memdesc. This patch (of 15): Save 17 bytes of text by calculating page_zone() once instead of twice. Link: https://lkml.kernel.org/r/20241125210149.2976098-1-willy@infradead.org Link: https://lkml.kernel.org/r/20241125210149.2976098-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-13mm: migrate: remove unused argument vma from migrate_misplaced_folio()Donet Tom
Commit ee86814b0562 ("mm/migrate: move NUMA hinting fault folio isolation + checks under PTL") removed the code that had used the vma argument in migrate_misplaced_folio. Since the vma argument was no longer used in migrate_misplaced_folio, this patch removes it. Link: https://lkml.kernel.org/r/20241126155655.466186-1-donettom@linux.ibm.com Signed-off-by: Donet Tom <donettom@linux.ibm.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-13mm/zswap: add LRU_STOP to comment about dropping the lru lockAlice Ryhl
This function has been able to return LRU_STOP since commit b49547ade38a ("mm/zswap: stop lru list shrinking when encounter warm region"). To reduce confusion, update the comment to also list LRU_STOP as an option. Link: https://lkml.kernel.org/r/20241127-lru-stop-comment-v1-1-f54a7cba9429@google.com Signed-off-by: Alice Ryhl <aliceryhl@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-13mm/slab: fix kernel-doc func param namesRandy Dunlap
Use corrected function parameter names to eliminate kernel-doc warnings: slab.h:142: warning: Function parameter or struct member 's' not described in 'slab_folio' slab.h:142: warning: Excess function parameter 'slab' description in 'slab_folio' slab.h:168: warning: Function parameter or struct member 's' not described in 'slab_page' slab.h:168: warning: Excess function parameter 'slab' description in 'slab_page' Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Andrew Morton <akpm@linux-foundation.org> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2025-01-12mm: kmemleak: convert timeouts to secs_to_jiffies()Easwar Hariharan
Commit b35108a51cf7 ("jiffies: Define secs_to_jiffies()") introduced secs_to_jiffies(). As the value here is a multiple of 1000, use secs_to_jiffies() instead of msecs_to_jiffies to avoid the multiplication. This is converted using scripts/coccinelle/misc/secs_to_jiffies.cocci with the following Coccinelle rules: @@ constant C; @@ - msecs_to_jiffies(C * 1000) + secs_to_jiffies(C) @@ constant C; @@ - msecs_to_jiffies(C * MSEC_PER_SEC) + secs_to_jiffies(C) Link: https://lkml.kernel.org/r/20241210-converge-secs-to-jiffies-v3-6-ddfefd7e9f2a@linux.microsoft.com Signed-off-by: Easwar Hariharan <eahariha@linux.microsoft.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andrew Lunn <andrew+netdev@lunn.ch> Cc: Anna-Maria Behnsen <anna-maria@linutronix.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Daniel Mack <daniel@zonque.org> Cc: David Airlie <airlied@gmail.com> Cc: David S. Miller <davem@davemloft.net> Cc: Dick Kennedy <dick.kennedy@broadcom.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Florian Fainelli <florian.fainelli@broadcom.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Haojian Zhuang <haojian.zhuang@gmail.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jack Wang <jinpu.wang@cloud.ionos.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: James Smart <james.smart@broadcom.com> Cc: Jaroslav Kysela <perex@perex.cz> Cc: Jeff Johnson <jjohnson@kernel.org> Cc: Jeff Johnson <quic_jjohnson@quicinc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jeroen de Borst <jeroendb@google.com> Cc: Jiri Kosina <jikos@kernel.org> Cc: Joe Lawrence <joe.lawrence@redhat.com> Cc: Johan Hedberg <johan.hedberg@gmail.com> Cc: Josh Poimboeuf <jpoimboe@kernel.org> Cc: Jozsef Kadlecsik <kadlec@netfilter.org> Cc: Julia Lawall <julia.lawall@inria.fr> Cc: Kalle Valo <kvalo@kernel.org> Cc: Louis Peens <louis.peens@corigine.com> Cc: Lucas De Marchi <lucas.demarchi@intel.com> Cc: Luiz Augusto von Dentz <luiz.dentz@gmail.com> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Marcel Holtmann <marcel@holtmann.org> Cc: Martin K. Petersen <martin.petersen@oracle.com> Cc: Maxime Ripard <mripard@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Miroslav Benes <mbenes@suse.cz> Cc: Naveen N Rao <naveen@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Nicolas Palix <nicolas.palix@imag.fr> Cc: Oded Gabbay <ogabbay@kernel.org> Cc: Ofir Bitton <obitton@habana.ai> Cc: Pablo Neira Ayuso <pablo@netfilter.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Praveen Kaligineedi <pkaligineedi@google.com> Cc: Ray Jui <rjui@broadcom.com> Cc: Robert Jarzmik <robert.jarzmik@free.fr> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Roger Pau Monné <roger.pau@citrix.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Scott Branden <sbranden@broadcom.com> Cc: Shailend Chand <shailend@google.com> Cc: Simona Vetter <simona@ffwll.ch> Cc: Simon Horman <horms@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Takashi Iwai <tiwai@suse.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Thomas Zimmermann <tzimmermann@suse.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Xiubo Li <xiubli@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-12mm: fix assertion in folio_end_read()Matthew Wilcox (Oracle)
We only need to assert that the uptodate flag is clear if we're going to set it. This hasn't been a problem before now because we have only used folio_end_read() when completing with an error, but it's convenient to use it in squashfs if we discover the folio is already uptodate. Link: https://lkml.kernel.org/r/20250110163300.3346321-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Phillip Lougher <phillip@squashfs.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-12mm: vmscan : pgdemote vmstat is not getting updated when MGLRU is enabled.Donet Tom
When MGLRU is enabled, the pgdemote_kswapd, pgdemote_direct, and pgdemote_khugepaged stats in vmstat are not being updated. Commit f77f0c751478 ("mm,memcg: provide per-cgroup counters for NUMA balancing operations") moved the pgdemote vmstat update from demote_folio_list() to shrink_inactive_list(), which is in the normal LRU path. As a result, the pgdemote stats are updated correctly for the normal LRU but not for MGLRU. To address this, we have added the pgdemote stat update in the evict_folios() function, which is in the MGLRU path. With this patch, the pgdemote stats will now be updated correctly when MGLRU is enabled. Without this patch vmstat output when MGLRU is enabled ====================================================== pgdemote_kswapd 0 pgdemote_direct 0 pgdemote_khugepaged 0 With this patch vmstat output when MGLRU is enabled =================================================== pgdemote_kswapd 43234 pgdemote_direct 4691 pgdemote_khugepaged 0 Link: https://lkml.kernel.org/r/20250109060540.451261-1-donettom@linux.ibm.com Fixes: f77f0c751478 ("mm,memcg: provide per-cgroup counters for NUMA balancing operations") Signed-off-by: Donet Tom <donettom@linux.ibm.com> Acked-by: Yu Zhao <yuzhao@google.com> Tested-by: Li Zhijian <lizhijian@fujitsu.com> Reviewed-by: Li Zhijian <lizhijian@fujitsu.com> Cc: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kaiyang Zhao <kaiyang2@cs.cmu.edu> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Wei Xu <weixugc@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-12vmstat: disable vmstat_work on vmstat_cpu_down_prep()Koichiro Den
The upstream commit adcfb264c3ed ("vmstat: disable vmstat_work on vmstat_cpu_down_prep()") introduced another warning during the boot phase so was soon reverted on upstream by commit cd6313beaeae ("Revert "vmstat: disable vmstat_work on vmstat_cpu_down_prep()""). This commit resolves it and reattempts the original fix. Even after mm/vmstat:online teardown, shepherd may still queue work for the dying cpu until the cpu is removed from online mask. While it's quite rare, this means that after unbind_workers() unbinds a per-cpu kworker, it potentially runs vmstat_update for the dying CPU on an irrelevant cpu before entering atomic AP states. When CONFIG_DEBUG_PREEMPT=y, it results in the following error with the backtrace. BUG: using smp_processor_id() in preemptible [00000000] code: \ kworker/7:3/1702 caller is refresh_cpu_vm_stats+0x235/0x5f0 CPU: 0 UID: 0 PID: 1702 Comm: kworker/7:3 Tainted: G Tainted: [N]=TEST Workqueue: mm_percpu_wq vmstat_update Call Trace: <TASK> dump_stack_lvl+0x8d/0xb0 check_preemption_disabled+0xce/0xe0 refresh_cpu_vm_stats+0x235/0x5f0 vmstat_update+0x17/0xa0 process_one_work+0x869/0x1aa0 worker_thread+0x5e5/0x1100 kthread+0x29e/0x380 ret_from_fork+0x2d/0x70 ret_from_fork_asm+0x1a/0x30 </TASK> So, for mm/vmstat:online, disable vmstat_work reliably on teardown and symmetrically enable it on startup. For secondary CPUs during CPU hotplug scenarios, ensure the delayed work is disabled immediately after the initialization. These CPUs are not yet online when start_shepherd_timer() runs on boot CPU. vmstat_cpu_online() will enable the work for them. Link: https://lkml.kernel.org/r/20250108042807.3429745-1-koichiro.den@canonical.com Signed-off-by: Huacai Chen <chenhuacai@kernel.org> Signed-off-by: Koichiro Den <koichiro.den@canonical.com> Suggested-by: Huacai Chen <chenhuacai@kernel.org> Tested-by: Charalampos Mitrodimas <charmitro@posteo.net> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-12mm: clear uffd-wp PTE/PMD state on mremap()Ryan Roberts
When mremap()ing a memory region previously registered with userfaultfd as write-protected but without UFFD_FEATURE_EVENT_REMAP, an inconsistency in flag clearing leads to a mismatch between the vma flags (which have uffd-wp cleared) and the pte/pmd flags (which do not have uffd-wp cleared). This mismatch causes a subsequent mprotect(PROT_WRITE) to trigger a warning in page_table_check_pte_flags() due to setting the pte to writable while uffd-wp is still set. Fix this by always explicitly clearing the uffd-wp pte/pmd flags on any such mremap() so that the values are consistent with the existing clearing of VM_UFFD_WP. Be careful to clear the logical flag regardless of its physical form; a PTE bit, a swap PTE bit, or a PTE marker. Cover PTE, huge PMD and hugetlb paths. Link: https://lkml.kernel.org/r/20250107144755.1871363-2-ryan.roberts@arm.com Co-developed-by: Mikołaj Lenczewski <miko.lenczewski@arm.com> Signed-off-by: Mikołaj Lenczewski <miko.lenczewski@arm.com> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Closes: https://lore.kernel.org/linux-mm/810b44a8-d2ae-4107-b665-5a42eae2d948@arm.com/ Fixes: 63b2d4174c4a ("userfaultfd: wp: add the writeprotect API to userfaultfd ioctl") Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Peter Xu <peterx@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-12mm: zswap: properly synchronize freeing resources during CPU hotunplugYosry Ahmed
In zswap_compress() and zswap_decompress(), the per-CPU acomp_ctx of the current CPU at the beginning of the operation is retrieved and used throughout. However, since neither preemption nor migration are disabled, it is possible that the operation continues on a different CPU. If the original CPU is hotunplugged while the acomp_ctx is still in use, we run into a UAF bug as some of the resources attached to the acomp_ctx are freed during hotunplug in zswap_cpu_comp_dead() (i.e. acomp_ctx.buffer, acomp_ctx.req, or acomp_ctx.acomp). The problem was introduced in commit 1ec3b5fe6eec ("mm/zswap: move to use crypto_acomp API for hardware acceleration") when the switch to the crypto_acomp API was made. Prior to that, the per-CPU crypto_comp was retrieved using get_cpu_ptr() which disables preemption and makes sure the CPU cannot go away from under us. Preemption cannot be disabled with the crypto_acomp API as a sleepable context is needed. Use the acomp_ctx.mutex to synchronize CPU hotplug callbacks allocating and freeing resources with compression/decompression paths. Make sure that acomp_ctx.req is NULL when the resources are freed. In the compression/decompression paths, check if acomp_ctx.req is NULL after acquiring the mutex (meaning the CPU was offlined) and retry on the new CPU. The initialization of acomp_ctx.mutex is moved from the CPU hotplug callback to the pool initialization where it belongs (where the mutex is allocated). In addition to adding clarity, this makes sure that CPU hotplug cannot reinitialize a mutex that is already locked by compression/decompression. Previously a fix was attempted by holding cpus_read_lock() [1]. This would have caused a potential deadlock as it is possible for code already holding the lock to fall into reclaim and enter zswap (causing a deadlock). A fix was also attempted using SRCU for synchronization, but Johannes pointed out that synchronize_srcu() cannot be used in CPU hotplug notifiers [2]. Alternative fixes that were considered/attempted and could have worked: - Refcounting the per-CPU acomp_ctx. This involves complexity in handling the race between the refcount dropping to zero in zswap_[de]compress() and the refcount being re-initialized when the CPU is onlined. - Disabling migration before getting the per-CPU acomp_ctx [3], but that's discouraged and is a much bigger hammer than needed, and could result in subtle performance issues. [1]https://lkml.kernel.org/20241219212437.2714151-1-yosryahmed@google.com/ [2]https://lkml.kernel.org/20250107074724.1756696-2-yosryahmed@google.com/ [3]https://lkml.kernel.org/20250107222236.2715883-2-yosryahmed@google.com/ [yosryahmed@google.com: remove comment] Link: https://lkml.kernel.org/r/CAJD7tkaxS1wjn+swugt8QCvQ-rVF5RZnjxwPGX17k8x9zSManA@mail.gmail.com Link: https://lkml.kernel.org/r/20250108222441.3622031-1-yosryahmed@google.com Fixes: 1ec3b5fe6eec ("mm/zswap: move to use crypto_acomp API for hardware acceleration") Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Reported-by: Johannes Weiner <hannes@cmpxchg.org> Closes: https://lore.kernel.org/lkml/20241113213007.GB1564047@cmpxchg.org/ Reported-by: Sam Sun <samsun1006219@gmail.com> Closes: https://lore.kernel.org/lkml/CAEkJfYMtSdM5HceNsXUDf5haghD5+o2e7Qv4OcuruL4tPg6OaQ@mail.gmail.com/ Cc: Barry Song <baohua@kernel.org> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-12Revert "mm: zswap: fix race between [de]compression and CPU hotunplug"Yosry Ahmed
This reverts commit eaebeb93922ca6ab0dd92027b73d0112701706ef. Commit eaebeb93922c ("mm: zswap: fix race between [de]compression and CPU hotunplug") used the CPU hotplug lock in zswap compress/decompress operations to protect against a race with CPU hotunplug making some per-CPU resources go away. However, zswap compress/decompress can be reached through reclaim while the lock is held, resulting in a potential deadlock as reported by syzbot: ====================================================== WARNING: possible circular locking dependency detected 6.13.0-rc6-syzkaller-00006-g5428dc1906dd #0 Not tainted ------------------------------------------------------ kswapd0/89 is trying to acquire lock: ffffffff8e7d2ed0 (cpu_hotplug_lock){++++}-{0:0}, at: acomp_ctx_get_cpu mm/zswap.c:886 [inline] ffffffff8e7d2ed0 (cpu_hotplug_lock){++++}-{0:0}, at: zswap_compress mm/zswap.c:908 [inline] ffffffff8e7d2ed0 (cpu_hotplug_lock){++++}-{0:0}, at: zswap_store_page mm/zswap.c:1439 [inline] ffffffff8e7d2ed0 (cpu_hotplug_lock){++++}-{0:0}, at: zswap_store+0xa74/0x1ba0 mm/zswap.c:1546 but task is already holding lock: ffffffff8ea355a0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat mm/vmscan.c:6871 [inline] ffffffff8ea355a0 (fs_reclaim){+.+.}-{0:0}, at: kswapd+0xb58/0x2f30 mm/vmscan.c:7253 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (fs_reclaim){+.+.}-{0:0}: lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5849 __fs_reclaim_acquire mm/page_alloc.c:3853 [inline] fs_reclaim_acquire+0x88/0x130 mm/page_alloc.c:3867 might_alloc include/linux/sched/mm.h:318 [inline] slab_pre_alloc_hook mm/slub.c:4070 [inline] slab_alloc_node mm/slub.c:4148 [inline] __kmalloc_cache_node_noprof+0x40/0x3a0 mm/slub.c:4337 kmalloc_node_noprof include/linux/slab.h:924 [inline] alloc_worker kernel/workqueue.c:2638 [inline] create_worker+0x11b/0x720 kernel/workqueue.c:2781 workqueue_prepare_cpu+0xe3/0x170 kernel/workqueue.c:6628 cpuhp_invoke_callback+0x48d/0x830 kernel/cpu.c:194 __cpuhp_invoke_callback_range kernel/cpu.c:965 [inline] cpuhp_invoke_callback_range kernel/cpu.c:989 [inline] cpuhp_up_callbacks kernel/cpu.c:1020 [inline] _cpu_up+0x2b3/0x580 kernel/cpu.c:1690 cpu_up+0x184/0x230 kernel/cpu.c:1722 cpuhp_bringup_mask+0xdf/0x260 kernel/cpu.c:1788 cpuhp_bringup_cpus_parallel+0xf9/0x160 kernel/cpu.c:1878 bringup_nonboot_cpus+0x2b/0x50 kernel/cpu.c:1892 smp_init+0x34/0x150 kernel/smp.c:1009 kernel_init_freeable+0x417/0x5d0 init/main.c:1569 kernel_init+0x1d/0x2b0 init/main.c:1466 ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 -> #0 (cpu_hotplug_lock){++++}-{0:0}: check_prev_add kernel/locking/lockdep.c:3161 [inline] check_prevs_add kernel/locking/lockdep.c:3280 [inline] validate_chain+0x18ef/0x5920 kernel/locking/lockdep.c:3904 __lock_acquire+0x1397/0x2100 kernel/locking/lockdep.c:5226 lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5849 percpu_down_read include/linux/percpu-rwsem.h:51 [inline] cpus_read_lock+0x42/0x150 kernel/cpu.c:490 acomp_ctx_get_cpu mm/zswap.c:886 [inline] zswap_compress mm/zswap.c:908 [inline] zswap_store_page mm/zswap.c:1439 [inline] zswap_store+0xa74/0x1ba0 mm/zswap.c:1546 swap_writepage+0x647/0xce0 mm/page_io.c:279 shmem_writepage+0x1248/0x1610 mm/shmem.c:1579 pageout mm/vmscan.c:696 [inline] shrink_folio_list+0x35ee/0x57e0 mm/vmscan.c:1374 shrink_inactive_list mm/vmscan.c:1967 [inline] shrink_list mm/vmscan.c:2205 [inline] shrink_lruvec+0x16db/0x2f30 mm/vmscan.c:5734 mem_cgroup_shrink_node+0x385/0x8e0 mm/vmscan.c:6575 mem_cgroup_soft_reclaim mm/memcontrol-v1.c:312 [inline] memcg1_soft_limit_reclaim+0x346/0x810 mm/memcontrol-v1.c:362 balance_pgdat mm/vmscan.c:6975 [inline] kswapd+0x17b3/0x2f30 mm/vmscan.c:7253 kthread+0x2f0/0x390 kernel/kthread.c:389 ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(fs_reclaim); lock(cpu_hotplug_lock); lock(fs_reclaim); rlock(cpu_hotplug_lock); *** DEADLOCK *** 1 lock held by kswapd0/89: #0: ffffffff8ea355a0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat mm/vmscan.c:6871 [inline] #0: ffffffff8ea355a0 (fs_reclaim){+.+.}-{0:0}, at: kswapd+0xb58/0x2f30 mm/vmscan.c:7253 stack backtrace: CPU: 0 UID: 0 PID: 89 Comm: kswapd0 Not tainted 6.13.0-rc6-syzkaller-00006-g5428dc1906dd #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024 Call Trace: <TASK> __dump_stack lib/dump_stack.c:94 [inline] dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120 print_circular_bug+0x13a/0x1b0 kernel/locking/lockdep.c:2074 check_noncircular+0x36a/0x4a0 kernel/locking/lockdep.c:2206 check_prev_add kernel/locking/lockdep.c:3161 [inline] check_prevs_add kernel/locking/lockdep.c:3280 [inline] validate_chain+0x18ef/0x5920 kernel/locking/lockdep.c:3904 __lock_acquire+0x1397/0x2100 kernel/locking/lockdep.c:5226 lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5849 percpu_down_read include/linux/percpu-rwsem.h:51 [inline] cpus_read_lock+0x42/0x150 kernel/cpu.c:490 acomp_ctx_get_cpu mm/zswap.c:886 [inline] zswap_compress mm/zswap.c:908 [inline] zswap_store_page mm/zswap.c:1439 [inline] zswap_store+0xa74/0x1ba0 mm/zswap.c:1546 swap_writepage+0x647/0xce0 mm/page_io.c:279 shmem_writepage+0x1248/0x1610 mm/shmem.c:1579 pageout mm/vmscan.c:696 [inline] shrink_folio_list+0x35ee/0x57e0 mm/vmscan.c:1374 shrink_inactive_list mm/vmscan.c:1967 [inline] shrink_list mm/vmscan.c:2205 [inline] shrink_lruvec+0x16db/0x2f30 mm/vmscan.c:5734 mem_cgroup_shrink_node+0x385/0x8e0 mm/vmscan.c:6575 mem_cgroup_soft_reclaim mm/memcontrol-v1.c:312 [inline] memcg1_soft_limit_reclaim+0x346/0x810 mm/memcontrol-v1.c:362 balance_pgdat mm/vmscan.c:6975 [inline] kswapd+0x17b3/0x2f30 mm/vmscan.c:7253 kthread+0x2f0/0x390 kernel/kthread.c:389 ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 </TASK> Revert the change. A different fix for the race with CPU hotunplug will follow. Link: https://lkml.kernel.org/r/20250107222236.2715883-1-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Cc: Barry Song <baohua@kernel.org> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Sam Sun <samsun1006219@gmail.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-12mm: fix div by zero in bdi_ratio_from_pagesStefan Roesch
During testing it has been detected, that it is possible to get div by zero error in bdi_set_min_bytes. The error is caused by the function bdi_ratio_from_pages(). bdi_ratio_from_pages() calls global_dirty_limits. If the dirty threshold is 0, the div by zero is raised. This can happen if the root user is setting: echo 0 > /proc/sys/vm/dirty_ratio The following is a test case: echo 0 > /proc/sys/vm/dirty_ratio cd /sys/class/bdi/<device> echo 1 > strict_limit echo 8192 > min_bytes ==> error is raised. The problem is addressed by returning -EINVAL if dirty_ratio or dirty_bytes is set to 0. [shr@devkernel.io: check for -EINVAL in bdi_set_min_bytes() and bdi_set_max_bytes()] Link: https://lkml.kernel.org/r/20250108014723.166637-1-shr@devkernel.io [shr@devkernel.io: v3] Link: https://lkml.kernel.org/r/20250109063411.6591-1-shr@devkernel.io Link: https://lkml.kernel.org/r/20250104012037.159386-1-shr@devkernel.io Signed-off-by: Stefan Roesch <shr@devkernel.io> Reported-by: cheung wall <zzqq0103.hey@gmail.com> Closes: https://lore.kernel.org/linux-mm/87pll35yd0.fsf@devkernel.io/T/#t Acked-by: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Qiang Zhang <zzqq0103.hey@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-12filemap: avoid truncating 64-bit offset to 32 bitsMarco Nelissen
On 32-bit kernels, folio_seek_hole_data() was inadvertently truncating a 64-bit value to 32 bits, leading to a possible infinite loop when writing to an xfs filesystem. Link: https://lkml.kernel.org/r/20250102190540.1356838-1-marco.nelissen@gmail.com Fixes: 54fa39ac2e00 ("iomap: use mapping_seek_hole_data") Signed-off-by: Marco Nelissen <marco.nelissen@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-12mm/mempolicy: count MPOL_WEIGHTED_INTERLEAVE to "interleave_hit"Honggyu Kim
Commit fa3bea4e1f82 introduced MPOL_WEIGHTED_INTERLEAVE but it missed adding its counter to "interleave_hit" of numastat, which is located at /sys/devices/system/node/nodeN/ directory. It'd be better to add weighted interleving counter info to the existing "interleave_hit" instead of introducing a new counter "weighted_interleave_hit". Link: https://lkml.kernel.org/r/20241227095737.645-1-honggyu.kim@sk.com Fixes: fa3bea4e1f82 ("mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving") Signed-off-by: Honggyu Kim <honggyu.kim@sk.com> Reviewed-by: Gregory Price <gourry@gourry.net> Reviewed-by: Hyeonggon Yoo <hyeonggon.yoo@sk.com> Tested-by: Yunjeong Mun <yunjeong.mun@sk.com> Cc: Andi Kleen <ak@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-12mm/kmemleak: fix percpu memory leak detection failureGuo Weikang
kmemleak_alloc_percpu gives an incorrect min_count parameter, causing percpu memory to be considered a gray object. Link: https://lkml.kernel.org/r/20241227092311.3572500-1-guoweikang.kernel@gmail.com Fixes: 8c8685928910 ("mm/kmemleak: use IS_ERR_PCPU() for pointer in the percpu address space") Signed-off-by: Guo Weikang <guoweikang.kernel@gmail.com> Acked-by: Uros Bizjak <ubizjak@gmail.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Guo Weikang <guoweikang.kernel@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-11mm/slab: Move kvfree_rcu() into SLABUladzislau Rezki (Sony)
Move kvfree_rcu() functionality to the slab_common.c file. The reason to have kvfree_rcu() functionality as part of SLAB is that there is a clear trend and need of closer integration. One of the recent example is creating a barrier function for SLAB caches. Another reason is to prevent of having several implementations of RCU machinery for reclaiming objects after a GP. As future steps, it can be more integrated(easier) with SLAB internals. Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Acked-by: Hyeonggon Yoo <hyeonggon.yoo@sk.com> Tested-by: Hyeonggon Yoo <hyeonggon.yoo@sk.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2025-01-08mm: Create/affine kswapd to its preferred nodeFrederic Weisbecker
kswapd is dedicated to a specific node. As such it wants to be preferrably affine to it, memory and CPUs-wise. Use the proper kthread API to achieve that. As a bonus it takes care of CPU-hotplug events and CPU-isolation on its behalf. Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2025-01-08mm: Create/affine kcompactd to its preferred nodeFrederic Weisbecker
Kcompactd is dedicated to a specific node. As such it wants to be preferrably affine to it, memory and CPUs-wise. Use the proper kthread API to achieve that. As a bonus it takes care of CPU-hotplug events and CPU-isolation on its behalf. Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2025-01-06kernel/cgroup: Add "dmem" memory accounting cgroupMaarten Lankhorst
This code is based on the RDMA and misc cgroup initially, but now uses page_counter. It uses the same min/low/max semantics as the memory cgroup as a result. There's a small mismatch as TTM uses u64, and page_counter long pages. In practice it's not a problem. 32-bits systems don't really come with >=4GB cards and as long as we're consistently wrong with units, it's fine. The device page size may not be in the same units as kernel page size, and each region might also have a different page size (VRAM vs GART for example). The interface is simple: - Call dmem_cgroup_register_region() - Use dmem_cgroup_try_charge to check if you can allocate a chunk of memory, use dmem_cgroup__uncharge when freeing it. This may return an error code, or -EAGAIN when the cgroup limit is reached. In that case a reference to the limiting pool is returned. - The limiting cs can be used as compare function for dmem_cgroup_state_evict_valuable. - After having evicted enough, drop reference to limiting cs with dmem_cgroup_pool_state_put. This API allows you to limit device resources with cgroups. You can see the supported cards in /sys/fs/cgroup/dmem.capacity You need to echo +dmem to cgroup.subtree_control, and then you can partition device memory. Co-developed-by: Friedrich Vock <friedrich.vock@gmx.de> Signed-off-by: Friedrich Vock <friedrich.vock@gmx.de> Co-developed-by: Maxime Ripard <mripard@kernel.org> Signed-off-by: Maarten Lankhorst <dev@lankhorst.se> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20241204143112.1250983-1-dev@lankhorst.se Signed-off-by: Maxime Ripard <mripard@kernel.org>
2025-01-06Revert "vmstat: disable vmstat_work on vmstat_cpu_down_prep()"Linus Torvalds
This reverts commit adcfb264c3ed51fbbf5068ddf10d309a63683868. It turns out this just causes a different warning splat instead that seems to be much easier to trigger, so let's revert ASAP. Reported-and-bisected-by: Borislav Petkov <bp@alien8.de> Tested-by: Breno Leitao <leitao@debian.org> Reported-by: Alexander Gordeev <agordeev@linux.ibm.com> Link: https://lore.kernel.org/all/20250106131817.GAZ3vYGVr3-hWFFPLj@fat_crate.local/ Cc: Koichiro Den <koichiro.den@canonical.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2025-01-04Revert "libfs: Add simple_offset_empty()"Chuck Lever
simple_empty() and simple_offset_empty() perform the same task. The latter's use as a canary to find bugs has not found any new issues. A subsequent patch will remove the use of the mtree for iterating directory contents, so revert back to using a similar mechanism for determining whether a directory is indeed empty. Only one such mechanism is ever needed. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Link: https://lore.kernel.org/r/20241228175522.1854234-3-cel@kernel.org Reviewed-by: Yang Erkun <yangerkun@huawei.com> Signed-off-by: Christian Brauner <brauner@kernel.org>