summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2025-03-16mm/ioremap: pass pgprot_t to ioremap_prot() instead of unsigned longRyan Roberts
ioremap_prot() currently accepts pgprot_val parameter as an unsigned long, thus implicitly assuming that pgprot_val and pgprot_t could never be bigger than unsigned long. But this assumption soon will not be true on arm64 when using D128 pgtables. In 128 bit page table configuration, unsigned long is 64 bit, but pgprot_t is 128 bit. Passing platform abstracted pgprot_t argument is better as compared to size based data types. Let's change the parameter to directly pass pgprot_t like another similar helper generic_ioremap_prot(). Without this change in place, D128 configuration does not work on arm64 as the top 64 bits gets silently stripped when passing the protection value to this function. Link: https://lkml.kernel.org/r/20250218101954.415331-1-anshuman.khandual@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Co-developed-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> [arm64] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16selftests/mm: fix spellingUjwal Kundur
Fix misspelling flagged by codespell. Link: https://lkml.kernel.org/r/20250215081803.1793-1-ujwal.kundur@gmail.com Signed-off-by: Ujwal Kundur <ujwal.kundur@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16Documentation/mm: fix spelling mistakeSuchit K
The word watermark was misspelled as "watemark". Link: https://lkml.kernel.org/r/CAO9wTFhe4sf1eVVgijt2cdLPPsUHBj7B=HN-380_JSpve5KbvQ@mail.gmail.com Signed-off-by: Suchit <suchitkarunakaran@gmail.com> Cc: Shuah Khan <skhan@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16fs: remove folio_file_mapping()Matthew Wilcox (Oracle)
No callers of this function remain as filesystems no longer see swapfile pages through their normal read/write paths. Link: https://lkml.kernel.org/r/20250217192009.437916-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16fs: remove page_file_mapping()Matthew Wilcox (Oracle)
This wrapper has no more callers. Delete it. Link: https://lkml.kernel.org/r/20250217192009.437916-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16fs: convert block_commit_write() to take a folioMatthew Wilcox (Oracle)
All callers now have a folio, so pass it in instead of converting folio->page->folio. Link: https://lkml.kernel.org/r/20250217192009.437916-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16Docs/mm/damon: fix spelling and grammar in ↵Marcelo Moreira
monitoring_intervals_tuning_example.rst This patch fixes some spelling and grammar mistakes in the documentation, improving the readability. - multipled -> multiplied - idential -> identical - minuts -> minutes - efficieny -> efficiency Link: https://lkml.kernel.org/r/20250217215512.12833-1-marcelomoreira1905@gmail.com Signed-off-by: Marcelo Moreira <marcelomoreira1905@gmail.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Shuah khan <skhan@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16docs/mm: document latest changes to vm_lockSuren Baghdasaryan
Change the documentation to reflect that vm_lock is integrated into vma and replaced with vm_refcnt. Document newly introduced vma_start_read_locked{_nested} functions. Link: https://lkml.kernel.org/r/20250213224655.1680278-19-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Tested-by: Shivank Garg <shivankg@amd.com> Link: https://lkml.kernel.org/r/5e19ec93-8307-47c2-bb13-3ddf7150624e@amd.com Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christian Brauner <brauner@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Klara Modin <klarasmodin@gmail.com> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Sourav Panda <souravpanda@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm: make vma cache SLAB_TYPESAFE_BY_RCUSuren Baghdasaryan
To enable SLAB_TYPESAFE_BY_RCU for vma cache we need to ensure that object reuse before RCU grace period is over will be detected by lock_vma_under_rcu(). Current checks are sufficient as long as vma is detached before it is freed. The only place this is not currently happening is in exit_mmap(). Add the missing vma_mark_detached() in exit_mmap(). Another issue which might trick lock_vma_under_rcu() during vma reuse is vm_area_dup(), which copies the entire content of the vma into a new one, overriding new vma's vm_refcnt and temporarily making it appear as attached. This might trick a racing lock_vma_under_rcu() to operate on a reused vma if it found the vma before it got reused. To prevent this situation, we should ensure that vm_refcnt stays at detached state (0) when it is copied and advances to attached state only after it is added into the vma tree. Introduce vm_area_init_from() which preserves new vma's vm_refcnt and use it in vm_area_dup(). Since all vmas are in detached state with no current readers when they are freed, lock_vma_under_rcu() will not be able to take vm_refcnt after vma got detached even if vma is reused. vma_mark_attached() in modified to include a release fence to ensure all stores to the vma happen before vm_refcnt gets initialized. Finally, make vm_area_cachep SLAB_TYPESAFE_BY_RCU. This will facilitate vm_area_struct reuse and will minimize the number of call_rcu() calls. [surenb@google.com: remove atomic_set_release() usage in tools/] Link: https://lkml.kernel.org/r/20250217054351.2973666-1-surenb@google.com Link: https://lkml.kernel.org/r/20250213224655.1680278-18-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Shivank Garg <shivankg@amd.com> Link: https://lkml.kernel.org/r/5e19ec93-8307-47c2-bb13-3ddf7150624e@amd.com Cc: Christian Brauner <brauner@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Klara Modin <klarasmodin@gmail.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Sourav Panda <souravpanda@google.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm: prepare lock_vma_under_rcu() for vma reuse possibilitySuren Baghdasaryan
Once we make vma cache SLAB_TYPESAFE_BY_RCU, it will be possible for a vma to be reused and attached to another mm after lock_vma_under_rcu() locks the vma. lock_vma_under_rcu() should ensure that vma_start_read() is using the original mm and after locking the vma it should ensure that vma->vm_mm has not changed from under us. Link: https://lkml.kernel.org/r/20250213224655.1680278-17-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Shivank Garg <shivankg@amd.com> Link: https://lkml.kernel.org/r/5e19ec93-8307-47c2-bb13-3ddf7150624e@amd.com Cc: Christian Brauner <brauner@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Klara Modin <klarasmodin@gmail.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Sourav Panda <souravpanda@google.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm: remove extra vma_numab_state_init() callSuren Baghdasaryan
vma_init() already memset's the whole vm_area_struct to 0, so there is no need to an additional vma_numab_state_init(). Link: https://lkml.kernel.org/r/20250213224655.1680278-16-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Tested-by: Shivank Garg <shivankg@amd.com> Link: https://lkml.kernel.org/r/5e19ec93-8307-47c2-bb13-3ddf7150624e@amd.com Cc: Christian Brauner <brauner@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Klara Modin <klarasmodin@gmail.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Sourav Panda <souravpanda@google.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm/debug: print vm_refcnt state when dumping the vmaSuren Baghdasaryan
vm_refcnt encodes a number of useful states: - whether vma is attached or detached - the number of current vma readers - presence of a vma writer Let's include it in the vma dump. Link: https://lkml.kernel.org/r/20250213224655.1680278-15-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Shivank Garg <shivankg@amd.com> Link: https://lkml.kernel.org/r/5e19ec93-8307-47c2-bb13-3ddf7150624e@amd.com Cc: Christian Brauner <brauner@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Klara Modin <klarasmodin@gmail.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Sourav Panda <souravpanda@google.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm: move lesser used vma_area_struct members into the last cachelineSuren Baghdasaryan
Move several vma_area_struct members which are rarely or never used during page fault handling into the last cacheline to better pack vm_area_struct. As a result vm_area_struct will fit into 3 as opposed to 4 cachelines. New typical vm_area_struct layout: struct vm_area_struct { union { struct { long unsigned int vm_start; /* 0 8 */ long unsigned int vm_end; /* 8 8 */ }; /* 0 16 */ freeptr_t vm_freeptr; /* 0 8 */ }; /* 0 16 */ struct mm_struct * vm_mm; /* 16 8 */ pgprot_t vm_page_prot; /* 24 8 */ union { const vm_flags_t vm_flags; /* 32 8 */ vm_flags_t __vm_flags; /* 32 8 */ }; /* 32 8 */ unsigned int vm_lock_seq; /* 40 4 */ /* XXX 4 bytes hole, try to pack */ struct list_head anon_vma_chain; /* 48 16 */ /* --- cacheline 1 boundary (64 bytes) --- */ struct anon_vma * anon_vma; /* 64 8 */ const struct vm_operations_struct * vm_ops; /* 72 8 */ long unsigned int vm_pgoff; /* 80 8 */ struct file * vm_file; /* 88 8 */ void * vm_private_data; /* 96 8 */ atomic_long_t swap_readahead_info; /* 104 8 */ struct mempolicy * vm_policy; /* 112 8 */ struct vma_numab_state * numab_state; /* 120 8 */ /* --- cacheline 2 boundary (128 bytes) --- */ refcount_t vm_refcnt (__aligned__(64)); /* 128 4 */ /* XXX 4 bytes hole, try to pack */ struct { struct rb_node rb (__aligned__(8)); /* 136 24 */ long unsigned int rb_subtree_last; /* 160 8 */ } __attribute__((__aligned__(8))) shared; /* 136 32 */ struct anon_vma_name * anon_name; /* 168 8 */ struct vm_userfaultfd_ctx vm_userfaultfd_ctx; /* 176 8 */ /* size: 192, cachelines: 3, members: 18 */ /* sum members: 176, holes: 2, sum holes: 8 */ /* padding: 8 */ /* forced alignments: 2, forced holes: 1, sum forced holes: 4 */ } __attribute__((__aligned__(64))); Memory consumption per 1000 VMAs becomes 48 pages: slabinfo after vm_area_struct changes: <name> ... <objsize> <objperslab> <pagesperslab> : ... vm_area_struct ... 192 42 2 : ... Link: https://lkml.kernel.org/r/20250213224655.1680278-14-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Tested-by: Shivank Garg <shivankg@amd.com> Link: https://lkml.kernel.org/r/5e19ec93-8307-47c2-bb13-3ddf7150624e@amd.com Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christian Brauner <brauner@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Klara Modin <klarasmodin@gmail.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Sourav Panda <souravpanda@google.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm: replace vm_lock and detached flag with a reference countSuren Baghdasaryan
rw_semaphore is a sizable structure of 40 bytes and consumes considerable space for each vm_area_struct. However vma_lock has two important specifics which can be used to replace rw_semaphore with a simpler structure: 1. Readers never wait. They try to take the vma_lock and fall back to mmap_lock if that fails. 2. Only one writer at a time will ever try to write-lock a vma_lock because writers first take mmap_lock in write mode. Because of these requirements, full rw_semaphore functionality is not needed and we can replace rw_semaphore and the vma->detached flag with a refcount (vm_refcnt). When vma is in detached state, vm_refcnt is 0 and only a call to vma_mark_attached() can take it out of this state. Note that unlike before, now we enforce both vma_mark_attached() and vma_mark_detached() to be done only after vma has been write-locked. vma_mark_attached() changes vm_refcnt to 1 to indicate that it has been attached to the vma tree. When a reader takes read lock, it increments vm_refcnt, unless the top usable bit of vm_refcnt (0x40000000) is set, indicating presence of a writer. When writer takes write lock, it sets the top usable bit to indicate its presence. If there are readers, writer will wait using newly introduced mm->vma_writer_wait. Since all writers take mmap_lock in write mode first, there can be only one writer at a time. The last reader to release the lock will signal the writer to wake up. refcount might overflow if there are many competing readers, in which case read-locking will fail. Readers are expected to handle such failures. In summary: 1. all readers increment the vm_refcnt; 2. writer sets top usable (writer) bit of vm_refcnt; 3. readers cannot increment the vm_refcnt if the writer bit is set; 4. in the presence of readers, writer must wait for the vm_refcnt to drop to 1 (plus the VMA_LOCK_OFFSET writer bit), indicating an attached vma with no readers; 5. vm_refcnt overflow is handled by the readers. While this vm_lock replacement does not yet result in a smaller vm_area_struct (it stays at 256 bytes due to cacheline alignment), it allows for further size optimization by structure member regrouping to bring the size of vm_area_struct below 192 bytes. [surenb@google.com: fix a crash due to vma_end_read() that should have been removed] Link: https://lkml.kernel.org/r/20250220200208.323769-1-surenb@google.com Link: https://lkml.kernel.org/r/20250213224655.1680278-13-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Suggested-by: Peter Zijlstra <peterz@infradead.org> Suggested-by: Matthew Wilcox <willy@infradead.org> Tested-by: Shivank Garg <shivankg@amd.com> Link: https://lkml.kernel.org/r/5e19ec93-8307-47c2-bb13-3ddf7150624e@amd.com Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christian Brauner <brauner@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Klara Modin <klarasmodin@gmail.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Sourav Panda <souravpanda@google.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16refcount: introduce __refcount_{add|inc}_not_zero_limited_acquireSuren Baghdasaryan
Introduce functions to increase refcount but with a top limit above which they will fail to increase (the limit is inclusive). Setting the limit to INT_MAX indicates no limit. Link: https://lkml.kernel.org/r/20250213224655.1680278-12-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Tested-by: Shivank Garg <shivankg@amd.com> Link: https://lkml.kernel.org/r/5e19ec93-8307-47c2-bb13-3ddf7150624e@amd.com Cc: Christian Brauner <brauner@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Klara Modin <klarasmodin@gmail.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Sourav Panda <souravpanda@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16refcount: provide ops for cases when object's memory can be reusedSuren Baghdasaryan
For speculative lookups where a successful inc_not_zero() pins the object, but where we still need to double check if the object acquired is indeed the one we set out to acquire (identity check), needs this validation to happen *after* the increment. Similarly, when a new object is initialized and its memory might have been previously occupied by another object, all stores to initialize the object should happen *before* refcount initialization. Notably SLAB_TYPESAFE_BY_RCU is one such an example when this ordering is required for reference counting. Add refcount_{add|inc}_not_zero_acquire() to guarantee the proper ordering between acquiring a reference count on an object and performing the identity check for that object. Add refcount_set_release() to guarantee proper ordering between stores initializing object attributes and the store initializing the refcount. refcount_set_release() should be done after all other object attributes are initialized. Once refcount_set_release() is called, the object should be considered visible to other tasks even if it was not yet added into an object collection normally used to discover it. This is because other tasks might have discovered the object previously occupying the same memory and after memory reuse they can succeed in taking refcount for the new object and start using it. Object reuse example to consider: consumer: obj = lookup(collection, key); if (!refcount_inc_not_zero_acquire(&obj->ref)) return; if (READ_ONCE(obj->key) != key) { /* identity check */ put_ref(obj); return; } use(obj->value); producer: remove(collection, obj->key); if (!refcount_dec_and_test(&obj->ref)) return; obj->key = KEY_INVALID; free(obj); obj = malloc(); /* obj is reused */ obj->key = new_key; obj->value = new_value; refcount_set_release(obj->ref, 1); add(collection, new_key, obj); refcount_{add|inc}_not_zero_acquire() is required to prevent the following reordering when refcount_inc_not_zero() is used instead: consumer: obj = lookup(collection, key); if (READ_ONCE(obj->key) != key) { /* reordered identity check */ put_ref(obj); return; } producer: remove(collection, obj->key); if (!refcount_dec_and_test(&obj->ref)) return; obj->key = KEY_INVALID; free(obj); obj = malloc(); /* obj is reused */ obj->key = new_key; obj->value = new_value; refcount_set_release(obj->ref, 1); add(collection, new_key, obj); if (!refcount_inc_not_zero(&obj->ref)) return; use(obj->value); /* USING WRONG OBJECT */ refcount_set_release() is required to prevent the following reordering when refcount_set() is used instead: consumer: obj = lookup(collection, key); producer: remove(collection, obj->key); if (!refcount_dec_and_test(&obj->ref)) return; obj->key = KEY_INVALID; free(obj); obj = malloc(); /* obj is reused */ obj->key = new_key; /* new_key == old_key */ refcount_set(obj->ref, 1); if (!refcount_inc_not_zero_acquire(&obj->ref)) return; if (READ_ONCE(obj->key) != key) { /* pass since new_key == old_key */ put_ref(obj); return; } use(obj->value); /* USING STALE obj->value */ obj->value = new_value; /* reordered store */ add(collection, key, obj); [surenb@google.com: fix title underlines in refcount-vs-atomic.rst] Link: https://lkml.kernel.org/r/20250217161645.3137927-1-surenb@google.com Link: https://lkml.kernel.org/r/20250213224655.1680278-11-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> [slab] Tested-by: Shivank Garg <shivankg@amd.com> Link: https://lkml.kernel.org/r/5e19ec93-8307-47c2-bb13-3ddf7150624e@amd.com Cc: Peter Zijlstra <peterz@infradead.org> Cc: Will Deacon <will@kernel.org> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Klara Modin <klarasmodin@gmail.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Sourav Panda <souravpanda@google.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm: uninline the main body of vma_start_write()Suren Baghdasaryan
vma_start_write() is used in many places and will grow in size very soon. It is not used in performance critical paths and uninlining it should limit the future code size growth. No functional changes. Link: https://lkml.kernel.org/r/20250213224655.1680278-10-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Tested-by: Shivank Garg <shivankg@amd.com> Link: https://lkml.kernel.org/r/5e19ec93-8307-47c2-bb13-3ddf7150624e@amd.com Cc: Christian Brauner <brauner@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Klara Modin <klarasmodin@gmail.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Sourav Panda <souravpanda@google.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm: move mmap_init_lock() out of the header fileSuren Baghdasaryan
mmap_init_lock() is used only from mm_init() in fork.c, therefore it does not have to reside in the header file. This move lets us avoid including additional headers in mmap_lock.h later, when mmap_init_lock() needs to initialize rcuwait object. Link: https://lkml.kernel.org/r/20250213224655.1680278-9-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Tested-by: Shivank Garg <shivankg@amd.com> Link: https://lkml.kernel.org/r/5e19ec93-8307-47c2-bb13-3ddf7150624e@amd.com Cc: Christian Brauner <brauner@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Klara Modin <klarasmodin@gmail.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Sourav Panda <souravpanda@google.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm: allow vma_start_read_locked/vma_start_read_locked_nested to failSuren Baghdasaryan
With upcoming replacement of vm_lock with vm_refcnt, we need to handle a possibility of vma_start_read_locked/vma_start_read_locked_nested failing due to refcount overflow. Prepare for such possibility by changing these APIs and adjusting their users. Link: https://lkml.kernel.org/r/20250213224655.1680278-8-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Cc: Lokesh Gidra <lokeshgidra@google.com> Tested-by: Shivank Garg <shivankg@amd.com> Link: https://lkml.kernel.org/r/5e19ec93-8307-47c2-bb13-3ddf7150624e@amd.com Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christian Brauner <brauner@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Klara Modin <klarasmodin@gmail.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Sourav Panda <souravpanda@google.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16types: move struct rcuwait into types.hSuren Baghdasaryan
Move rcuwait struct definition into types.h so that rcuwait can be used without including rcuwait.h which includes other headers. Without this change mm_types.h can't use rcuwait due to a the following circular dependency: mm_types.h -> rcuwait.h -> signal.h -> mm_types.h Link: https://lkml.kernel.org/r/20250213224655.1680278-7-surenb@google.com Suggested-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Davidlohr Bueso <dave@stgolabs.net> Acked-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Tested-by: Shivank Garg <shivankg@amd.com> Link: https://lkml.kernel.org/r/5e19ec93-8307-47c2-bb13-3ddf7150624e@amd.com Cc: Christian Brauner <brauner@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Klara Modin <klarasmodin@gmail.com> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Sourav Panda <souravpanda@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm: mark vmas detached upon exitSuren Baghdasaryan
When exit_mmap() removes vmas belonging to an exiting task, it does not mark them as detached since they can't be reached by other tasks and they will be freed shortly. Once we introduce vma reuse, all vmas will have to be in detached state before they are freed to ensure vma when reused is in a consistent state. Add missing vma_mark_detached() before freeing the vma. Link: https://lkml.kernel.org/r/20250213224655.1680278-6-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Tested-by: Shivank Garg <shivankg@amd.com> Link: https://lkml.kernel.org/r/5e19ec93-8307-47c2-bb13-3ddf7150624e@amd.com Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Klara Modin <klarasmodin@gmail.com> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Sourav Panda <souravpanda@google.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm: introduce vma_iter_store_attached() to use with attached vmasSuren Baghdasaryan
vma_iter_store() functions can be used both when adding a new vma and when updating an existing one. However for existing ones we do not need to mark them attached as they are already marked that way. With vma->detached being a separate flag, double-marking a vmas as attached or detached is not an issue because the flag will simply be overwritten with the same value. However once we fold this flag into the refcount later in this series, re-attaching or re-detaching a vma becomes an issue since these operations will be incrementing/decrementing a refcount. Introduce vma_iter_store_new() and vma_iter_store_overwrite() to replace vma_iter_store() and avoid re-attaching a vma during vma update. Add assertions in vma_mark_attached()/vma_mark_detached() to catch invalid usage. Update vma tests to check for vma detached state correctness. Link: https://lkml.kernel.org/r/20250213224655.1680278-5-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Tested-by: Shivank Garg <shivankg@amd.com> Link: https://lkml.kernel.org/r/5e19ec93-8307-47c2-bb13-3ddf7150624e@amd.com Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christian Brauner <brauner@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Klara Modin <klarasmodin@gmail.com> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Sourav Panda <souravpanda@google.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm: mark vma as detached until it's added into vma treeSuren Baghdasaryan
Current implementation does not set detached flag when a VMA is first allocated. This does not represent the real state of the VMA, which is detached until it is added into mm's VMA tree. Fix this by marking new VMAs as detached and resetting detached flag only after VMA is added into a tree. Introduce vma_mark_attached() to make the API more readable and to simplify possible future cleanup when vma->vm_mm might be used to indicate detached vma and vma_mark_attached() will need an additional mm parameter. Link: https://lkml.kernel.org/r/20250213224655.1680278-4-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Tested-by: Shivank Garg <shivankg@amd.com> Link: https://lkml.kernel.org/r/5e19ec93-8307-47c2-bb13-3ddf7150624e@amd.com Cc: Christian Brauner <brauner@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Klara Modin <klarasmodin@gmail.com> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Sourav Panda <souravpanda@google.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm: move per-vma lock into vm_area_structSuren Baghdasaryan
Back when per-vma locks were introduces, vm_lock was moved out of vm_area_struct in [1] because of the performance regression caused by false cacheline sharing. Recent investigation [2] revealed that the regressions is limited to a rather old Broadwell microarchitecture and even there it can be mitigated by disabling adjacent cacheline prefetching, see [3]. Splitting single logical structure into multiple ones leads to more complicated management, extra pointer dereferences and overall less maintainable code. When that split-away part is a lock, it complicates things even further. With no performance benefits, there are no reasons for this split. Merging the vm_lock back into vm_area_struct also allows vm_area_struct to use SLAB_TYPESAFE_BY_RCU later in this patchset. Move vm_lock back into vm_area_struct, aligning it at the cacheline boundary and changing the cache to be cacheline-aligned as well. With kernel compiled using defconfig, this causes VMA memory consumption to grow from 160 (vm_area_struct) + 40 (vm_lock) bytes to 256 bytes: slabinfo before: <name> ... <objsize> <objperslab> <pagesperslab> : ... vma_lock ... 40 102 1 : ... vm_area_struct ... 160 51 2 : ... slabinfo after moving vm_lock: <name> ... <objsize> <objperslab> <pagesperslab> : ... vm_area_struct ... 256 32 2 : ... Aggregate VMA memory consumption per 1000 VMAs grows from 50 to 64 pages, which is 5.5MB per 100000 VMAs. Note that the size of this structure is dependent on the kernel configuration and typically the original size is higher than 160 bytes. Therefore these calculations are close to the worst case scenario. A more realistic vm_area_struct usage before this change is: <name> ... <objsize> <objperslab> <pagesperslab> : ... vma_lock ... 40 102 1 : ... vm_area_struct ... 176 46 2 : ... Aggregate VMA memory consumption per 1000 VMAs grows from 54 to 64 pages, which is 3.9MB per 100000 VMAs. This memory consumption growth can be addressed later by optimizing the vm_lock. [1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@google.com/ [2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/ [3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@mail.gmail.com/ Link: https://lkml.kernel.org/r/20250213224655.1680278-3-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Tested-by: Shivank Garg <shivankg@amd.com> Link: https://lkml.kernel.org/r/5e19ec93-8307-47c2-bb13-3ddf7150624e@amd.com Cc: Christian Brauner <brauner@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Klara Modin <klarasmodin@gmail.com> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Sourav Panda <souravpanda@google.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm: introduce vma_start_read_locked{_nested} helpersSuren Baghdasaryan
Patch series "reimplement per-vma lock as a refcount", v10. Back when per-vma locks were introduces, vm_lock was moved out of vm_area_struct in [1] because of the performance regression caused by false cacheline sharing. Recent investigation [2] revealed that the regressions is limited to a rather old Broadwell microarchitecture and even there it can be mitigated by disabling adjacent cacheline prefetching, see [3]. Splitting single logical structure into multiple ones leads to more complicated management, extra pointer dereferences and overall less maintainable code. When that split-away part is a lock, it complicates things even further. With no performance benefits, there are no reasons for this split. Merging the vm_lock back into vm_area_struct also allows vm_area_struct to use SLAB_TYPESAFE_BY_RCU later in this patchset. This patchset: 1. moves vm_lock back into vm_area_struct, aligning it at the cacheline boundary and changing the cache to be cacheline-aligned to minimize cacheline sharing; 2. changes vm_area_struct initialization to mark new vma as detached until it is inserted into vma tree; 3. replaces vm_lock and vma->detached flag with a reference counter; 4. regroups vm_area_struct members to fit them into 3 cachelines; 5. changes vm_area_struct cache to SLAB_TYPESAFE_BY_RCU to allow for their reuse and to minimize call_rcu() calls. Pagefault microbenchmarks show performance improvement: Hmean faults/cpu-1 507926.5547 ( 0.00%) 506519.3692 * -0.28%* Hmean faults/cpu-4 479119.7051 ( 0.00%) 481333.6802 * 0.46%* Hmean faults/cpu-7 452880.2961 ( 0.00%) 455845.6211 * 0.65%* Hmean faults/cpu-12 347639.1021 ( 0.00%) 352004.2254 * 1.26%* Hmean faults/cpu-21 200061.2238 ( 0.00%) 229597.0317 * 14.76%* Hmean faults/cpu-30 145251.2001 ( 0.00%) 164202.5067 * 13.05%* Hmean faults/cpu-48 106848.4434 ( 0.00%) 120641.5504 * 12.91%* Hmean faults/cpu-56 92472.3835 ( 0.00%) 103464.7916 * 11.89%* Hmean faults/sec-1 507566.1468 ( 0.00%) 506139.0811 * -0.28%* Hmean faults/sec-4 1880478.2402 ( 0.00%) 1886795.6329 * 0.34%* Hmean faults/sec-7 3106394.3438 ( 0.00%) 3140550.7485 * 1.10%* Hmean faults/sec-12 4061358.4795 ( 0.00%) 4112477.0206 * 1.26%* Hmean faults/sec-21 3988619.1169 ( 0.00%) 4577747.1436 * 14.77%* Hmean faults/sec-30 3909839.5449 ( 0.00%) 4311052.2787 * 10.26%* Hmean faults/sec-48 4761108.4691 ( 0.00%) 5283790.5026 * 10.98%* Hmean faults/sec-56 4885561.4590 ( 0.00%) 5415839.4045 * 10.85%* This patch (of 18): Introduce helper functions which can be used to read-lock a VMA when holding mmap_lock for read. Replace direct accesses to vma->vm_lock with these new helpers. Link: https://lkml.kernel.org/r/20250213224655.1680278-1-surenb@google.com Link: https://lkml.kernel.org/r/20250213224655.1680278-2-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Tested-by: Shivank Garg <shivankg@amd.com> Link: https://lkml.kernel.org/r/5e19ec93-8307-47c2-bb13-3ddf7150624e@amd.com Cc: Christian Brauner <brauner@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Klara Modin <klarasmodin@gmail.com> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Sourav Panda <souravpanda@google.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm: avoid splitting pmd for lazyfree pmd-mapped THP in try_to_unmapBarry Song
The try_to_unmap_one() function currently handles PMD-mapped THPs inefficiently. It first splits the PMD into PTEs, copies the dirty state from the PMD to the PTEs, iterates over the PTEs to locate the dirty state, and then marks the THP as swap-backed. This process involves unnecessary PMD splitting and redundant iteration. Instead, this functionality can be efficiently managed in __discard_anon_folio_pmd_locked(), avoiding the extra steps and improving performance. The following microbenchmark redirties folios after invoking MADV_FREE, then measures the time taken to perform memory reclamation (actually set those folios swapbacked again) on the redirtied folios. #include <stdio.h> #include <sys/mman.h> #include <string.h> #include <time.h> #define SIZE 128*1024*1024 // 128 MB int main(int argc, char *argv[]) { while(1) { volatile int *p = mmap(0, SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); memset((void *)p, 1, SIZE); madvise((void *)p, SIZE, MADV_FREE); /* redirty after MADV_FREE */ memset((void *)p, 1, SIZE); clock_t start_time = clock(); madvise((void *)p, SIZE, MADV_PAGEOUT); clock_t end_time = clock(); double elapsed_time = (double)(end_time - start_time) / CLOCKS_PER_SEC; printf("Time taken by reclamation: %f seconds\n", elapsed_time); munmap((void *)p, SIZE); } return 0; } Testing results are as below, w/o patch: ~ # ./a.out Time taken by reclamation: 0.007300 seconds Time taken by reclamation: 0.007226 seconds Time taken by reclamation: 0.007295 seconds Time taken by reclamation: 0.007731 seconds Time taken by reclamation: 0.007134 seconds Time taken by reclamation: 0.007285 seconds Time taken by reclamation: 0.007720 seconds Time taken by reclamation: 0.007128 seconds Time taken by reclamation: 0.007710 seconds Time taken by reclamation: 0.007712 seconds Time taken by reclamation: 0.007236 seconds Time taken by reclamation: 0.007690 seconds Time taken by reclamation: 0.007174 seconds Time taken by reclamation: 0.007670 seconds Time taken by reclamation: 0.007169 seconds Time taken by reclamation: 0.007305 seconds Time taken by reclamation: 0.007432 seconds Time taken by reclamation: 0.007158 seconds Time taken by reclamation: 0.007133 seconds … w/ patch ~ # ./a.out Time taken by reclamation: 0.002124 seconds Time taken by reclamation: 0.002116 seconds Time taken by reclamation: 0.002150 seconds Time taken by reclamation: 0.002261 seconds Time taken by reclamation: 0.002137 seconds Time taken by reclamation: 0.002173 seconds Time taken by reclamation: 0.002063 seconds Time taken by reclamation: 0.002088 seconds Time taken by reclamation: 0.002169 seconds Time taken by reclamation: 0.002124 seconds Time taken by reclamation: 0.002111 seconds Time taken by reclamation: 0.002224 seconds Time taken by reclamation: 0.002297 seconds Time taken by reclamation: 0.002260 seconds Time taken by reclamation: 0.002246 seconds Time taken by reclamation: 0.002272 seconds Time taken by reclamation: 0.002277 seconds Time taken by reclamation: 0.002462 seconds … This patch significantly speeds up try_to_unmap_one() by allowing it to skip redirtied THPs without splitting the PMD. Link: https://lkml.kernel.org/r/20250214093015.51024-5-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Suggested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Suggested-by: Lance Yang <ioworker0@gmail.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Lance Yang <ioworker0@gmail.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chis Li <chrisl@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Gavin Shan <gshan@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Kairui Song <kasong@tencent.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mauricio Faria de Oliveira <mfo@canonical.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shaoqin Huang <shahuang@redhat.com> Cc: Tangquan Zheng <zhengtangquan@oppo.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Cc: Yicong Yang <yangyicong@hisilicon.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm: support batched unmap for lazyfree large folios during reclamationBarry Song
Currently, the PTEs and rmap of a large folio are removed one at a time. This is not only slow but also causes the large folio to be unnecessarily added to deferred_split, which can lead to races between the deferred_split shrinker callback and memory reclamation. This patch releases all PTEs and rmap entries in a batch. Currently, it only handles lazyfree large folios. The below microbench tries to reclaim 128MB lazyfree large folios whose sizes are 64KiB: #include <stdio.h> #include <sys/mman.h> #include <string.h> #include <time.h> #define SIZE 128*1024*1024 // 128 MB unsigned long read_split_deferred() { FILE *file = fopen("/sys/kernel/mm/transparent_hugepage" "/hugepages-64kB/stats/split_deferred", "r"); if (!file) { perror("Error opening file"); return 0; } unsigned long value; if (fscanf(file, "%lu", &value) != 1) { perror("Error reading value"); fclose(file); return 0; } fclose(file); return value; } int main(int argc, char *argv[]) { while(1) { volatile int *p = mmap(0, SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); memset((void *)p, 1, SIZE); madvise((void *)p, SIZE, MADV_FREE); clock_t start_time = clock(); unsigned long start_split = read_split_deferred(); madvise((void *)p, SIZE, MADV_PAGEOUT); clock_t end_time = clock(); unsigned long end_split = read_split_deferred(); double elapsed_time = (double)(end_time - start_time) / CLOCKS_PER_SEC; printf("Time taken by reclamation: %f seconds, split_deferred: %ld\n", elapsed_time, end_split - start_split); munmap((void *)p, SIZE); } return 0; } w/o patch: ~ # ./a.out Time taken by reclamation: 0.177418 seconds, split_deferred: 2048 Time taken by reclamation: 0.178348 seconds, split_deferred: 2048 Time taken by reclamation: 0.174525 seconds, split_deferred: 2048 Time taken by reclamation: 0.171620 seconds, split_deferred: 2048 Time taken by reclamation: 0.172241 seconds, split_deferred: 2048 Time taken by reclamation: 0.174003 seconds, split_deferred: 2048 Time taken by reclamation: 0.171058 seconds, split_deferred: 2048 Time taken by reclamation: 0.171993 seconds, split_deferred: 2048 Time taken by reclamation: 0.169829 seconds, split_deferred: 2048 Time taken by reclamation: 0.172895 seconds, split_deferred: 2048 Time taken by reclamation: 0.176063 seconds, split_deferred: 2048 Time taken by reclamation: 0.172568 seconds, split_deferred: 2048 Time taken by reclamation: 0.171185 seconds, split_deferred: 2048 Time taken by reclamation: 0.170632 seconds, split_deferred: 2048 Time taken by reclamation: 0.170208 seconds, split_deferred: 2048 Time taken by reclamation: 0.174192 seconds, split_deferred: 2048 ... w/ patch: ~ # ./a.out Time taken by reclamation: 0.074231 seconds, split_deferred: 0 Time taken by reclamation: 0.071026 seconds, split_deferred: 0 Time taken by reclamation: 0.072029 seconds, split_deferred: 0 Time taken by reclamation: 0.071873 seconds, split_deferred: 0 Time taken by reclamation: 0.073573 seconds, split_deferred: 0 Time taken by reclamation: 0.071906 seconds, split_deferred: 0 Time taken by reclamation: 0.073604 seconds, split_deferred: 0 Time taken by reclamation: 0.075903 seconds, split_deferred: 0 Time taken by reclamation: 0.073191 seconds, split_deferred: 0 Time taken by reclamation: 0.071228 seconds, split_deferred: 0 Time taken by reclamation: 0.071391 seconds, split_deferred: 0 Time taken by reclamation: 0.071468 seconds, split_deferred: 0 Time taken by reclamation: 0.071896 seconds, split_deferred: 0 Time taken by reclamation: 0.072508 seconds, split_deferred: 0 Time taken by reclamation: 0.071884 seconds, split_deferred: 0 Time taken by reclamation: 0.072433 seconds, split_deferred: 0 Time taken by reclamation: 0.071939 seconds, split_deferred: 0 ... Link: https://lkml.kernel.org/r/20250214093015.51024-4-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chis Li <chrisl@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Gavin Shan <gshan@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Kairui Song <kasong@tencent.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mauricio Faria de Oliveira <mfo@canonical.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shaoqin Huang <shahuang@redhat.com> Cc: Tangquan Zheng <zhengtangquan@oppo.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Cc: Yicong Yang <yangyicong@hisilicon.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm: support tlbbatch flush for a range of PTEsBarry Song
This patch lays the groundwork for supporting batch PTE unmapping in try_to_unmap_one(). It introduces range handling for TLB batch flushing, with the range currently set to the size of PAGE_SIZE. The function __flush_tlb_range_nosync() is architecture-specific and is only used within arch/arm64. This function requires the mm structure instead of the vma structure. To allow its reuse by arch_tlbbatch_add_pending(), which operates with mm but not vma, this patch modifies the argument of __flush_tlb_range_nosync() to take mm as its parameter. Link: https://lkml.kernel.org/r/20250214093015.51024-3-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Acked-by: Will Deacon <will@kernel.org> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shaoqin Huang <shahuang@redhat.com> Cc: Gavin Shan <gshan@redhat.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Yicong Yang <yangyicong@hisilicon.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Chis Li <chrisl@kernel.org> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Kairui Song <kasong@tencent.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mauricio Faria de Oliveira <mfo@canonical.com> Cc: Tangquan Zheng <zhengtangquan@oppo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm: set folio swapbacked iff folios are dirty in try_to_unmap_oneBarry Song
Patch series "mm: batched unmap lazyfree large folios during reclamation", v4. Commit 735ecdfaf4e8 ("mm/vmscan: avoid split lazyfree THP during shrink_folio_list()") prevents the splitting of MADV_FREE'd THP in madvise.c. However, those folios are still added to the deferred_split list in try_to_unmap_one() because we are unmapping PTEs and removing rmap entries one by one. Firstly, this has rendered the following counter somewhat confusing, /sys/kernel/mm/transparent_hugepage/hugepages-size/stats/split_deferred The split_deferred counter was originally designed to track operations such as partial unmap or madvise of large folios. However, in practice, most split_deferred cases arise from memory reclamation of aligned lazyfree mTHPs as observed by Tangquan. This discrepancy has made the split_deferred counter highly misleading. Secondly, this approach is slow because it requires iterating through each PTE and removing the rmap one by one for a large folio. In fact, all PTEs of a pte-mapped large folio should be unmapped at once, and the entire folio should be removed from the rmap as a whole. Thirdly, it also increases the risk of a race condition where lazyfree folios are incorrectly set back to swapbacked, as a speculative folio_get may occur in the shrinker's callback. deferred_split_scan() might call folio_try_get(folio) since we have added the folio to split_deferred list while removing rmap for the 1st subpage, and while we are scanning the 2nd to nr_pages PTEs of this folio in try_to_unmap_one(), the entire mTHP could be transitioned back to swap-backed because the reference count is incremented, which can make "ref_count == 1 + map_count" within try_to_unmap_one() false. /* * The only page refs must be one from isolation * plus the rmap(s) (dropped by discard:). */ if (ref_count == 1 + map_count && (!folio_test_dirty(folio) || ... (vma->vm_flags & VM_DROPPABLE))) { dec_mm_counter(mm, MM_ANONPAGES); goto discard; } This patchset resolves the issue by marking only genuinely dirty folios as swap-backed, as suggested by David, and transitioning to batched unmapping of entire folios in try_to_unmap_one(). Consequently, the deferred_split count drops to zero, and memory reclamation performance improves significantly — reclaiming 64KiB lazyfree large folios is now 2.5x faster(The specific data is embedded in the changelog of patch 3/4). By the way, while the patchset is primarily aimed at PTE-mapped large folios, Baolin and Lance also found that try_to_unmap_one() handles lazyfree redirtied PMD-mapped large folios inefficiently — it splits the PMD into PTEs and iterates over them. This patchset removes the unnecessary splitting, enabling us to skip redirtied PMD-mapped large folios 3.5X faster during memory reclamation. (The specific data can be found in the changelog of patch 4/4). This patch (of 4): The refcount may be temporarily or long-term increased, but this does not change the fundamental nature of the folio already being lazy- freed. Therefore, we only reset 'swapbacked' when we are certain the folio is dirty and not droppable. Link: https://lkml.kernel.org/r/20250214093015.51024-1-21cnbao@gmail.com Link: https://lkml.kernel.org/r/20250214093015.51024-2-21cnbao@gmail.com Fixes: 6c8e2a256915 ("mm: fix race between MADV_FREE reclaim and blkdev direct IO read") Signed-off-by: Barry Song <v-songbaohua@oppo.com> Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Lance Yang <ioworker0@gmail.com> Cc: Mauricio Faria de Oliveira <mfo@canonical.com> Cc: Chis Li <chrisl@kernel.org> (Google) Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Kairui Song <kasong@tencent.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Tangquan Zheng <zhengtangquan@oppo.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Gavin Shan <gshan@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Shaoqin Huang <shahuang@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Cc: Yicong Yang <yangyicong@hisilicon.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16filemap: remove redundant folio_test_large check in filemap_free_folioGuanjun
The folio_test_large() check in filemap_free_folio() is unnecessary because folio_nr_pages(), which is called internally already performs this check. Removing the redundant condition simplifies the code and avoids double validation. This change improves code readability and reduces unnecessary operations in the folio freeing path. Link: https://lkml.kernel.org/r/20250213055612.490993-1-guanjun@linux.alibaba.com Signed-off-by: Guanjun <guanjun@linux.alibaba.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16maple_tree: remove a BUG_ON() in mas_alloc_nodes()Petr Tesarik
Remove a BUG_ON() right before a WARN_ON() with the same condition. Calling WARN_ON() and BUG_ON() here is definitely wrong. Since the goal is generally to remove BUG_ON() invocations from the kernel, keep only the WARN_ON(). Link: https://lkml.kernel.org/r/20250213114453.1078318-1-ptesarik@suse.com Fixes: 067311d33e65 ("maple_tree: separate ma_state node from status") Signed-off-by: Petr Tesarik <ptesarik@suse.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16tools/selftests: add file/shmem-backed mapping guard region testsLorenzo Stoakes
Extend the guard region self tests to explicitly assert that guard regions work correctly for functionality specific to file-backed and shmem mappings. In addition to testing all of the existing guard region functionality that is currently tested against anonymous mappings against file-backed and shmem mappings (except those which are exclusive to anonymous mapping), we now also: * Test that MADV_SEQUENTIAL does not cause unexpected readahead behaviour. * Test that MAP_PRIVATE behaves as expected with guard regions installed in both a shared and private mapping of an fd. * Test that a read-only file can correctly establish guard regions. * Test a probable fault-around case does not interfere with guard regions (or vice-versa). * Test that truncation does not eliminate guard regions. * Test that hole punching functions as expected in the presence of guard regions. * Test that a read-only mapping of a memfd write sealed mapping can have guard regions established within it and function correctly without violation of the seal. * Test that guard regions installed into a mapping of the anonymous zero page function correctly. Link: https://lkml.kernel.org/r/90c16bec5fcaafcd1700dfa3e9988c3e1aa9ac1d.1739469950.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16tools/selftests: expand all guard region tests to file-backedLorenzo Stoakes
Extend the guard region tests to allow for test fixture variants for anon, shmem, and local file files. This allows us to assert that each of the expected behaviours of anonymous memory also applies correctly to file-backed (both shmem and an a file created locally in the current working directory) and thus asserts the same correctness guarantees as all the remaining tests do. The fixture teardown is now performed in the parent process rather than child forked ones, meaning cleanup is always performed, including unlinking any generated temporary files. Additionally the variant fixture data type now contains an enum value indicating the type of backing store and the mmap() invocation is abstracted to allow for the mapping of whichever backing store the variant is testing. We adjust tests as necessary to account for the fact they may now reference files rather than anonymous memory. Link: https://lkml.kernel.org/r/ab42228d2bd9b8aa18e9faebcd5c88732a7e5820.1739469950.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16selftests/mm: rename guard-pages to guard-regionsLorenzo Stoakes
The feature formerly referred to as guard pages is more correctly referred to as 'guard regions', as in fact no pages are ever allocated in the process of installing the regions. To avoid confusion, rename the tests accordingly. [lorenzo.stoakes@oracle.com: fix guard regions invocation] Link: https://lkml.kernel.org/r/13426c71-d069-4407-9340-b227ff8b8736@lucifer.local Link: https://lkml.kernel.org/r/1c3cd04a3f69b5756b94bda701ac88325a9be18b.1739469950.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm: allow guard regions in file-backed and read-only mappingsLorenzo Stoakes
Patch series "mm: permit guard regions for file-backed/shmem mappings". The guard regions feature was initially implemented to support anonymous mappings only, excluding shmem. This was done so as to introduce the feature carefully and incrementally and to be conservative when considering the various caveats and corner cases that are applicable to file-backed mappings but not to anonymous ones. Now this feature has landed in 6.13, it is time to revisit this and to extend this functionality to file-backed and shmem mappings. In order to make this maximally useful, and since one may map file-backed mappings read-only (for instance ELF images), we also remove the restriction on read-only mappings and permit the establishment of guard regions in any non-hugetlb, non-mlock()'d mapping. It is permissible to permit the establishment of guard regions in read-only mappings because the guard regions only reduce access to the mapping, and when removed simply reinstate the existing attributes of the underlying VMA, meaning no access violations can occur. While the change in kernel code introduced in this series is small, the majority of the effort here is spent in extending the testing to assert that the feature works correctly across numerous file-backed mapping scenarios. Every single guard region self-test performed against anonymous memory (which is relevant and not anon-only) has now been updated to also be performed against shmem and a mapping of a file in the working directory. This confirms that all cases also function correctly for file-backed guard regions. In addition a number of other tests are added for specific file-backed mapping scenarios. There are a number of other concerns that one might have with regard to guard regions, addressed below: Readahead ~~~~~~~~~ Readahead is a process through which the page cache is populated on the assumption that sequential reads will occur, thus amortising I/O and, through a clever use of the PG_readahead folio flag establishing during major fault and checked upon minor fault, provides for asynchronous I/O to occur as dat is processed, reducing I/O stalls as data is faulted in. Guard regions do not alter this mechanism which operates at the folio and fault level, but does of course prevent the faulting of folios that would otherwise be mapped. In the instance of a major fault prior to a guard region, synchronous readahead will occur including populating folios in the page cache which the guard regions will, in the case of the mapping in question, prevent access to. In addition, if PG_readahead is placed in a folio that is now inaccessible, this will prevent asynchronous readahead from occurring as it would otherwise do. However, there are mechanisms for heuristically resetting this within readahead regardless, which will 'recover' correct readahead behaviour. Readahead presumes sequential data access, the presence of a guard region clearly indicates that, at least in the guard region, no such sequential access will occur, as it cannot occur there. So this should have very little impact on any real workload. The far more important point is as to whether readahead causes incorrect or inappropriate mapping of ranges disallowed by the presence of guard regions - this is not the case, as readahead does not 'pre-fault' memory in this fashion. At any rate, any mechanism which would attempt to do so would hit the usual page fault paths, which correctly handle PTE markers as with anonymous mappings. Fault-Around ~~~~~~~~~~~~ The fault-around logic, in a similar vein to readahead, attempts to improve efficiency with regard to file-backed memory mappings, however it differs in that it does not try to fetch folios into the page cache that are about to be accessed, but rather pre-maps a range of folios around the faulting address. Guard regions making use of PTE markers makes this relatively trivial, as this case is already handled - see filemap_map_folio_range() and filemap_map_order0_folio() - in both instances, the solution is to simply keep the established page table mappings and let the fault handler take care of PTE markers, as per the comment: /* * NOTE: If there're PTE markers, we'll leave them to be * handled in the specific fault path, and it'll prohibit * the fault-around logic. */ This works, as establishing guard regions results in page table mappings with PTE markers, and clearing them removes them. Truncation ~~~~~~~~~~ File truncation will not eliminate existing guard regions, as the truncation operation will ultimately zap the range via unmap_mapping_range(), which specifically excludes PTE markers. Zapping ~~~~~~~ Zapping is, as with anonymous mappings, handled by zap_nonpresent_ptes(), which specifically deals with guard entries, leaving them intact except in instances such as process teardown or munmap() where they need to be removed. Reclaim ~~~~~~~ When reclaim is performed on file-backed folios, it ultimately invokes try_to_unmap_one() via the rmap. If the folio is non-large, then map_pte() will ultimately abort the operation for the guard region mapping. If large, then check_pte() will determine that this is a non-device private entry/device-exclusive entry 'swap' PTE and thus abort the operation in that instance. Therefore, no odd things happen in the instance of reclaim being attempted upon a file-backed guard region. Hole Punching ~~~~~~~~~~~~~ This updates the page cache and ultimately invokes unmap_mapping_range(), which explicitly leaves PTE markers in place. Because the establishment of guard regions zapped any existing mappings to file-backed folios, once the guard regions are removed then the hole-punched region will be faulted in as usual and everything will behave as expected. One thing to note with this series is that it now implies file-backed VMAs which install guard regions will now have an anon_vma installed if not already present (i.e. if not post-CoW MAP_PRIVATE). I have audited kernel source for instances of vma->anon_vma checks and found nowhere where this would be problematic for pure file-backed mappings. I also discussed (off-list) with Matthew who confirmed he can't see any issue with this. In effect, we treat these VMAs as if they are MAP_PRIVATE, only with 0 CoW'd pages. As a result, the rmap never has a reason to reference the anon_vma from folios at any point and thus no unexpected or weird behaviour results. The anon_vma logic tries to avoid unnecessary anon_vma propagation on fork so we ought to at least minimise overhead. However, this is still overhead, and unwelcome overhead. The whole reason we do this (in madvise_guard_install()) is to ensure that fork _copies page tables_. Otherwise, in vma_needs_copy(), nothing will indicate that we should do so. This was already an unpleasant thing to have to do, but without a new VMA flag we really have no reasonable means of ensuring this happens. Going forward, I intend to add a new VMA flag, VM_MAYBE_GUARDED or something like this. This would have specific behaviour - for the purposes of merging, it would be ignored. However on both split and merge, it will be propagated. It is therefore 'sticky'. This is to avoid having to traverse page tables to determine which parts of a VMA contain guard regions and of course to maintain the desirable qualities of guard regions - the lack of VMA propagation (+ thus slab allocations of VMAs). Adding this flag and adjusting vma_needs_copy() to reference it would resolve the issue. However :) we have a VMA flag space issue, so it'd render this a 64-bit feature only. Having discussed with Matthew a plan by which to perhaps extend available flags for 32-bit we may going forward be able to avoid this. But this may be a longer term project. In the meantime, we'd have to resort to the anon_vma hack for 32-bit, using the flag for 64-bit. The issue with this however is if we do then intend to allow the flag to enable /proc/$pid/maps visibility (something this could allow), it would also end up being 64-bit only which would be a pity. Regardless - I wanted to highlight this behaviour as it is perhaps somewhat surprising. This patch (of 4): There is no reason to disallow guard regions in file-backed mappings - readahead and fault-around both function correctly in the presence of PTE markers, equally other operations relating to memory-mapped files function correctly. Additionally, read-only mappings if introducing guard-regions, only restrict the mapping further, which means there is no violation of any access rights by permitting this to be so. Removing this restriction allows for read-only mapped files (such as executable files) correctly which would otherwise not be permitted. Link: https://lkml.kernel.org/r/cover.1739469950.git.lorenzo.stoakes@oracle.com Link: https://lkml.kernel.org/r/d885cb259174736c2830a5dfe07f81b214ef3faa.1739469950.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm/mm_init.c: use round_up() to calculate usermap sizeWei Yang
Since pageblock_nr_pages and BITS_PER_LONG are power of 2, we could use round_up() to calculate it. Also we have renamed blockflags to pageblock_flags, adjust the comment accordingly. Link: https://lkml.kernel.org/r/20250212013818.873-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Suggested-by: Shivank Garg <shivankg@amd.com> Reviewed-by: Shivank Garg <shivankg@amd.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16selftests/mm: allow tests to run with no huge pages supportMark Brown
Currently the mm selftests refuse to run if huge pages are not available in the current system but this is an optional feature and not all the tests actually require them. Change the test during startup to be non-fatal and skip or omit tests which actually rely on having huge pages, allowing the other tests to be run. The gup_test does support using madvise() to configure huge pages but it ignores the error code so we just let it run. Link: https://lkml.kernel.org/r/20250212-kselftest-mm-no-hugepages-v1-2-44702f538522@kernel.org Signed-off-by: Mark Brown <broonie@kernel.org> Reviewed-by: Nico Pache <npache@redhat.com> Cc: Mariano Pache <npache@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm/mmu_gather: clean up the stale code commentBaoquan He
In commit d7f861b9c43a ("mm/mmu_gather: add __tlb_remove_folio_pages()"), helper function __tlb_remove_folio_pages_size() was added. And based on the helper, wrapper functions __tlb_remove_folio_pages() and __tlb_remove_page_size() are created and used by upper level functions. So let's update the code comment to reflect the current code about tlb_remove_page()/tlb_remove_page_size(), etc. Link: https://lkml.kernel.org/r/20250211034348.39531-2-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm/mmu_gather: remove unused __tlb_remove_page()Baoquan He
Nobody is using __tlb_remove_page() now, clean it up. And also remove the code comment above tlb_remove_page() because it's not meaningful any more. Link: https://lkml.kernel.org/r/Z6si0/A/zzEF/bFJ@MiWiFi-R3L-srv Signed-off-by: Baoquan He <bhe@redhat.com> Reviewed-by: Qi Zheng <zhengqi.arch@bytedance.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16maple_tree: use ma_dead_node() in mte_dead_node()I Hsin Cheng
Utilize ma_dead_node() in mte_dead_node(). It can prevent decoding the maple enode for a second time. Use the "node" to find parent for comparison. Link: https://lkml.kernel.org/r/20250211071850.330632-1-richard120310@gmail.com Signed-off-by: I Hsin Cheng <richard120310@gmail.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw> Cc: Shuah khan <skhan@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm/mm_init.c: only align start of ZONE_MOVABLE on nodes with memoryWei Yang
At the beginning of find_zone_movable_pfns_for_nodes(), it has properly set node_states[N_MEMORY] in early_calculate_totalpages(). Instead of iterating over all possible nodes, we can just do the alignment on nodes with memory. Link: https://lkml.kernel.org/r/20250211082900.10877-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16Docs/admin-guide/mm/damon/usage: document hugepage_size filter typeUsama Arif
This includes both the 'hugepage_size' filter type and the min/max files used to decide range of sizes to filter on. Link: https://lkml.kernel.org/r/20250211124437.278873-5-usamaarif642@gmail.com Signed-off-by: Usama Arif <usamaarif642@gmail.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16Docs/ABI/damon: document DAMOS sysfs files to set the min/max folio_sizeUsama Arif
This will be used to decide the min and max folio size to operate on for pa_stat. Link: https://lkml.kernel.org/r/20250211124437.278873-4-usamaarif642@gmail.com Signed-off-by: Usama Arif <usamaarif642@gmail.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm/damon/sysfs-schemes: add files for setting damos_filter->sz_rangeUsama Arif
Add min and max files for damon filters to let the userspace decide the min/max folio size to operate on. This will be needed to decide what folio sizes to give pa_stat for. Link: https://lkml.kernel.org/r/20250211124437.278873-3-usamaarif642@gmail.com Signed-off-by: Usama Arif <usamaarif642@gmail.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm/damon: introduce DAMOS filter type hugepage_sizeUsama Arif
Patch series "mm/damon: add support for hugepage_size DAMOS filter", v5. hugepage_size DAMOS filter can be used to gather statistics to check if memory regions of specific access tempratures are backed by hugepages of a size in a specific range. This filter can help to observe and prove the effectivenes of different schemes for shrinking/collapsing hugepages. This patch (of 4): This is to gather statistics to check if memory regions of specific access tempratures are backed by pages of a size in a specific range. This filter can help to observe and prove the effectivenes of different schemes for shrinking/collapsing hugepages. [sj@kernel.org: add kernel-doc comment for damos_filter->sz_range] Link: https://lkml.kernel.org/r/20250218223058.52459-1-sj@kernel.org Link: https://lkml.kernel.org/r/20250211124437.278873-1-usamaarif642@gmail.com Link: https://lkml.kernel.org/r/20250211124437.278873-2-usamaarif642@gmail.com Signed-off-by: Usama Arif <usamaarif642@gmail.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Usama Arif <usamaarif642@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm/mmu_gather: update comment on RCU freeingBrendan Jackman
Some recent discussion on LMKL [0] brought up some interesting and useful additional context on RCU-freeing for pagetables. Note down some extra info in here, in particular a) be concrete about the reason why an arch might not have an IPI and b) add the interesting paravirt details. [0] https://lore.kernel.org/linux-kernel/20250206044346.3810242-2-riel@surriel.com/ Link: https://lkml.kernel.org/r/20250211-mmugather-comment-v1-1-1ac1e0c765d2@google.com Signed-off-by: Brendan Jackman <jackmanb@google.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Brendan Jackman <jackmanb@google.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm/vmstat: revert "fix a W=1 clang compiler warning"Bart Van Assche
Commit 75b5ab134bb5 enabled -Wenum-enum-conversion in W=1 builds. Commit 30c2de0a267c ("mm/vmstat: fix a W=1 clang compiler warning") fixed a -Wenum-enum-conversion warning. Commit 8f6629c004b1 removed the -Wenum-enum-conversion option again from W=1 builds. Since the W=1 compiler warning fix is no longer necessary, revert it. Link: https://lkml.kernel.org/r/20250211205911.1707684-1-bvanassche@acm.org Signed-off-by: Bart Van Assche <bvanassche@acm.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ivan Shapovalov <intelfx@intelfx.name> Cc: David Laight <david.laight@aculab.com> Cc: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16fb_defio: do not use deprecated page->mapping, index fieldsLorenzo Stoakes
With the introduction of mapping_wrprotect_range() there is no need to use folio_mkclean() in order to write-protect mappings of frame buffer pages, and therefore no need to inappropriately set kernel-allocated page->index, mapping fields to permit this operation. Instead, store the pointer to the page cache object for the mapped driver in the fb_deferred_io object, and use the already stored page offset from the pageref object to look up mappings in order to write-protect them. This is justified, as for the page objects to store a mapping pointer at the point of assignment of pages, they must all reference the same underlying address_space object. Since the life time of the pagerefs is also the lifetime of the fb_deferred_io object, storing the pointer here makes sense. This eliminates the need for all of the logic around setting and maintaining page->index,mapping which we remove. This eliminates the use of folio_mkclean() entirely but otherwise should have no functional change. [lorenzo.stoakes@oracle.com: fixup unused variable warnings] Link: https://lkml.kernel.org/r/d4018405-2762-4385-a816-e54cc23839ac@lucifer.local Link: https://lkml.kernel.org/r/81171ab16c14e3df28f6de9d14982cee528d8519.1739029358.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Tested-by: Kajtar Zsolt <soci@c64.rulez.org> Acked-by: Thomas Zimmermann <tzimmermann@suse.de> Cc: David Hildenbrand <david@redhat.com> Cc: Helge Deller <deller@gmx.de> Cc: Jaya Kumar <jayakumar.lkml@gmail.com> Cc: Maíra Canal <mcanal@igalia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Simona Vetter <simona.vetter@ffwll.ch> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm: provide mapping_wrprotect_range() functionLorenzo Stoakes
In the fb_defio video driver, page dirty state is used to determine when frame buffer pages have been changed, allowing for batched, deferred I/O to be performed for efficiency. This implementation had only one means of doing so effectively - the use of the folio_mkclean() function. However, this use of the function is inappropriate, as the fb_defio implementation allocates kernel memory to back the framebuffer, and then is forced to specified page->index, mapping fields in order to permit the folio_mkclean() rmap traversal to proceed correctly. It is not correct to specify these fields on kernel-allocated memory, and moreover since these are not folios, page->index, mapping are deprecated fields, soon to be removed. We therefore need to provide a means by which we can correctly traverse the reverse mapping and write-protect mappings for a page backing an address_space page cache object at a given offset. This patch provides this - mapping_wrprotect_range() - which allows for this operation to be performed for a specified address_space, offset, PFN and size, without requiring a folio nor, of course, an inappropriate use of page->index, mapping. With this provided, we can subsequently adjust the fb_defio implementation to make use of this function and avoid incorrect invocation of folio_mkclean() and more importantly, incorrect manipulation of page->index and mapping fields. Link: https://lkml.kernel.org/r/e5bf969d64e7f2f2ae944d42341fc8994b736a81.1739029358.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: Helge Deller <deller@gmx.de> Cc: Jaya Kumar <jayakumar.lkml@gmail.com> Cc: Kajtar Zsolt <soci@c64.rulez.org> Cc: Maíra Canal <mcanal@igalia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Simona Vetter <simona.vetter@ffwll.ch> Cc: Thomas Zimemrmann <tzimmermann@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm: refactor rmap_walk_file() to separate out traversal logicLorenzo Stoakes
Patch series "expose mapping wrprotect, fix fb_defio use", v3. Right now the only means by which we can write-protect a range using the reverse mapping is via folio_mkclean(). However this is not always the appropriate means of doing so, specifically in the case of the framebuffer deferred I/O logic (fb_defio enabled by CONFIG_FB_DEFERRED_IO). There, kernel pages are mapped read-only and write-protect faults used to batch up I/O operations. Each time the deferred work is done, folio_mkclean() is used to mark the framebuffer page as having had I/O performed on it. However doing so requires the kernel page (perhaps allocated via vmalloc()) to have its page->mapping, index fields set so the rmap can find everything that maps it in order to write-protect. This is problematic as firstly, these fields should not be set for kernel-allocated memory, and secondly these are not folios (it's not user memory) and page->index, mapping fields are now deprecated and soon to be removed. The removal of these fields is imminent, rendering this series more urgent than it might first appear. The implementers cannot be blamed for having used this however, as there is simply no other way of performing this operation correctly. This series fixes this - we provide the mapping_wrprotect_range() function to allow the reverse mapping to be used to look up mappings from the page cache object (i.e. its address_space pointer) at a specific offset. The fb_defio logic already stores this offset, and can simply be expanded to keep track of the page cache object, so the change then becomes straight-forward. This series should have no functional change. This patch (of 3): In order to permit the traversal of the reverse mapping at a specified mapping and offset rather than those specified by an input folio, we need to separate out the portion of the rmap file logic which deals with this traversal from those parts of the logic which interact with the folio. This patch achieves this by adding a new static __rmap_walk_file() function which rmap_walk_file() invokes. This function permits the ability to pass NULL folio, on the assumption that the caller has provided for this correctly in the callbacks specified in the rmap_walk_control object. Though it provides for this, and adds debug asserts to ensure that, should a folio be specified, these are equal to the mapping and offset specified in the folio, there should be no functional change as a result of this patch. The reason for adding this is to enable for future changes to permit users to be able to traverse mappings of userland-mapped kernel memory, write-protecting those mappings to enable page_mkwrite() or pfn_mkwrite() fault handlers to be retriggered on subsequent dirty. Link: https://lkml.kernel.org/r/cover.1739029358.git.lorenzo.stoakes@oracle.com Link: https://lkml.kernel.org/r/0d1acec0cba1e5a12f9b53efcabc397541c90517.1739029358.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: Helge Deller <deller@gmx.de> Cc: Jaya Kumar <jayakumar.lkml@gmail.com> Cc: Kajtar Zsolt <soci@c64.rulez.org> Cc: Maíra Canal <mcanal@igalia.com> Cc: Matthew Wilcox <willy@infradead.org> [English fixes] Cc: Simona Vetter <simona.vetter@ffwll.ch> Cc: Thomas Zimemrmann <tzimmermann@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>