Patch series "mm: introduce snapshot_page()", v3.
This series introduces snapshot_page(), a helper function that can be used
to create a snapshot of a struct page and its associated struct folio.
This function is intended to provide callers with a consistent view of a
folio while reducing the chance of encountering partially updated or
inconsistent state, such as during folio splitting, which could lead to
crashes and BUG_ON()s being triggered.
This patch (of 4):
Let's avoid working with the PMD when not required. If
vm_normal_page_pmd() were called on something that is not a present PMD,
it would already be a bug (the pfn possibly being garbage).
While at it, let's support passing in any pfn covered by the huge zero
folio by masking off PFN bits -- which should be rather cheap.
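To illustrate the masking idea, here is a minimal sketch (the helper name
is hypothetical, not the actual patch; it assumes the existing
huge_zero_folio global and the folio_pfn()/HPAGE_PMD_NR helpers):

    /* Sketch: any pfn inside the huge zero folio matches once the
     * sub-folio bits are masked off -- a single cheap AND. */
    static inline bool pfn_is_huge_zero(unsigned long pfn)
    {
            return (pfn & ~(HPAGE_PMD_NR - 1)) ==
                   folio_pfn(huge_zero_folio);
    }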
Link: https://lkml.kernel.org/r/cover.1752499009.git.luizcap@redhat.com
Link: https://lkml.kernel.org/r/4940826e99f0c709a7cf7beb94f53288320aea5a.1752499009.git.luizcap@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Luiz Capitulino <luizcap@redhat.com>
Reviewed-by: Shivank Garg <shivankg@amd.com>
Tested-by: Harry Yoo <harry.yoo@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
As cluster_next_cpu was already dropped, the associated comment is stale
now.
Link: https://lkml.kernel.org/r/20250522122554.12209-5-shikemeng@huaweicloud.com
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
In setup_swap_map(), we only ensure badpages are in range (0, last_page].
As maxpages might be < last_page, setup_clusters() will encounter a buffer
overflow when a badpage is >= maxpages.
Only call inc_cluster_info_page() for badpages that are < maxpages to fix
the issue.
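A minimal sketch of the guard (illustrative; this is the shape of the fix
in setup_clusters(), not the verbatim hunk):

    /* Skip badpages beyond maxpages so inc_cluster_info_page() never
     * indexes past the cluster_info allocation. */
    for (i = 0; i < swap_header->info.nr_badpages; i++) {
            unsigned int page_nr = swap_header->info.badpages[i];

            if (page_nr >= maxpages)
                    continue;
            inc_cluster_info_page(si, cluster_info, page_nr);
    }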
Link: https://lkml.kernel.org/r/20250522122554.12209-4-shikemeng@huaweicloud.com
Fixes: b843786b0bd0 ("mm: swapfile: fix SSD detection with swapfile on btrfs")
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kairui Song <kasong@tencent.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
We use maxpages from read_swap_header() to initialize swap_info_struct.
However, maxpages might be reduced in setup_swap_extents(), and si->max is
then assigned the reduced maxpages from setup_swap_extents().
Obviously, this could lead to memory waste, as we allocated memory based
on the larger maxpages. Besides, it could lead to a potential infinite
loop, as follows:
1) When calling setup_clusters() with the larger maxpages, unavailable
pages within the range [si->max, larger maxpages) are not accounted with
inc_cluster_info_page(). As a result, these pages are assumed available
but cannot be allocated. The cluster containing these pages can be moved
to the frag_clusters list after all of its available pages have been
allocated.
2) When the cluster mentioned in 1) is the only cluster on the
frag_clusters list, cluster_alloc_swap_entry() assumes an order-0
allocation will never fail and enters an infinite loop, repeatedly trying
to allocate a page from the only cluster on frag_clusters, which contains
no actually available page.
Call setup_swap_extents() to get the final maxpages before
swap_info_struct initialization to fix the issue.
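A simplified sketch of the reordering in the swapon path (illustrative
only; allocations and error handling are trimmed):

    maxpages = read_swap_header(si, swap_header, inode);
    nr_extents = setup_swap_extents(si, &span); /* may reduce si->max */
    maxpages = si->max;               /* the final, possibly reduced value */
    swap_map = vzalloc(maxpages);     /* size buffers from the final value */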
After this change, span will include badblocks and will become a larger
value, which I think is the correct value:
In summary, there are two kinds of swapfile_activate operations.
1. Filesystem style: Treat all blocks as logically contiguous and find
usable physical extents in the logical range. In this way, si->pages will
be the actual usable physical blocks and span will be "1 + highest_block -
lowest_block".
2. Block device style: Treat all blocks as physically contiguous, with
only one single extent added. In this way, si->pages will be si->max and
span will be "si->pages - 1". Actually, si->pages and si->max are only
used in block device style, and the span value is set with si->pages. As
a result, the span value in block device style will become a larger value,
as you mentioned.
I think the larger value is correct based on:
1. The span value in filesystem style is "1 + highest_block -
lowest_block", which is the range covering all possible physical blocks,
including the badblocks.
2. For block device style, si->pages is the actual usable block number
and is already in pr_info. The original span value before this patch also
referred to the usable block number, which is redundant in pr_info.
[shikemeng@huaweicloud.com: ensure si->pages == si->max - 1 after setup_swap_extents()]
Link: https://lkml.kernel.org/r/20250522122554.12209-3-shikemeng@huaweicloud.com
Link: https://lkml.kernel.org/r/20250718065139.61989-1-shikemeng@huaweicloud.com
Fixes: 661383c6111a ("mm: swap: relaim the cached parts that got scanned")
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kairui Song <kasong@tencent.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "Some randome fixes and cleanups to swapfile".
Patch 0-3 are some random fixes. Patch 4 is a cleanup. More details can
be found in respective patches.
This patch (of 4):
When folio_alloc_swap() encounters a failure in either
mem_cgroup_try_charge_swap() or add_to_swap_cache(), the nr_swap_pages
counter is not decremented for the allocated entry. However, the
following put_swap_folio() will increase the nr_swap_pages counter without
a paired decrement and lead to an imbalance.
Move the nr_swap_pages decrement from folio_alloc_swap() to
swap_range_alloc() to pair the nr_swap_pages counting.
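A minimal sketch of the pairing (illustrative; only the relevant line is
shown and the function's other bookkeeping is omitted):

    static void swap_range_alloc(struct swap_info_struct *si,
                                 unsigned long offset,
                                 unsigned int nr_entries)
    {
            /* Decrement here, where entries are actually handed out, so
             * the increment done when entries are freed always has a
             * matching decrement. */
            atomic_long_sub(nr_entries, &nr_swap_pages);
    }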
Link: https://lkml.kernel.org/r/20250522122554.12209-1-shikemeng@huaweicloud.com
Link: https://lkml.kernel.org/r/20250522122554.12209-2-shikemeng@huaweicloud.com
Fixes: 0ff67f990bd4 ("mm, swap: remove swap slot cache")
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Document the new DAMON sysfs file, refresh_ms, on the ABI document.
Link: https://lkml.kernel.org/r/20250717055448.56976-5-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Document the new DAMON sysfs file, refresh_ms, on the usage document.
Link: https://lkml.kernel.org/r/20250717055448.56976-4-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Only minimal file operations for the refresh_ms file have been
implemented. Further implement its designed behavior, the periodic update
of the essential files' content, using repeat mode damon_call().
If a non-zero value is written to the file, update the DAMON sysfs files
for auto-tuned monitoring intervals, DAMOS stats, and auto-tuned DAMOS
quota values, which are essential to monitor in most DAMON use cases. The
user-written non-zero value becomes the time delay between updates. If
zero is written to the file, the periodic refresh is disabled.
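For illustration, a user-space snippet that enables a one-second refresh
(the kdamond index 0 in the path is an example; adjust it to your setup):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
            /* ask DAMON sysfs to refresh the essential files every 1000ms */
            int fd = open("/sys/kernel/mm/damon/admin/kdamonds/0/refresh_ms",
                          O_WRONLY);

            if (fd < 0)
                    return 1;
            dprintf(fd, "1000\n");
            return close(fd);
    }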
Link: https://lkml.kernel.org/r/20250717055448.56976-3-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "mm/damon/sysfs: support periodic and automated stats
update".
DAMON sysfs interface provides files for reading DAMON internal status
including auto-tuned monitoring intervals, DAMOS stats, DAMOS action
applied regions, and auto-tuned DAMOS effective quota. Among those,
auto-tuned monitoring intervals, DAMOS stats and auto-tuned DAMOS
effective quota are essential for common DAMON/S use cases.
The contents of the files are not automatically updated, though. Users
should manually request updates of the contents by writing a special
command to the 'state' file of each kdamond directory. This interface is
good for minimizing overhead, but causes the problems below.
First, the usage is cumbersome. This is arguably not a big problem, since
the user-space tool (damo) can do this instead of the user.
Second, it can be too slow. The update request is not directly handled
by the sysfs interface but by the kdamond thread, and kdamond threads wake
up only once per sampling interval. Hence, if the sampling interval is
not short, each update request could take a long time. The recommended
sampling interval setup is to ask DAMON to automatically tune it, within a
range between 5 milliseconds and 10 seconds. On production systems it is
not very rare to end up with a sampling interval of a few seconds as a
result of the auto-tuning, so this can disturb observing DAMON's internal
status.
Finally, parallel update requests can conflict with each other. When
parallel update requests are received, DAMON sysfs interface simply
returns -EBUSY to one of the requests. The DAMON user-space tool hence
implements its own backoff mechanism, but this can make the operation even
slower.
Introduce a new sysfs file, namely refresh_ms, for asking the DAMON sysfs
interface to repeat the update of the above-mentioned essential contents
with a user-specified time delay. If a non-zero value is written to the
file, the DAMON sysfs interface updates the essential DAMON internal
status, including auto-tuned monitoring intervals, DAMOS stats, and
auto-tuned DAMOS quotas, using the user-written value as the time delay.
In other words, it is similar to periodically writing the
'update_schemes_stats', 'update_schemes_effective_quotas', and
'update_tuned_intervals' keywords to the 'state' file. If zero is written
to the file, the automatic refresh is disabled.
This patch (of 4):
Implement a new DAMON sysfs file named 'refresh_ms' under each kdamond
directory. The file will be used as a control knob of automatic refresh
of a few DAMON internal status files. This commit implements only minimum
file operations, though. The automatic refresh feature will be
implemented by the following commit.
Link: https://lkml.kernel.org/r/20250717055448.56976-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20250717055448.56976-2-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
memcg->socket_pressure is initialised with jiffies when the memcg is
created.
Once vmpressure detects that the cgroup is under memory pressure, the
field is updated with jiffies + HZ to signal the fact to the socket layer
and suppress memory allocation for one second.
Otherwise, the field is not updated.
mem_cgroup_under_socket_pressure() uses time_before() to check if jiffies
is less than memcg->socket_pressure, and this has a bug on 32-bit kernels:

    if (time_before(jiffies, memcg->socket_pressure))
            return true;
As time_before() casts the difference to long, the acceptable delta
between the two timestamps is 2 ^ (BITS_PER_LONG - 1).
On 32-bit kernels with CONFIG_HZ=1000, this is about 24 days.
>>> (2 ** 31) / 1000 / 60 / 60 / 24
24.855134814814818
Once 24 days have passed since the last update of socket_pressure,
mem_cgroup_under_socket_pressure() starts to lie until the next 24 days
pass.
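A standalone user-space demonstration of the wrap (the time_before32
macro mirrors what time_before() does on a 32-bit kernel; the values are
made up for illustration):

    #include <stdint.h>
    #include <stdio.h>

    /* 32-bit time_before(): the delta is interpreted as signed 32-bit */
    #define time_before32(a, b) ((int32_t)((uint32_t)(a) - (uint32_t)(b)) < 0)

    int main(void)
    {
            uint32_t pressure = 1000;  /* socket_pressure = jiffies + HZ */
            uint32_t jiffies = 2000;   /* one second later: window over */

            printf("%d\n", time_before32(jiffies, pressure)); /* 0: correct */

            jiffies = pressure + ((uint32_t)1 << 31); /* ~24.8 days later */
            printf("%d\n", time_before32(jiffies, pressure)); /* 1: the lie */
            return 0;
    }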
We don't need to worry about this on 64-bit machines unless they serve for
300 million years.
>>> (2 ** 63) / 1000 / 60 / 60 / 24 / 365
292471208.6775361
Let's convert memcg->socket_pressure to u64.
Performance testing:
I don't have a real 32-bit machine, so this is a result on QEMU, but
with/without the u64 jiffies patch, the time spent in
mem_cgroup_under_socket_pressure() was 1~5us and I didn't see any
measurable delta.
no patch applied:
iperf3 273 [000] 137.296248:
probe:mem_cgroup_under_socket_pressure: (c13660d0)
c13660d1 mem_cgroup_under_socket_pressure+0x1
([kernel.kallsyms])
iperf3 273 [000] 137.296249:
probe:mem_cgroup_under_socket_pressure__return: (c13660d0 <- c1d8fd7f)
iperf3 273 [000] 137.296251:
probe:mem_cgroup_under_socket_pressure: (c13660d0)
c13660d1 mem_cgroup_under_socket_pressure+0x1
([kernel.kallsyms])
iperf3 273 [000] 137.296253:
probe:mem_cgroup_under_socket_pressure__return: (c13660d0 <- c1d8fd7f)
u64 jiffies patch applied:
iperf3 308 [001] 330.669370:
probe:mem_cgroup_under_socket_pressure: (c12ddba0)
c12ddba1 mem_cgroup_under_socket_pressure+0x1
([kernel.kallsyms])
iperf3 308 [001] 330.669371:
probe:mem_cgroup_under_socket_pressure__return: (c12ddba0 <- c1ce98bf)
iperf3 308 [001] 330.669382:
probe:mem_cgroup_under_socket_pressure: (c12ddba0)
c12ddba1 mem_cgroup_under_socket_pressure+0x1
([kernel.kallsyms])
iperf3 308 [001] 330.669384:
probe:mem_cgroup_under_socket_pressure__return: (c12ddba0 <- c1ce98bf)
So the u64 approach is good enough.
Link: https://lkml.kernel.org/r/20250717194645.1096500-1-kuniyu@google.com
Fixes: 8e8ae645249b ("mm: memcontrol: hook up vmpressure to socket pressure")
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reported-by: Neal Cardwell <ncardwell@google.com>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Since commit 4b634918384c ("arm64/mm: Close theoretical race where stale
TLB entry remains valid"), all arches that use tlbbatch for reclaim
(arm64, riscv, x86) implement arch_flush_tlb_batched_pending() with a
flush_tlb_mm().
So let's simplify by removing the unnecessary abstraction and doing the
flush_tlb_mm() directly in flush_tlb_batched_pending(). This effectively
reverts commit db6c1f6f236d ("mm/tlbbatch: introduce
arch_flush_tlb_batched_pending()").
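Roughly, the simplified helper looks like this (a sketch from memory, not
the verbatim hunk; the tlb_flush_batched bookkeeping is abbreviated):

    void flush_tlb_batched_pending(struct mm_struct *mm)
    {
            int batch = atomic_read(&mm->tlb_flush_batched);
            int pending = batch & TLB_FLUSH_BATCH_PENDING_MASK;
            int flushed = batch >> TLB_FLUSH_BATCH_FLUSHED_SHIFT;

            if (pending != flushed) {
                    /* was: arch_flush_tlb_batched_pending(mm) */
                    flush_tlb_mm(mm);
                    /* record that the pending flushes were performed */
                    atomic_cmpxchg(&mm->tlb_flush_batched, batch,
                                   pending | (pending << TLB_FLUSH_BATCH_FLUSHED_SHIFT));
            }
    }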
Link: https://lkml.kernel.org/r/20250609103132.447370-1-ryan.roberts@arm.com
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Suggested-by: Will Deacon <will@kernel.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Will Deacon <will@kernel.org>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
There are no longer any callers of hugetlb_free_pgd_range().
Link: https://lkml.kernel.org/r/20250716012611.10369-4-anthony.yznaga@oracle.com
Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: Oscar Salvador <osalvador@suse.de>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexandre Ghiti <alexghiti@rivosinc.com>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: David Hildenbrand <david@redhat.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
With the removal of the last arch-specific implementation of
hugetlb_free_pgd_range(), hugetlb VMAs no longer need special handling
when freeing page tables.
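The resulting change in free_pgtables() is essentially this (a sketch of
the shape of the hunk, not the verbatim diff):

    -	if (is_vm_hugetlb_page(vma))
    -		hugetlb_free_pgd_range(tlb, addr, vma->vm_end,
    -			floor, next ? next->vm_start : ceiling);
    -	else
    -		free_pgd_range(tlb, addr, vma->vm_end,
    -			floor, next ? next->vm_start : ceiling);
    +	free_pgd_range(tlb, addr, vma->vm_end,
    +		floor, next ? next->vm_start : ceiling);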
Link: https://lkml.kernel.org/r/20250716012611.10369-3-anthony.yznaga@oracle.com
Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: Oscar Salvador <osalvador@suse.de>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexandre Ghiti <alexghiti@rivosinc.com>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: David Hildenbrand <david@redhat.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "drop hugetlb_free_pgd_range()".
For all architectures that support hugetlb except for sparc,
hugetlb_free_pgd_range() just calls free_pgd_range(). It turns out the
sparc implementation is essentially identical to free_pgd_range() and can
be removed. Remove it and update free_pgtables() to treat hugetlb VMAs
the same as others.
This patch (of 3):
The sparc implementation of hugetlb_free_pgd_range() is identical to
free_pgd_range() with the exception of checking for and skipping possible
leaf entries at the PUD and PMD levels.
These checks are unnecessary because any huge pages have been freed and
their PTEs cleared by the time page tables needed to map them are freed.
While some huge page sizes do populate the page table with multiple PTEs,
they are correctly cleared by huge_ptep_get_and_clear().
To verify this, libhugetlbfs tests were run for 64K, 8M, and 256M page
sizes with an instrumented kernel on a qemu guest modified to support the
256M page size. The same tests were used to verify no regressions after
applying this patch and were also run on x86 for both 2M and 1G page
sizes.
Link: https://lkml.kernel.org/r/20250716012611.10369-1-anthony.yznaga@oracle.com
Link: https://lkml.kernel.org/r/20250716012611.10369-2-anthony.yznaga@oracle.com
Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: Oscar Salvador <osalvador@suse.de>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexandre Ghiti <alexghiti@rivosinc.com>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: David Hildenbrand <david@redhat.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
If swap_writeout() returns AOP_WRITEPAGE_ACTIVATE (for example, because
zswap cannot compress and memcg disables writeback), there is no virtue in
keeping that folio in swap cache and holding the swap allocation:
shmem_writeout() switches it back to shmem page cache before returning.
The folio lock is held, and folio->memcg_data remains set throughout, so
there is no need to get into any memcg or memsw charge complications:
swap_free_nr() and delete_from_swap_cache() do as much as is needed (but
beware the race with shmem_free_swap() when the inode is truncated or
evicted).
Doing the same for an anonymous folio is harder, since it will usually
have been unmapped, with references to the swap left in the page tables.
Adding a function to remap the folio would be fun, but not worthwhile
unless it has other uses, or an urgent bug with anon is demonstrated.
[hughd@google.com: use shmem_recalc_inode() rather than open coding, per Baolin]
Link: https://lkml.kernel.org/r/101a7d89-290c-545d-8a6d-b1174ed8b1e5@google.com
Link: https://lkml.kernel.org/r/5c911f7a-af7a-5029-1dd4-2e00b66d565c@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Tested-by: David Rientjes <rientjes@google.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Barry Song <21cnbao@gmail.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Kairui Song <ryncsn@gmail.com>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
A flamegraph (from an MGLRU load) showed shmem_writeout()'s use of the
global shmem_swaplist_mutex worryingly hot: improvement is long overdue.
3.1 commit 6922c0c7abd3 ("tmpfs: convert shmem_writepage and enable swap")
apologized for extending shmem_swaplist_mutex across add_to_swap_cache(),
and hoped to find another way: yes, there may be lots of work to allocate
radix tree nodes in there. Then 6.15 commit b487a2da3575 ("mm, swap:
simplify folio swap allocation") will have made it worse, by moving
shmem_writeout()'s swap allocation under that mutex too (but the worrying
flamegraph was observed even before that change).
There's a useful comment about pagelock no longer protecting from eviction
once moved to swap cache: but it holds good until
shmem_delete_from_page_cache() replaces the page pointer by a swap entry,
so move the swaplist add between them.
We would much prefer to take the global lock once per inode than once per
page: given the possible races with shmem_unuse() pruning when !swapped
(and other tasks racing to swap other pages out or in), try the swaplist
add whenever swapped was incremented from 0 (but inode may already be on
the list - only unuse and evict bother to remove it).
This technique is more subtle than it looks (we're avoiding the very lock
which would make it easy), but works: whereas an unlocked list_empty()
check runs a risk of the inode being unqueued and left off the swaplist
forever, swapoff only completing when the page is faulted in or removed.
The need for a sleepable mutex went away in 5.1 commit b56a2d8af914 ("mm:
rid swapoff of quadratic complexity"): a spinlock works better now.
This commit is certain to take shmem_swaplist_mutex out of contention, and
has been seen to make a practical improvement (but there is likely to have
been an underlying issue which made its contention so visible).
Link: https://lkml.kernel.org/r/87beaec6-a3b0-ce7a-c892-1e1e5bd57aa3@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Tested-by: David Rientjes <rientjes@google.com>
Reviewed-by: Kairui Song <kasong@tencent.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Barry Song <21cnbao@gmail.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Now that we have added the ability to move multiple VMAs at once, assert
that this functions correctly, both overwriting VMAs and moving backwards
and forwards with merge and VMA invalidation.
Additionally assert that page tables are correctly propagated by setting
random data and reading it back.
Link: https://lkml.kernel.org/r/139074a24a011ca4ed52498a7fa2080024b43917.1752770784.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Historically we've made it a uAPI requirement that mremap() may only
operate on a single VMA at a time.
For instances where VMAs need to be resized, this makes sense, as it
becomes very difficult to determine what a user actually wants should they
indicate a desire to expand or shrink the size of multiple VMAs (truncate?
Adjust sizes individually? Some other strategy?).
However, in instances where a user is moving VMAs, it is restrictive to
disallow this.
This is especially the case since an anonymous mapping remap may or may
not be mergeable depending on whether VMAs have or have not been faulted,
due to anon_vma assignment and folio index alignment with vma->vm_pgoff.
Often this can result in surprising impact where a moved region is
faulted, then moved back and a user fails to observe a merge from
otherwise compatible, adjacent VMAs.
This change allows such cases to work without the user having to be
cognizant of whether a prior mremap() move or other VMA operations have
resulted in VMA fragmentation.
We only permit this for mremap() operations that do NOT change the size of
the VMA and DO specify MREMAP_MAYMOVE | MREMAP_FIXED.
Should no VMA exist in the range, -EFAULT is returned as usual.
If a VMA move spans a single VMA - then there is no functional change.
Otherwise, we place additional requirements upon VMAs:
* They must not have a userfaultfd context associated with them - this
requires dropping the lock to notify users, and we want to perform the
operation with the mmap write lock held throughout.
* If file-backed, they cannot have a custom get_unmapped_area handler -
this might result in MREMAP_FIXED not being honoured, which could result
in unexpected positioning of VMAs in the moved region.
There may be gaps in the range of VMAs that are moved:
X Y X Y
<---> <-> <---> <->
|-------| |-----| |-----| |-------| |-----| |-----|
| A | | B | | C | ---> | A' | | B' | | C' |
|-------| |-----| |-----| |-------| |-----| |-----|
addr new_addr
The move will preserve the gaps between each VMA.
Note that any failures encountered will result in a partial move. Since
an mremap() can fail at any time, this might result in only some of the
VMAs being moved.
Note that failures are very rare and typically require an out-of-memory
condition or a mapping limit condition to be hit, assuming the VMAs being
moved are valid.
We don't try to assess ahead of time whether VMAs are valid according to
the multi VMA rules, as it would be rather unusual for a user to mix
uffd-enabled VMAs and/or VMAs which map unusual driver mappings that
specify custom get_unmapped_area() handlers in an aggregate operation.
So we optimise for the far, far more likely case of the operation being
entirely permissible.
In the case of the move of a single VMA, the above conditions are not
imposed (such VMAs remain permitted). This makes the behaviour for a
single VMA identical to before.
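For illustration, a user-space sketch of the newly permitted operation
(it assumes a kernel with this series and, for brevity, a 4KiB page size;
mmap() error checks are trimmed):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
            size_t pg = 4096;
            /* build two anonymous VMAs, A and C, with a one-page gap */
            char *src = mmap(NULL, 3 * pg, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            munmap(src + pg, pg);
            src[0] = 'A';
            src[2 * pg] = 'C';

            /* reserve a destination address, then free it for MREMAP_FIXED */
            char *dst = mmap(NULL, 3 * pg, PROT_NONE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            munmap(dst, 3 * pg);

            /* same size, MREMAP_MAYMOVE | MREMAP_FIXED: both VMAs (and
             * the gap between them) are moved in a single call */
            if (mremap(src, 3 * pg, 3 * pg,
                       MREMAP_MAYMOVE | MREMAP_FIXED, dst) == MAP_FAILED) {
                    perror("mremap");
                    return 1;
            }
            printf("%c %c\n", dst[0], dst[2 * pg]); /* prints: A C */
            return 0;
    }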
Link: https://lkml.kernel.org/r/8cab2f2c202c4208bdfdb562635748bea6eb37bf.1752770784.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
When an mlock()'d VMA is expanded, we need to populate the expanded
region to maintain the contract that all mlock()'d memory is present
(albeit with some period after the mmap unlock where the expanded part of
the mapping remains unfaulted).
The current implementation is very unclear, so make it absolutely explicit
under what circumstances we do this.
Link: https://lkml.kernel.org/r/2358b0006baa9cab83db4259817794f16fe1992e.1752770784.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Group parameter check logic together, moving check_mremap_params() next to
it.
This puts all such checks into a single place, and invokes them early so
we can simply bail out as soon as we are aware that a condition is not
met.
No functional change intended.
Link: https://lkml.kernel.org/r/4d0669c23531629d8ead42aa701c6237bd6bf012.1752770784.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
When we expand or move a VMA, this requires a number of additional checks
to be performed.
Make it really obvious under what circumstances these checks must be
performed and aggregate all the checks in one place by invoking this in
check_prep_vma().
We have to adjust the checks to account for shrink + move operations by
checking new_len <= old_len rather than new_len == old_len.
No functional change intended.
[lorenzo.stoakes@oracle.com: allow undocumented mremap() shrink behaviour]
Link: https://lkml.kernel.org/r/8fc92a38-c636-465e-9a2f-2c6ac9cb49b8@lucifer.local
Link: https://lkml.kernel.org/r/8b4161ce074901e00602a446d81f182db92b0430.1752770784.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Right now it appears that the code is relying upon the returned
destination address having bits outside PAGE_MASK to indicate whether an
error value is specified, and decrementing the increased refcount on the
uffd ctx if so.
This is not a safe means of determining an error value, so instead, be
specific. It makes far more sense to do so in a dedicated error path, so
add mremap_userfaultfd_fail() for this purpose and use this when an error
arises.
A vm_userfaultfd_ctx is not established until we are at the point where
mremap_userfaultfd_prep() is invoked in copy_vma_and_data(), so this is a
no-op until this happens.
That is - uffd remap notification only occurs if the VMA is actually moved
- at which point a UFFD_EVENT_REMAP event is raised.
No errors can occur after this point currently, though it's certainly not
guaranteed this will always remain the case, and we mustn't rely on this.
However, the reason for needing to handle this case is that, when an error
arises on a VMA move at the point of adjusting page tables, we revert this
operation, and propagate the error.
At this point, it is not correct to raise a uffd remap event, and we must
handle it.
This refactoring makes it abundantly clear what we are doing.
We assume vrm->new_addr is always valid, which a prior change made the
case even for mremap() invocations which don't move the VMA. However,
given that no uffd context would be set up in this case, it's immaterial
to this change anyway.
No functional change intended.
Link: https://lkml.kernel.org/r/a70e8a1f7bce9f43d1431065b414e0f212297297.1752770784.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Separate out the uffd bits so it's clear what's happening.
Don't bother setting vrm->mmap_locked after unlocking, because after this
we are done anyway.
The only time we drop the mmap lock is on VMA shrink, at which point
vrm->new_len will be < vrm->old_len and the operation will not be
performed anyway, so move this code out of the if (vrm->mmap_locked)
block.
All addresses returned by mremap() are page-aligned, so the
offset_in_page() check on ret seems only to be incorrectly trying to
detect whether an error occurred - explicitly check for this.
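The shape of that change (a sketch, not the verbatim hunk):

    -	if (offset_in_page(ret))  /* infer an error from unaligned bits */
    +	if (IS_ERR_VALUE(ret))    /* check for an error explicitly */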
No functional change intended.
Link: https://lkml.kernel.org/r/ebb8f29650b8e343fe98fefc67b3a61a24d1e0f1.1752770784.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Rather than lumping everything together in do_mremap(), add a new helper
function, check_prep_vma(), to do the work relating to each VMA.
This further lays groundwork for subsequent patches which will allow for
batched VMA mremap().
Additionally, if we set vrm->new_addr == vrm->addr when prepping the VMA,
this avoids us needing to do so in the expand VMA mlocked case.
No functional change intended.
Link: https://lkml.kernel.org/r/15efa3c57935f7f8894094b94c1803c2f322c511.1752770784.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
We are currently checking some things later, and some things immediately.
Aggregate the checks and avoid ones that need not be made.
Simplify things by aligning lengths immediately. Defer setting the delta
parameter until later, which removes some duplicate code in the hugetlb
case.
We can safely perform the checks moved from mremap_to() to
check_mremap_params() because:
* If we set a new address via vrm_set_new_addr(), then this is guaranteed
to not overlap nor to position the new VMA past TASK_SIZE, so there's no
need to check these later.
* We can simply page align lengths immediately. We do not need to check for
overlap nor TASK_SIZE sanity after hugetlb alignment as this asserts
addresses are huge-aligned, then huge-aligns lengths, rounding down. This
means any existing overlap would have already been caught.
Moving things around like this lays the groundwork for subsequent changes
to permit operations on batches of VMAs.
No functional change intended.
Link: https://lkml.kernel.org/r/c862d625c98b1abd861c406f2bfad8baf3287f83.1752770784.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "mm/mremap: permit mremap() move of multiple VMAs", v4.
Historically we've made it a uAPI requirement that mremap() may only
operate on a single VMA at a time.
For instances where VMAs need to be resized, this makes sense, as it
becomes very difficult to determine what a user actually wants should they
indicate a desire to expand or shrink the size of multiple VMAs (truncate?
Adjust sizes individually? Some other strategy?).
However, in instances where a user is moving VMAs, it is restrictive to
disallow this.
This is especially the case when anonymous mapping remap may or may not be
mergeable depending on whether VMAs have or have not been faulted due to
anon_vma assignment and folio index alignment with vma->vm_pgoff.
Often this can result in surprising impact where a moved region is faulted,
then moved back and a user fails to observe a merge from otherwise
compatible, adjacent VMAs.
This change allows such cases to work without the user having to be
cognizant of whether a prior mremap() move or other VMA operations has
resulted in VMA fragmentation.
In order to do this, this series performs a large amount of refactoring,
most pertinently - grouping sanity checks together, separately those that
check input parameters and those relating to VMAs.
We also simplify the post-mmap lock drop processing for uffd and mlock()'d
VMAs.
With this done, we can then fairly straightforwardly implement this
functionality.
This works exclusively for mremap() invocations which specify
MREMAP_FIXED. It is not compatible with VMAs which use userfaultfd, as the
notification of the userland fault handler would require us to drop the
mmap lock.
It is also not compatible with file-backed mappings with customised
get_unmapped_area() handlers as these may not honour MREMAP_FIXED.
The input and output addresses ranges must not overlap. We carefully
account for moves which would result in VMA iterator invalidation.
While there can be gaps between VMAs in the input range, there can be no
gap before the first VMA in the range.
This patch (of 10):
We const-ify the vrm flags parameter to indicate this will never change.
We rename resize_is_valid() to remap_is_valid(), as this function does not
only apply to cases where we resize, so it's simply confusing to refer to
that here.
We remove the BUG() from mremap_at(), as we should not BUG() unless we are
certain it'll result in system instability.
We rename vrm_charge() to vrm_calc_charge() to make it clear this simply
calculates the charged number of pages rather than actually adjusting any
state.
We update the comment for vrm_implies_new_addr() to explain that
MREMAP_DONTUNMAP does not require a set address, but will always be moved.
Additionally consistently use 'res' rather than 'ret' for result values.
No functional change intended.
Link: https://lkml.kernel.org/r/cover.1752770784.git.lorenzo.stoakes@oracle.com
Link: https://lkml.kernel.org/r/d35ad8ce6b2c33b2f2f4ef7ec415f04a35cba34f.1752770784.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The single instance in which we use this function doesn't actually need to
change VMA flags, so remove this parameter and update the caller
accordingly.
[lorenzo.stoakes@oracle.com: correct comment]
Link: https://lkml.kernel.org/r/77f45b2e-a748-4635-9381-a5051091087f@lucifer.local
Link: https://lkml.kernel.org/r/20250714135839.178032-1-lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Dropping a lock, just to demand it again for an afterthought, cannot be
good if contended: convert lru_note_cost() to lru_note_cost_unlock_irq().
[hughd@google.com: delete unneeded comment]
Link: https://lkml.kernel.org/r/dbf9352a-1ed9-a021-c0c7-9309ac73e174@google.com
Link: https://lkml.kernel.org/r/21100102-51b6-79d5-03db-1bb7f97fa94c@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Tested-by: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Cc: David Hildenbrand <david@redhat.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
In try_to_inc_min_seq(), if min_seq[type] has not increased (in other
words, min_seq[type] == lrugen->min_seq[type]), then we should return
directly to avoid unnecessary overhead later.
Corollary: if min_seq[type] of neither anonymous nor file has increased,
try_to_inc_min_seq() will fail.
Proof:
It is known that min_seq[type] has not increased, that is, min_seq[type]
is equal to lrugen->min_seq[type]. Then one of the following holds:
case 1: min_seq[type] has not been reassigned before the judgment
min_seq[type] <= lrugen->min_seq[type].
Then the subsequent judgment min_seq[type] <= lrugen->min_seq[type]
will always be true.
case 2: min_seq[type] is reassigned to seq before the judgment
min_seq[type] <= lrugen->min_seq[type].
Then the condition min_seq[type] > seq must have been met before
min_seq[type] was reassigned to seq.
That is to say (since min_seq[type] == lrugen->min_seq[type]), before
the reassignment lrugen->min_seq[type] > seq was met, and then
min_seq[type] = seq.
Then the following judgment min_seq[type] (== seq) <=
lrugen->min_seq[type] is always true.
Therefore, in try_to_inc_min_seq(), if min_seq[type] of neither anonymous
nor file has increased, we can return false directly to avoid unnecessary
overhead.
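A sketch of the early return (illustrative; it assumes the existing
LRU_GEN_ANON/LRU_GEN_FILE type indices and the function's min_seq local):

    /* Nothing to do if neither type's min_seq moved: the later per-type
     * comparisons against lrugen->min_seq can then never succeed. */
    if (min_seq[LRU_GEN_ANON] == lrugen->min_seq[LRU_GEN_ANON] &&
        min_seq[LRU_GEN_FILE] == lrugen->min_seq[LRU_GEN_FILE])
            return false;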
Link: https://lkml.kernel.org/r/20250703023946.65315-1-jiahao.kernel@gmail.com
Signed-off-by: Hao Jia <jiahao1@lixiang.com>
Suggested-by: Yuanchu Xie <yuanchu@google.com>
Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kinsey Ho <kinseyho@google.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Both callers of set_page_owner_migrate_reason() use folios. Convert the
function to take a folio directly and move the &folio->page conversion
inside __set_page_owner_migrate_reason().
Link: https://lkml.kernel.org/r/20250711145910.90135-1-sidhartha.kumar@oracle.com
Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
strcpy() is deprecated; use memcpy() instead.
Not copying the NUL terminator is safe because strncpy_from_user() would
overwrite it anyway by appending uname to the destination buffer at index
MFD_NAME_PREFIX_LEN.
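The shape of the change in mm/memfd.c (a sketch, not the verbatim hunk):

    -	strcpy(name, MFD_NAME_PREFIX);
    +	memcpy(name, MFD_NAME_PREFIX, MFD_NAME_PREFIX_LEN);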
No functional changes intended.
Link: https://github.com/KSPP/linux/issues/88
Link: https://lkml.kernel.org/r/20250712174516.64243-2-thorsten.blum@linux.dev
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
All damon_callback usages are replicated by damon_call() and damos_walk().
Time to say goodbye. Remove damon_callback.
Link: https://lkml.kernel.org/r/20250712195016.151108-15-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
DAMON core layer does target cleanup on its own. Remove duplicated and
unnecessarily selective cleanup attempts in DAMON sysfs interface.
Link: https://lkml.kernel.org/r/20250712195016.151108-14-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
When kdamond_fn() completes, the targets are kept. They are kept to let
callers do additional cleanups if they need to. There are no such
additional cleanups, though. DAMON sysfs interface deallocates them in
the before_terminate() callback, to reduce unnecessary memory usage, for
the [f]vaddr use case. Just destroy the targets for every case in the
core layer. This saves more memory and simplifies the logic.
Link: https://lkml.kernel.org/r/20250712195016.151108-13-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The function was introduced for putting pids and deallocating unnecessary
targets. Hence it is called before damon_destroy_ctx(). Now vaddr puts
the pid for each target on destruction (cleanup_target()), and
damon_destroy_ctx() deallocates the targets anyway. So
damon_sysfs_destroy_targets() has no reason to exist. Remove it.
Link: https://lkml.kernel.org/r/20250712195016.151108-12-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Implement the cleanup_target() callback for [f]vaddr, which calls
put_pid() for each target that will be destroyed. Also remove the now
redundant put_pid() calls in the core, sysfs and sample modules, which
were required due to the lack of such self-cleanup in vaddr.
Link: https://lkml.kernel.org/r/20250712195016.151108-11-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Some DAMON operation sets may need additional cleanup per target. For
example, [f]vaddr needs to put the pid of each target. Each user and the
core logic are doing that redundantly. Add another DAMON ops callback
that will be used for doing such cleanups in the operations set layer.
[sj@kernel.org: add kernel-doc comment for damon_operations->cleanup_target]
Link: https://lkml.kernel.org/r/20250715185239.89152-2-sj@kernel.org
[sj@kernel.org: remove damon_ctx->callback kernel-doc comment]
Link: https://lkml.kernel.org/r/20250715185239.89152-3-sj@kernel.org
Link: https://lkml.kernel.org/r/20250712195016.151108-10-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
damon_operations.cleanup() is documented to be called on kdamond
termination, but it is also called on targets destruction, which is done
on any damon_ctx destruction. Nobody is using the callback for now,
though. Remove the cleanup() call under the destruction.
Link: https://lkml.kernel.org/r/20250712195016.151108-9-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
wsse uses damon_callback for periodically reading DAMON internal data.
Use its alternative, damon_call() repeat mode.
Link: https://lkml.kernel.org/r/20250712195016.151108-8-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
prcl uses damon_callback for periodically reading DAMON internal data.
Use its alternative, damon_call() repeat mode.
Link: https://lkml.kernel.org/r/20250712195016.151108-7-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
DAMON_LRU_SORT uses damon_callback for periodically reading and writing
DAMON internal data and parameters. Use its alternative, damon_call()
repeat mode.
Link: https://lkml.kernel.org/r/20250712195016.151108-6-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
DAMON_RECLAIM uses damon_callback for periodically reading and writing
DAMON internal data and parameters. Use its alternative, damon_call()
repeat mode.
Link: https://lkml.kernel.org/r/20250712195016.151108-5-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
DAMON_STAT uses damon_callback for periodically reading DAMON internal
data. Use its alternative, damon_call() repeat mode.
Link: https://lkml.kernel.org/r/20250712195016.151108-4-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
damon_call() can be useful for reading or writing DAMON internal data one
time. A common pattern of DAMON core usage from DAMON modules is doing
such reads and writes repeatedly, for example, to periodically update the
DAMOS stats. To do that with damon_call(), callers should call
damon_call() repeatedly, with their own delay loop. Each caller doing
that is repetitive. Introduce a repeat mode of damon_call(). Callers can
use the mode by setting a new field in damon_call_control. If the mode is
turned on, damon_call() returns success immediately, and DAMON repeats
invoking the callback function inside the kdamond main loop.
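An illustrative caller (a sketch; the 'repeat' field name and the module
callback are assumptions used for illustration):

    static int my_module_call_fn(void *data)
    {
            /* periodically executed from the kdamond main loop */
            return 0;
    }

    static struct damon_call_control my_call_control = {
            .fn = my_module_call_fn,
            .repeat = true, /* the new repeat-mode field (assumed name) */
    };

    /* returns immediately; kdamond keeps invoking my_module_call_fn() */
    damon_call(ctx, &my_call_control);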
Link: https://lkml.kernel.org/r/20250712195016.151108-3-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "mm/damon: remove damon_callback".
damon_callback was the only way for communicating with DAMON for contexts
running on its worker thread. The interface is flexible and simple. But
as DAMON evolves with more features, damon_callback has become somewhat
too old. With runtime parameters update, for example, its lack of
synchronization support was found to be inconvenient. Arguably it is also
not easy to use correctly, since callers should understand when each
callback is called, and the implications of the return values from the
callbacks.
To replace it, damon_call() and damos_walk() were introduced, and those
replaced a few damon_callback use cases. Some use cases of
damon_callback, such as parallel or repetitive DAMON internal data reading
and additional cleanups, cannot simply be replaced by damon_call() and
damos_walk(), though.
To make those replaceable, extend damon_call() for parallel and/or
repeated callbacks and modify the core/ops layers for additional resource
cleanup. With the updates, replace the remaining damon_callback usages
and finally say goodbye to damon_callback.
This patch (of 14):
Calling damon_call() while it is serving another parallel thread
immediately fails with -EBUSY. The caller should then call it again
later. Each caller implementing such retry logic would be redundant.
Accept parallel damon_call() requests and do the waiting on behalf of the
caller instead.
Link: https://lkml.kernel.org/r/20250712195016.151108-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20250712195016.151108-2-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Set min_brk to mm->start_brk by default, and override it with mm->end_data
only when CONFIG_COMPAT_BRK is enabled and brk_randomized is false.
This makes the logic clearer with no functional change.
Link: https://lkml.kernel.org/r/20250710025859.926355-1-liuqiye2025@163.com
Signed-off-by: Xuanye Liu <liuqiye2025@163.com>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Jann Horn <jannh@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
folio_nr_pages() is a faster helper function for getting the number of
pages when NR_PAGES_IN_LARGE_FOLIO is enabled.
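The general pattern of such a conversion (a sketch, not the verbatim
hunk):

    -	nr_pages = 1 << folio_order(folio);
    +	nr_pages = folio_nr_pages(folio);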
Link: https://lkml.kernel.org/r/20250710060451.3535957-1-chizhiling@163.com
Signed-off-by: Chi Zhiling <chizhiling@kylinos.cn>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
When pmd_to_hmm_pfn_flags() is unused, it prevents kernel builds with
clang, `make W=1` and CONFIG_TRANSPARENT_HUGEPAGE=n:
mm/hmm.c:186:29: warning: unused function 'pmd_to_hmm_pfn_flags' [-Wunused-function]
Fix this by moving the function into the respective existing ifdeffery
for its only user.
See also:
6863f5643dd7 ("kbuild: allow Clang to find unused static inline functions for W=1 build")
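The resulting arrangement (a sketch; the function body is quoted from
memory and may not match mm/hmm.c verbatim):

    #ifdef CONFIG_TRANSPARENT_HUGEPAGE
    /* only used by hmm_vma_handle_pmd(), which lives under this guard */
    static inline unsigned long pmd_to_hmm_pfn_flags(struct hmm_range *range,
                                                     pmd_t pmd)
    {
            if (pmd_protnone(pmd))
                    return 0;
            return (pmd_write(pmd) ? (HMM_PFN_VALID | HMM_PFN_WRITE) :
                                     HMM_PFN_VALID) |
                   hmm_pfn_flags_order(PMD_SHIFT - PAGE_SHIFT);
    }
    #endif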
Link: https://lkml.kernel.org/r/20250710082403.664093-1-andriy.shevchenko@linux.intel.com
Fixes: 992de9a8b751 ("mm/hmm: allow to mirror vma of a file on a DAX backed filesystem")
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Cc: Andriy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Bill Wendling <morbo@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Justin Stitt <justinstitt@google.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
This adds support for allowing proactive reclaim in general on a NUMA
system. A per-node interface extends support beyond a memcg-specific
interface, respecting the current semantics of memory.reclaim: respecting
the aging LRU and not supporting artificially triggering eviction on nodes
belonging to non-bottom tiers.
This patch allows userspace to do:
echo "512M swappiness=10" > /sys/devices/system/node/nodeX/reclaim
One of the premises for this is to semantically align as best as possible
with memory.reclaim. During a brief time memcg did support nodemask until
55ab834a86a9 (Revert "mm: add nodes= arg to memory.reclaim"), for which
semantics around reclaim (eviction) vs demotion were not clear, rendering
charging expectations to be broken.
With this approach:
1. Users who do not use memcg can benefit from proactive reclaim. The
memcg interface is not NUMA aware and there are usecases that are
focusing on NUMA balancing rather than workload memory footprint.
2. Proactive reclaim on top tiers will trigger demotion, for which
memory is still byte-addressable. Reclaiming on the bottom nodes will
trigger evicting to swap (the traditional sense of reclaim). This
follows the semantics of what is today part of the aging process on
tiered memory, mirroring what every other form of reclaim does
(reactive and memcg proactive reclaim). Furthermore per-node proactive
reclaim is not as susceptible to the memcg charging problem mentioned
above.
3. Unlike the nodes= arg, this interface avoids confusing semantics,
such as what exactly the user wants when mixing top-tier and low-tier
nodes in the nodemask. Further per-node interface is less exposed to
"free up memory in my container" usecases, where eviction is intended.
4. Users that *really* want to free up memory can use proactive
reclaim on nodes known to be on the bottom tiers to force eviction in a
natural way - higher access latencies are still better than swap. If
compelled, while there are no guarantees and it is perhaps not worth the
effort, users could also potentially follow a ladder-like approach to
eventually free up the memory. Alternatively, perhaps an 'evict'
option could be added to the parameters for both memory.reclaim and the
per-node interfaces to force this action unconditionally.
[akpm@linux-foundation.org: user_proactive_reclaim(): return -EBUSY on PGDAT_RECLAIM_LOCKED contention, per Roman]
[dave@stgolabs.net: memcg && node is also a bogus case, per Shakeel]
Link: https://lkml.kernel.org/r/20250717235604.2atyx2aobwowpge3@offworld
Link: https://lkml.kernel.org/r/20250623185851.830632-5-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
As this will be called from non page allocator paths for proactive
reclaim, allow users to pass the sc (scan control) and the number of
pages, and adjust the return value as well. No change in semantics.
Link: https://lkml.kernel.org/r/20250623185851.830632-4-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|