summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2025-03-05MAINTAINERS: .mailmap: update Sumit Garg's email addressSumit Garg
Update Sumit Garg's email address to @kernel.org. Link: https://lkml.kernel.org/r/20250227113228.1809449-1-sumit.garg@linaro.org Signed-off-by: Sumit Garg <sumit.garg@linaro.org> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Jens Wiklander <jens.wiklander@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-05Revert "mm/page_alloc.c: don't show protection in zone's ->lowmem_reserve[] ↵Gabriel Krisman Bertazi
for empty zone" Commit 96a5c186efff ("mm/page_alloc.c: don't show protection in zone's ->lowmem_reserve[] for empty zone") removes the protection of lower zones from allocations targeting memory-less high zones. This had an unintended impact on the pattern of reclaims because it makes the high-zone-targeted allocation more likely to succeed in lower zones, which adds pressure to said zones. I.e, the following corresponding checks in zone_watermark_ok/zone_watermark_fast are less likely to trigger: if (free_pages <= min + z->lowmem_reserve[highest_zoneidx]) return false; As a result, we are observing an increase in reclaim and kswapd scans, due to the increased pressure. This was initially observed as increased latency in filesystem operations when benchmarking with fio on a machine with some memory-less zones, but it has since been associated with increased contention in locks related to memory reclaim. By reverting this patch, the original performance was recovered on that machine. The original commit was introduced as a clarification of the /proc/zoneinfo output, so it doesn't seem there are usecases depending on it, making the revert a simple solution. For reference, I collected vmstat with and without this patch on a freshly booted system running intensive randread io from an nvme for 5 minutes. I got: rpm-6.12.0-slfo.1.2 -> pgscan_kswapd 5629543865 Patched -> pgscan_kswapd 33580844 33M scans is similar to what we had in kernels predating this patch. These numbers is fairly representative of the workload on this machine, as measured in several runs. So we are talking about a 2-order of magnitude increase. Link: https://lkml.kernel.org/r/20250226032258.234099-1-krisman@suse.de Fixes: 96a5c186efff ("mm/page_alloc.c: don't show protection in zone's ->lowmem_reserve[] for empty zone") Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Baoquan He <bhe@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-05mm: fix finish_fault() handling for large foliosBrian Geffon
When handling faults for anon shmem finish_fault() will attempt to install ptes for the entire folio. Unfortunately if it encounters a single non-pte_none entry in that range it will bail, even if the pte that triggered the fault is still pte_none. When this situation happens the fault will be retried endlessly never making forward progress. This patch fixes this behavior and if it detects that a pte in the range is not pte_none it will fall back to setting a single pte. [bgeffon@google.com: tweak whitespace] Link: https://lkml.kernel.org/r/20250227133236.1296853-1-bgeffon@google.com Link: https://lkml.kernel.org/r/20250226162341.915535-1-bgeffon@google.com Fixes: 43e027e41423 ("mm: memory: extend finish_fault() to support large folio") Signed-off-by: Brian Geffon <bgeffon@google.com> Suggested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reported-by: Marek Maslanka <mmaslanka@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickens <hughd@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-05mm: don't skip arch_sync_kernel_mappings() in error pathsRyan Roberts
Fix callers that previously skipped calling arch_sync_kernel_mappings() if an error occurred during a pgtable update. The call is still required to sync any pgtable updates that may have occurred prior to hitting the error condition. These are theoretical bugs discovered during code review. Link: https://lkml.kernel.org/r/20250226121610.2401743-1-ryan.roberts@arm.com Fixes: 2ba3e6947aed ("mm/vmalloc: track which page-table levels were modified") Fixes: 0c95cba49255 ("mm: apply_to_pte_range warn and fail if a large pte is encountered") Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Christop Hellwig <hch@infradead.org> Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-05mm: shmem: remove unnecessary warning in shmem_writepage()Ricardo Cañuelo Navarro
Although the scenario where shmem_writepage() is called with info->flags & VM_LOCKED is unlikely to happen, it's still possible, as evidenced by syzbot [1]. However, the warning in this case isn't necessary because the situation is already handled correctly [2]. [2] https://lore.kernel.org/lkml/8afe1f7f-31a2-4fc0-1fbd-f9ba8a116fe3@google.com/ Link: https://lkml.kernel.org/r/20250226-20250221-warning-in-shmem_writepage-v1-1-5ad19420e17e@igalia.com Fixes: 9a976f0c847b ("shmem: skip page split if we're not reclaiming") Signed-off-by: Ricardo Cañuelo Navarro <rcn@igalia.com> Reported-by: Pengfei Xu <pengfei.xu@intel.com> Closes: https://lore.kernel.org/lkml/ZZ9PShXjKJkVelNm@xpf.sh.intel.com/ [1] Suggested-by: Hugh Dickins <hughd@google.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Florent Revest <revest@chromium.org> Cc: Christian Brauner <brauner@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Florent Revest <revest@chromium.org> Cc: Luis Chamberalin <mcgrof@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-05userfaultfd: fix PTE unmapping stack-allocated PTE copiesSuren Baghdasaryan
Current implementation of move_pages_pte() copies source and destination PTEs in order to detect concurrent changes to PTEs involved in the move. However these copies are also used to unmap the PTEs, which will fail if CONFIG_HIGHPTE is enabled because the copies are allocated on the stack. Fix this by using the actual PTEs which were kmap()ed. Link: https://lkml.kernel.org/r/20250226185510.2732648-3-surenb@google.com Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI") Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reported-by: Peter Xu <peterx@redhat.com> Reviewed-by: Peter Xu <peterx@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Barry Song <21cnbao@gmail.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-05userfaultfd: do not block on locking a large folio with raised refcountSuren Baghdasaryan
Lokesh recently raised an issue about UFFDIO_MOVE getting into a deadlock state when it goes into split_folio() with raised folio refcount. split_folio() expects the reference count to be exactly mapcount + num_pages_in_folio + 1 (see can_split_folio()) and fails with EAGAIN otherwise. If multiple processes are trying to move the same large folio, they raise the refcount (all tasks succeed in that) then one of them succeeds in locking the folio, while others will block in folio_lock() while keeping the refcount raised. The winner of this race will proceed with calling split_folio() and will fail returning EAGAIN to the caller and unlocking the folio. The next competing process will get the folio locked and will go through the same flow. In the meantime the original winner will be retried and will block in folio_lock(), getting into the queue of waiting processes only to repeat the same path. All this results in a livelock. An easy fix would be to avoid waiting for the folio lock while holding folio refcount, similar to madvise_free_huge_pmd() where folio lock is acquired before raising the folio refcount. Since we lock and take a refcount of the folio while holding the PTE lock, changing the order of these operations should not break anything. Modify move_pages_pte() to try locking the folio first and if that fails and the folio is large then return EAGAIN without touching the folio refcount. If the folio is single-page then split_folio() is not called, so we don't have this issue. Lokesh has a reproducer [1] and I verified that this change fixes the issue. [1] https://github.com/lokeshgidra/uffd_move_ioctl_deadlock [akpm@linux-foundation.org: reflow comment to 80 cols, s/end/end up/] Link: https://lkml.kernel.org/r/20250226185510.2732648-2-surenb@google.com Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI") Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reported-by: Lokesh Gidra <lokeshgidra@google.com> Reviewed-by: Peter Xu <peterx@redhat.com> Acked-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Barry Song <21cnbao@gmail.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-05mm: zswap: use ATOMIC_LONG_INIT to initialize zswap_stored_pagesSun YangKai
This is currently the only atomic_long_t variable initialized by ATOMIC_INIT macro found in the kernel by using `grep -r atomic_long_t | grep ATOMIC_INIT` This was introduced in 6e1fa555ec77, in which we modified the type of zswap_stored_pages to atomic_long_t, but didn't change the initialization. Link: https://lkml.kernel.org/r/20250226153253.19179-1-sunk67188@gmail.com Fixes: 6e1fa555ec77 ("mm: zswap: modify zswap_stored_pages to be atomic_long_t") Signed-off-by: Sun YangKai <sunk67188@gmail.com> Acked-by: Yosry Ahmed <yosry.ahmed@linux.dev> Acked-by: David Hildenbrand <david@redhat.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-05mm: shmem: fix potential data corruption during shmem swapinBaolin Wang
Alex and Kairui reported some issues (system hang or data corruption) when swapping out or swapping in large shmem folios. This is especially easy to reproduce when the tmpfs is mount with the 'huge=within_size' parameter. Thanks to Kairui's reproducer, the issue can be easily replicated. The root cause of the problem is that swap readahead may asynchronously swap in order 0 folios into the swap cache, while the shmem mapping can still store large swap entries. Then an order 0 folio is inserted into the shmem mapping without splitting the large swap entry, which overwrites the original large swap entry, leading to data corruption. When getting a folio from the swap cache, we should split the large swap entry stored in the shmem mapping if the orders do not match, to fix this issue. Link: https://lkml.kernel.org/r/2fe47c557e74e9df5fe2437ccdc6c9115fa1bf70.1740476943.git.baolin.wang@linux.alibaba.com Fixes: 809bc86517cc ("mm: shmem: support large folio swap out") Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reported-by: Alex Xu (Hello71) <alex_y_xu@yahoo.ca> Reported-by: Kairui Song <ryncsn@gmail.com> Closes: https://lore.kernel.org/all/1738717785.im3r5g2vxc.none@localhost/ Tested-by: Kairui Song <kasong@tencent.com> Cc: David Hildenbrand <david@redhat.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Matthew Wilcow <willy@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-05mm: fix kernel BUG when userfaultfd_move encounters swapcacheBarry Song
userfaultfd_move() checks whether the PTE entry is present or a swap entry. - If the PTE entry is present, move_present_pte() handles folio migration by setting: src_folio->index = linear_page_index(dst_vma, dst_addr); - If the PTE entry is a swap entry, move_swap_pte() simply copies the PTE to the new dst_addr. This approach is incorrect because, even if the PTE is a swap entry, it can still reference a folio that remains in the swap cache. This creates a race window between steps 2 and 4. 1. add_to_swap: The folio is added to the swapcache. 2. try_to_unmap: PTEs are converted to swap entries. 3. pageout: The folio is written back. 4. Swapcache is cleared. If userfaultfd_move() occurs in the window between steps 2 and 4, after the swap PTE has been moved to the destination, accessing the destination triggers do_swap_page(), which may locate the folio in the swapcache. However, since the folio's index has not been updated to match the destination VMA, do_swap_page() will detect a mismatch. This can result in two critical issues depending on the system configuration. If KSM is disabled, both small and large folios can trigger a BUG during the add_rmap operation due to: page_pgoff(folio, page) != linear_page_index(vma, address) [ 13.336953] page: refcount:6 mapcount:1 mapping:00000000f43db19c index:0xffffaf150 pfn:0x4667c [ 13.337520] head: order:2 mapcount:1 entire_mapcount:0 nr_pages_mapped:1 pincount:0 [ 13.337716] memcg:ffff00000405f000 [ 13.337849] anon flags: 0x3fffc0000020459(locked|uptodate|dirty|owner_priv_1|head|swapbacked|node=0|zone=0|lastcpupid=0xffff) [ 13.338630] raw: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361 [ 13.338831] raw: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000 [ 13.339031] head: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361 [ 13.339204] head: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000 [ 13.339375] head: 03fffc0000000202 fffffdffc0199f01 ffffffff00000000 0000000000000001 [ 13.339546] head: 0000000000000004 0000000000000000 00000000ffffffff 0000000000000000 [ 13.339736] page dumped because: VM_BUG_ON_PAGE(page_pgoff(folio, page) != linear_page_index(vma, address)) [ 13.340190] ------------[ cut here ]------------ [ 13.340316] kernel BUG at mm/rmap.c:1380! [ 13.340683] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP [ 13.340969] Modules linked in: [ 13.341257] CPU: 1 UID: 0 PID: 107 Comm: a.out Not tainted 6.14.0-rc3-gcf42737e247a-dirty #299 [ 13.341470] Hardware name: linux,dummy-virt (DT) [ 13.341671] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 13.341815] pc : __page_check_anon_rmap+0xa0/0xb0 [ 13.341920] lr : __page_check_anon_rmap+0xa0/0xb0 [ 13.342018] sp : ffff80008752bb20 [ 13.342093] x29: ffff80008752bb20 x28: fffffdffc0199f00 x27: 0000000000000001 [ 13.342404] x26: 0000000000000000 x25: 0000000000000001 x24: 0000000000000001 [ 13.342575] x23: 0000ffffaf0d0000 x22: 0000ffffaf0d0000 x21: fffffdffc0199f00 [ 13.342731] x20: fffffdffc0199f00 x19: ffff000006210700 x18: 00000000ffffffff [ 13.342881] x17: 6c203d2120296567 x16: 6170202c6f696c6f x15: 662866666f67705f [ 13.343033] x14: 6567617028454741 x13: 2929737365726464 x12: ffff800083728ab0 [ 13.343183] x11: ffff800082996bf8 x10: 0000000000000fd7 x9 : ffff80008011bc40 [ 13.343351] x8 : 0000000000017fe8 x7 : 00000000fffff000 x6 : ffff8000829eebf8 [ 13.343498] x5 : c0000000fffff000 x4 : 0000000000000000 x3 : 0000000000000000 [ 13.343645] x2 : 0000000000000000 x1 : ffff0000062db980 x0 : 000000000000005f [ 13.343876] Call trace: [ 13.344045] __page_check_anon_rmap+0xa0/0xb0 (P) [ 13.344234] folio_add_anon_rmap_ptes+0x22c/0x320 [ 13.344333] do_swap_page+0x1060/0x1400 [ 13.344417] __handle_mm_fault+0x61c/0xbc8 [ 13.344504] handle_mm_fault+0xd8/0x2e8 [ 13.344586] do_page_fault+0x20c/0x770 [ 13.344673] do_translation_fault+0xb4/0xf0 [ 13.344759] do_mem_abort+0x48/0xa0 [ 13.344842] el0_da+0x58/0x130 [ 13.344914] el0t_64_sync_handler+0xc4/0x138 [ 13.345002] el0t_64_sync+0x1ac/0x1b0 [ 13.345208] Code: aa1503e0 f000f801 910f6021 97ff5779 (d4210000) [ 13.345504] ---[ end trace 0000000000000000 ]--- [ 13.345715] note: a.out[107] exited with irqs disabled [ 13.345954] note: a.out[107] exited with preempt_count 2 If KSM is enabled, Peter Xu also discovered that do_swap_page() may trigger an unexpected CoW operation for small folios because ksm_might_need_to_copy() allocates a new folio when the folio index does not match linear_page_index(vma, addr). This patch also checks the swapcache when handling swap entries. If a match is found in the swapcache, it processes it similarly to a present PTE. However, there are some differences. For example, the folio is no longer exclusive because folio_try_share_anon_rmap_pte() is performed during unmapping. Furthermore, in the case of swapcache, the folio has already been unmapped, eliminating the risk of concurrent rmap walks and removing the need to acquire src_folio's anon_vma or lock. Note that for large folios, in the swapcache handling path, we directly return -EBUSY since split_folio() will return -EBUSY regardless if the folio is under writeback or unmapped. This is not an urgent issue, so a follow-up patch may address it separately. [v-songbaohua@oppo.com: minor cleanup according to Peter Xu] Link: https://lkml.kernel.org/r/20250226024411.47092-1-21cnbao@gmail.com Link: https://lkml.kernel.org/r/20250226001400.9129-1-21cnbao@gmail.com Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI") Signed-off-by: Barry Song <v-songbaohua@oppo.com> Acked-by: Peter Xu <peterx@redhat.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Brian Geffon <bgeffon@google.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Nicolas Geoffray <ngeoffray@google.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: ZhangPeng <zhangpeng362@huawei.com> Cc: Tangquan Zheng <zhengtangquan@oppo.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-05selftests/damon/damon_nr_regions: sort collected regiosn before checking ↵SeongJae Park
with min/max boundaries damon_nr_regions.py starts DAMON, periodically collect number of regions in snapshots, and see if it is in the requested range. The check code assumes the numbers are sorted on the collection list, but there is no such guarantee. Hence this can result in false positive test success. Sort the list before doing the check. Link: https://lkml.kernel.org/r/20250225222333.505646-4-sj@kernel.org Fixes: 781497347d1b ("selftests/damon: implement test for min/max_nr_regions") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-05selftests/damon/damon_nr_regions: set ops update for merge results check to ↵SeongJae Park
100ms damon_nr_regions.py updates max_nr_regions to a number smaller than expected number of real regions and confirms DAMON respect the harsh limit. To give time for DAMON to make changes for the regions, 3 aggregation intervals (300 milliseconds) are given. The internal mechanism works with not only the max_nr_regions, but also sz_limit, though. It avoids merging region if that casn make region of size larger than sz_limit. In the test, sz_limit is set too small to achive the new max_nr_regions, unless it is updated for the new min_nr_regions. But the update is done only once per operations set update interval, which is one second by default. Hence, the test randomly incurs false positive failures. Fix it by setting the ops interval same to aggregation interval, to make sure sz_limit is updated by the time of the check. Link: https://lkml.kernel.org/r/20250225222333.505646-3-sj@kernel.org Fixes: 8bf890c81612 ("selftests/damon/damon_nr_regions: test online-tuned max_nr_regions") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-05selftests/damon/damos_quota: make real expectation of quota exceedsSeongJae Park
Patch series "selftests/damon: three fixes for false results". Fix three DAMON selftest bugs that cause two and one false positive failures and successes. This patch (of 3): damos_quota.py assumes the quota will always exceeded. But whether quota will be exceeded or not depend on the monitoring results. Actually the monitored workload has chaning access pattern and hence sometimes the quota may not really be exceeded. As a result, false positive test failures happen. Expect how much time the quota will be exceeded by checking the monitoring results, and use it instead of the naive assumption. Link: https://lkml.kernel.org/r/20250225222333.505646-1-sj@kernel.org Link: https://lkml.kernel.org/r/20250225222333.505646-2-sj@kernel.org Fixes: 51f58c9da14b ("selftests/damon: add a test for DAMOS quota") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-05include/linux/log2.h: mark is_power_of_2() with __always_inlineSu Hui
When building kernel with randconfig, there is an error: In function `kvm_is_cr4_bit_set',inlined from `kvm_update_cpuid_runtime' at arch/x86/kvm/cpuid.c:310:9: include/linux/compiler_types.h:542:38: error: call to `__compiletime_assert_380' declared with attribute error: BUILD_BUG_ON failed: !is_power_of_2(cr4_bit). '!is_power_of_2(X86_CR4_OSXSAVE)' is False, but gcc treats is_power_of_2() as non-inline function and a compilation error happens. Fix this by marking is_power_of_2() with __always_inline. Link: https://lkml.kernel.org/r/20250221071624.1356899-1-suhui@nfschina.com Signed-off-by: Su Hui <suhui@nfschina.com> Cc: Binbin Wu <binbin.wu@linux.intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-05NFS: fix nfs_release_folio() to not deadlock via kcompactd writebackMike Snitzer
Add PF_KCOMPACTD flag and current_is_kcompactd() helper to check for it so nfs_release_folio() can skip calling nfs_wb_folio() from kcompactd. Otherwise NFS can deadlock waiting for kcompactd enduced writeback which recurses back to NFS (which triggers writeback to NFSD via NFS loopback mount on the same host, NFSD blocks waiting for XFS's call to __filemap_get_folio): 6070.550357] INFO: task kcompactd0:58 blocked for more than 4435 seconds. {--- [58] "kcompactd0" [<0>] folio_wait_bit+0xe8/0x200 [<0>] folio_wait_writeback+0x2b/0x80 [<0>] nfs_wb_folio+0x80/0x1b0 [nfs] [<0>] nfs_release_folio+0x68/0x130 [nfs] [<0>] split_huge_page_to_list_to_order+0x362/0x840 [<0>] migrate_pages_batch+0x43d/0xb90 [<0>] migrate_pages_sync+0x9a/0x240 [<0>] migrate_pages+0x93c/0x9f0 [<0>] compact_zone+0x8e2/0x1030 [<0>] compact_node+0xdb/0x120 [<0>] kcompactd+0x121/0x2e0 [<0>] kthread+0xcf/0x100 [<0>] ret_from_fork+0x31/0x40 [<0>] ret_from_fork_asm+0x1a/0x30 ---} [akpm@linux-foundation.org: fix build] Link: https://lkml.kernel.org/r/20250225022002.26141-1-snitzer@kernel.org Fixes: 96780ca55e3c ("NFS: fix up nfs_release_folio() to try to release the page") Signed-off-by: Mike Snitzer <snitzer@kernel.org> Cc: Anna Schumaker <anna.schumaker@oracle.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-05mm, swap: avoid BUG_ON in relocate_cluster()Kemeng Shi
If allocation is racy with swapoff, we may call free_cluster for cluster already in free list and trigger BUG_ON() as following: Allocation Swapoff cluster_alloc_swap_entry ... /* may get a free cluster with offset */ offset = xxx; if (offset) ci = lock_cluster(si, offset); ... del_from_avail_list(p, true); si->flags &= ~SWP_WRITEOK; alloc_swap_scan_cluster(si, ci, ...) ... /* failed to alloc entry from free entry */ if (!cluster_alloc_range(...)) break; ... /* add back a free cluster */ relocate_cluster(si, ci); if (!ci->count) free_cluster(si, ci); VM_BUG_ON(ci->flags == CLUSTER_FLAG_FREE); To prevent the BUG_ON(), call free_cluster() for free cluster to move the cluster to tail of list. Check cluster is not free before calling free_cluster() in relocate_cluster() to avoid BUG_ON(). Link: https://lkml.kernel.org/r/20250222160850.505274-4-shikemeng@huaweicloud.com Fixes: 3b644773eefd ("mm, swap: reduce contention on device lock") Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Kairui Song <kasong@tencent.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-05mm: swap: use correct step in loop to wait all clusters in wait_for_allocation()Kemeng Shi
Use correct step in loop to wait all clusters in wait_for_allocation(). If we miss some cluster in wait_for_allocation(), use after free may occur as follows: shmem_writepage swapoff folio_alloc_swap get_swap_pages scan_swap_map_slots cluster_alloc_swap_entry alloc_swap_scan_cluster cluster_alloc_range /* SWP_WRITEOK is valid */ if (!(si->flags & SWP_WRITEOK)) ... del_from_avail_list(p, true); ... /* miss the cluster in shmem_writepage */ wait_for_allocation() ... try_to_unuse() memset(si->swap_map + start, usage, nr_pages); swap_range_alloc(si, nr_pages); ci->count += nr_pages; /* return a valid entry */ ... exit_swap_address_space(p->type); ... ... add_to_swap_cache /* dereference swap_address_space(entry) which is NULL */ xas_lock_irq(&xas); Link: https://lkml.kernel.org/r/20250222160850.505274-3-shikemeng@huaweicloud.com Fixes: 9a0ddeb79880 ("mm, swap: hold a reference during scan and cleanup flag usage") Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Kairui Song <kasong@tencent.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-05mm: swap: add back full cluster when no entry is reclaimedKemeng Shi
If no swap cache is reclaimed, cluster taken off from full_clusters list will not be put in any list and we can't reclaime HAS_CACHE slots efficiently. Do relocate_cluster for such cluster to avoid inefficiency. Link: https://lkml.kernel.org/r/20250224113910.522439-1-shikemeng@huaweicloud.com Fixes: 3b644773eefd ("mm, swap: reduce contention on device lock") Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Kairui Song <kasong@tencent.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-05mm: abort vma_modify() on merge out of memory failureLorenzo Stoakes
The remainder of vma_modify() relies upon the vmg state remaining pristine after a merge attempt. Usually this is the case, however in the one edge case scenario of a merge attempt failing not due to the specified range being unmergeable, but rather due to an out of memory error arising when attempting to commit the merge, this assumption becomes untrue. This results in vmg->start, end being modified, and thus the proceeding attempts to split the VMA will be done with invalid start/end values. Thankfully, it is likely practically impossible for us to hit this in reality, as it would require a maple tree node pre-allocation failure that would likely never happen due to it being 'too small to fail', i.e. the kernel would simply keep retrying reclaim until it succeeded. However, this scenario remains theoretically possible, and what we are doing here is wrong so we must correct it. The safest option is, when this scenario occurs, to simply give up the operation. If we cannot allocate memory to merge, then we cannot allocate memory to split either (perhaps moreso!). Any scenario where this would be happening would be under very extreme (likely fatal) memory pressure, so it's best we give up early. So there is no doubt it is appropriate to simply bail out in this scenario. However, in general we must if at all possible never assume VMG state is stable after a merge attempt, since merge operations update VMG fields. As a result, additionally also make this clear by storing start, end in local variables. The issue was reported originally by syzkaller, and by Brad Spengler (via an off-list discussion), and in both instances it manifested as a triggering of the assert: VM_WARN_ON_VMG(start >= end, vmg); In vma_merge_existing_range(). It seems at least one scenario in which this is occurring is one in which the merge being attempted is due to an madvise() across multiple VMAs which looks like this: start end |<------>| |----------|------| | vma | next | |----------|------| When madvise_walk_vmas() is invoked, we first find vma in the above (determining prev to be equal to vma as we are offset into vma), and then enter the loop. We determine the end of vma that forms part of the range we are madvise()'ing by setting 'tmp' to this value: /* Here vma->vm_start <= start < (end|vma->vm_end) */ tmp = vma->vm_end; We then invoke the madvise() operation via visit(), letting prev get updated to point to vma as part of the operation: /* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */ error = visit(vma, &prev, start, tmp, arg); Where the visit() function pointer in this instance is madvise_vma_behavior(). As observed in syzkaller reports, it is ultimately madvise_update_vma() that is invoked, calling vma_modify_flags_name() and vma_modify() in turn. Then, in vma_modify(), we attempt the merge: merged = vma_merge_existing_range(vmg); if (merged) return merged; We invoke this with vmg->start, end set to start, tmp as such: start tmp |<--->| |----------|------| | vma | next | |----------|------| We find ourselves in the merge right scenario, but the one in which we cannot remove the middle (we are offset into vma). Here we have a special case where vmg->start, end get set to perhaps unintuitive values - we intended to shrink the middle VMA and expand the next. This means vmg->start, end are set to... vma->vm_start, start. Now the commit_merge() fails, and vmg->start, end are left like this. This means we return to the rest of vma_modify() with vmg->start, end (here denoted as start', end') set as: start' end' |<-->| |----------|------| | vma | next | |----------|------| So we now erroneously try to split accordingly. This is where the unfortunate stuff begins. We start with: /* Split any preceding portion of the VMA. */ if (vma->vm_start < vmg->start) { ... } This doesn't trigger as we are no longer offset into vma at the start. But then we invoke: /* Split any trailing portion of the VMA. */ if (vma->vm_end > vmg->end) { ... } Which does get invoked. This leaves us with: start' end' |<-->| |----|-----|------| | vma| new | next | |----|-----|------| We then return ultimately to madvise_walk_vmas(). Here 'new' is unknown, and putting back the values known in this function we are faced with: start tmp end | | | |----|-----|------| | vma| new | next | |----|-----|------| prev Then: start = tmp; So: start end | | |----|-----|------| | vma| new | next | |----|-----|------| prev The following code does not cause anything to happen: if (prev && start < prev->vm_end) start = prev->vm_end; if (start >= end) break; And then we invoke: if (prev) vma = find_vma(mm, prev->vm_end); Which is where a problem occurs - we don't know about 'new' so we essentially look for the vma after prev, which is new, whereas we actually intended to discover next! So we end up with: start end | | |----|-----|------| |prev| vma | next | |----|-----|------| And we have successfully bypassed all of the checks madvise_walk_vmas() has to ensure early exit should we end up moving out of range. We loop around, and hit: /* Here vma->vm_start <= start < (end|vma->vm_end) */ tmp = vma->vm_end; Oh dear. Now we have: tmp start end | | |----|-----|------| |prev| vma | next | |----|-----|------| We then invoke: /* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */ error = visit(vma, &prev, start, tmp, arg); Where start == tmp. That is, a zero range. This is not good. We invoke visit() which is madvise_vma_behavior() which does not check the range (for good reason, it assumes all checks have been done before it was called), which in turn finally calls madvise_update_vma(). The madvise_update_vma() function calls vma_modify_flags_name() in turn, which ultimately invokes vma_modify() with... start == end. vma_modify() calls vma_merge_existing_range() and finally we hit: VM_WARN_ON_VMG(start >= end, vmg); Which triggers, as start == end. While it might be useful to add some CONFIG_DEBUG_VM asserts in these instances to catch this kind of error, since we have just eliminated any possibility of that happening, we will add such asserts separately as to reduce churn and aid backporting. Link: https://lkml.kernel.org/r/20250222161952.41957-1-lorenzo.stoakes@oracle.com Fixes: 2f1c6611b0a8 ("mm: introduce vma_merge_struct and abstract vma_merge(),vma_modify()") Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Tested-by: Brad Spengler <brad.spengler@opensrcsec.com> Reported-by: Brad Spengler <brad.spengler@opensrcsec.com> Reported-by: syzbot+46423ed8fa1f1148c6e4@syzkaller.appspotmail.com Closes: https://lore.kernel.org/linux-mm/6774c98f.050a0220.25abdd.0991.GAE@google.com/ Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-05mm/hugetlb: wait for hugetlb folios to be freedGe Yang
Since the introduction of commit c77c0a8ac4c52 ("mm/hugetlb: defer freeing of huge pages if in non-task context"), which supports deferring the freeing of hugetlb pages, the allocation of contiguous memory through cma_alloc() may fail probabilistically. In the CMA allocation process, if it is found that the CMA area is occupied by in-use hugetlb folios, these in-use hugetlb folios need to be migrated to another location. When there are no available hugetlb folios in the free hugetlb pool during the migration of in-use hugetlb folios, new folios are allocated from the buddy system. A temporary state is set on the newly allocated folio. Upon completion of the hugetlb folio migration, the temporary state is transferred from the new folios to the old folios. Normally, when the old folios with the temporary state are freed, it is directly released back to the buddy system. However, due to the deferred freeing of hugetlb pages, the PageBuddy() check fails, ultimately leading to the failure of cma_alloc(). Here is a simplified call trace illustrating the process: cma_alloc() ->__alloc_contig_migrate_range() // Migrate in-use hugetlb folios ->unmap_and_move_huge_page() ->folio_putback_hugetlb() // Free old folios ->test_pages_isolated() ->__test_page_isolated_in_pageblock() ->PageBuddy(page) // Check if the page is in buddy To resolve this issue, we have implemented a function named wait_for_freed_hugetlb_folios(). This function ensures that the hugetlb folios are properly released back to the buddy system after their migration is completed. By invoking wait_for_freed_hugetlb_folios() before calling PageBuddy(), we ensure that PageBuddy() will succeed. Link: https://lkml.kernel.org/r/1739936804-18199-1-git-send-email-yangge1116@126.com Fixes: c77c0a8ac4c5 ("mm/hugetlb: defer freeing of huge pages if in non-task context") Signed-off-by: Ge Yang <yangge1116@126.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Acked-by: David Hildenbrand <david@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <21cnbao@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-05mm: fix possible NULL pointer dereference in __swap_duplicategao xu
Add a NULL check on the return value of swp_swap_info in __swap_duplicate to prevent crashes caused by NULL pointer dereference. The reason why swp_swap_info() returns NULL is unclear; it may be due to CPU cache issues or DDR bit flips. The probability of this issue is very small - it has been observed to occur approximately 1 in 500,000 times per week. The stack info we encountered is as follows: Unable to handle kernel NULL pointer dereference at virtual address 0000000000000058 [RB/E]rb_sreason_str_set: sreason_str set null_pointer Mem abort info: ESR = 0x0000000096000005 EC = 0x25: DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 FSC = 0x05: level 1 translation fault Data abort info: ISV = 0, ISS = 0x00000005, ISS2 = 0x00000000 CM = 0, WnR = 0, TnD = 0, TagAccess = 0 GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 user pgtable: 4k pages, 39-bit VAs, pgdp=00000008a80e5000 [0000000000000058] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000 Internal error: Oops: 0000000096000005 [#1] PREEMPT SMP Skip md ftrace buffer dump for: 0x1609e0 ... pc : swap_duplicate+0x44/0x164 lr : copy_page_range+0x508/0x1e78 sp : ffffffc0f2a699e0 x29: ffffffc0f2a699e0 x28: ffffff8a5b28d388 x27: ffffff8b06603388 x26: ffffffdf7291fe70 x25: 0000000000000006 x24: 0000000000100073 x23: 00000000002d2d2f x22: 0000000000000008 x21: 0000000000000000 x20: 00000000002d2d2f x19: 18000000002d2d2f x18: ffffffdf726faec0 x17: 0000000000000000 x16: 0010000000000001 x15: 0040000000000001 x14: 0400000000000001 x13: ff7ffffffffffb7f x12: ffeffffffffffbff x11: ffffff8a5c7e1898 x10: 0000000000000018 x9 : 0000000000000006 x8 : 1800000000000000 x7 : 0000000000000000 x6 : ffffff8057c01f10 x5 : 000000000000a318 x4 : 0000000000000000 x3 : 0000000000000000 x2 : 0000006daf200000 x1 : 0000000000000001 x0 : 18000000002d2d2f Call trace: swap_duplicate+0x44/0x164 copy_page_range+0x508/0x1e78 copy_process+0x1278/0x21cc kernel_clone+0x90/0x438 __arm64_sys_clone+0x5c/0x8c invoke_syscall+0x58/0x110 do_el0_svc+0x8c/0xe0 el0_svc+0x38/0x9c el0t_64_sync_handler+0x44/0xec el0t_64_sync+0x1a8/0x1ac Code: 9139c35a 71006f3f 54000568 f8797b55 (f9402ea8) ---[ end trace 0000000000000000 ]--- Kernel panic - not syncing: Oops: Fatal exception SMP: stopping secondary CPUs The patch seems to only provide a workaround, but there are no more effective software solutions to handle the bit flips problem. This path will change the issue from a system crash to a process exception, thereby reducing the impact on the entire machine. akpm: this is probably a kernel bug, but this patch keeps the system running and doesn't reduce that bug's debuggability. Link: https://lkml.kernel.org/r/e223b0e6ba2f4924984b1917cc717bd5@honor.com Signed-off-by: gao xu <gaoxu2@honor.com> Reviewed-by: Barry Song <baohua@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-05dma: kmsan: export kmsan_handle_dma() for modulesSebastian Andrzej Siewior
kmsan_handle_dma() is used by virtio_ring() which can be built as a module. kmsan_handle_dma() needs to be exported otherwise building the virtio_ring fails. Export kmsan_handle_dma for modules. Link: https://lkml.kernel.org/r/20250218091411.MMS3wBN9@linutronix.de Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202502150634.qjxwSeJR-lkp@intel.com/ Fixes: 7ade4f10779c ("dma: kmsan: unpoison DMA mappings") Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitriy Vyukov <dvyukov@google.com> Cc: Macro Elver <elver@google.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-05Documentation: fix doc link to fault-injection.rstUjwal Kundur
Fix incorrect reference to fault-injection docs Link: https://lkml.kernel.org/r/20250215105106.734-1-ujwal.kundur@gmail.com Signed-off-by: Ujwal Kundur <ujwal.kundur@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-05hwpoison, memory_hotplug: lock folio before unmap hwpoisoned folioMa Wupeng
Commit b15c87263a69 ("hwpoison, memory_hotplug: allow hwpoisoned pages to be offlined) add page poison checks in do_migrate_range in order to make offline hwpoisoned page possible by introducing isolate_lru_page and try_to_unmap for hwpoisoned page. However folio lock must be held before calling try_to_unmap. Add it to fix this problem. Warning will be produced if folio is not locked during unmap: ------------[ cut here ]------------ kernel BUG at ./include/linux/swapops.h:400! Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP Modules linked in: CPU: 4 UID: 0 PID: 411 Comm: bash Tainted: G W 6.13.0-rc1-00016-g3c434c7ee82a-dirty #41 Tainted: [W]=WARN Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015 pstate: 40400005 (nZcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : try_to_unmap_one+0xb08/0xd3c lr : try_to_unmap_one+0x3dc/0xd3c Call trace: try_to_unmap_one+0xb08/0xd3c (P) try_to_unmap_one+0x3dc/0xd3c (L) rmap_walk_anon+0xdc/0x1f8 rmap_walk+0x3c/0x58 try_to_unmap+0x88/0x90 unmap_poisoned_folio+0x30/0xa8 do_migrate_range+0x4a0/0x568 offline_pages+0x5a4/0x670 memory_block_action+0x17c/0x374 memory_subsys_offline+0x3c/0x78 device_offline+0xa4/0xd0 state_store+0x8c/0xf0 dev_attr_store+0x18/0x2c sysfs_kf_write+0x44/0x54 kernfs_fop_write_iter+0x118/0x1a8 vfs_write+0x3a8/0x4bc ksys_write+0x6c/0xf8 __arm64_sys_write+0x1c/0x28 invoke_syscall+0x44/0x100 el0_svc_common.constprop.0+0x40/0xe0 do_el0_svc+0x1c/0x28 el0_svc+0x30/0xd0 el0t_64_sync_handler+0xc8/0xcc el0t_64_sync+0x198/0x19c Code: f9407be0 b5fff320 d4210000 17ffff97 (d4210000) ---[ end trace 0000000000000000 ]--- Link: https://lkml.kernel.org/r/20250217014329.3610326-4-mawupeng1@huawei.com Fixes: b15c87263a69 ("hwpoison, memory_hotplug: allow hwpoisoned pages to be offlined") Signed-off-by: Ma Wupeng <mawupeng1@huawei.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-05mm: memory-hotplug: check folio ref count first in do_migrate_rangeMa Wupeng
If a folio has an increased reference count, folio_try_get() will acquire it, perform necessary operations, and then release it. In the case of a poisoned folio without an elevated reference count (which is unlikely for memory-failure), folio_try_get() will simply bypass it. Therefore, relocate the folio_try_get() function, responsible for checking and acquiring this reference count at first. Link: https://lkml.kernel.org/r/20250217014329.3610326-3-mawupeng1@huawei.com Signed-off-by: Ma Wupeng <mawupeng1@huawei.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-05mm: memory-failure: update ttu flag inside unmap_poisoned_folioMa Wupeng
Patch series "mm: memory_failure: unmap poisoned folio during migrate properly", v3. Fix two bugs during folio migration if the folio is poisoned. This patch (of 3): Commit 6da6b1d4a7df ("mm/hwpoison: convert TTU_IGNORE_HWPOISON to TTU_HWPOISON") introduce TTU_HWPOISON to replace TTU_IGNORE_HWPOISON in order to stop send SIGBUS signal when accessing an error page after a memory error on a clean folio. However during page migration, anon folio must be set with TTU_HWPOISON during unmap_*(). For pagecache we need some policy just like the one in hwpoison_user_mappings to set this flag. So move this policy from hwpoison_user_mappings to unmap_poisoned_folio to handle this warning properly. Warning will be produced during unamp poison folio with the following log: ------------[ cut here ]------------ WARNING: CPU: 1 PID: 365 at mm/rmap.c:1847 try_to_unmap_one+0x8fc/0xd3c Modules linked in: CPU: 1 UID: 0 PID: 365 Comm: bash Tainted: G W 6.13.0-rc1-00018-gacdb4bbda7ab #42 Tainted: [W]=WARN Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015 pstate: 20400005 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : try_to_unmap_one+0x8fc/0xd3c lr : try_to_unmap_one+0x3dc/0xd3c Call trace: try_to_unmap_one+0x8fc/0xd3c (P) try_to_unmap_one+0x3dc/0xd3c (L) rmap_walk_anon+0xdc/0x1f8 rmap_walk+0x3c/0x58 try_to_unmap+0x88/0x90 unmap_poisoned_folio+0x30/0xa8 do_migrate_range+0x4a0/0x568 offline_pages+0x5a4/0x670 memory_block_action+0x17c/0x374 memory_subsys_offline+0x3c/0x78 device_offline+0xa4/0xd0 state_store+0x8c/0xf0 dev_attr_store+0x18/0x2c sysfs_kf_write+0x44/0x54 kernfs_fop_write_iter+0x118/0x1a8 vfs_write+0x3a8/0x4bc ksys_write+0x6c/0xf8 __arm64_sys_write+0x1c/0x28 invoke_syscall+0x44/0x100 el0_svc_common.constprop.0+0x40/0xe0 do_el0_svc+0x1c/0x28 el0_svc+0x30/0xd0 el0t_64_sync_handler+0xc8/0xcc el0t_64_sync+0x198/0x19c ---[ end trace 0000000000000000 ]--- [mawupeng1@huawei.com: unmap_poisoned_folio(): remove shadowed local `mapping', per Miaohe] Link: https://lkml.kernel.org/r/20250219060653.3849083-1-mawupeng1@huawei.com Link: https://lkml.kernel.org/r/20250217014329.3610326-1-mawupeng1@huawei.com Link: https://lkml.kernel.org/r/20250217014329.3610326-2-mawupeng1@huawei.com Fixes: 6da6b1d4a7df ("mm/hwpoison: convert TTU_IGNORE_HWPOISON to TTU_HWPOISON") Signed-off-by: Ma Wupeng <mawupeng1@huawei.com> Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Ma Wupeng <mawupeng1@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-05arm: pgtable: fix NULL pointer dereference issueQi Zheng
When update_mmu_cache_range() is called by update_mmu_cache(), the vmf parameter is NULL, which will cause a NULL pointer dereference issue in adjust_pte(): Unable to handle kernel NULL pointer dereference at virtual address 00000030 when read Hardware name: Atmel AT91SAM9 PC is at update_mmu_cache_range+0x1e0/0x278 LR is at pte_offset_map_rw_nolock+0x18/0x2c Call trace: update_mmu_cache_range from remove_migration_pte+0x29c/0x2ec remove_migration_pte from rmap_walk_file+0xcc/0x130 rmap_walk_file from remove_migration_ptes+0x90/0xa4 remove_migration_ptes from migrate_pages_batch+0x6d4/0x858 migrate_pages_batch from migrate_pages+0x188/0x488 migrate_pages from compact_zone+0x56c/0x954 compact_zone from compact_node+0x90/0xf0 compact_node from kcompactd+0x1d4/0x204 kcompactd from kthread+0x120/0x12c kthread from ret_from_fork+0x14/0x38 Exception stack(0xc0d8bfb0 to 0xc0d8bff8) To fix it, do not rely on whether 'ptl' is equal to decide whether to hold the pte lock, but decide it by whether CONFIG_SPLIT_PTE_PTLOCKS is enabled. In addition, if two vmas map to the same PTE page, there is no need to hold the pte lock again, otherwise a deadlock will occur. Just add the need_lock parameter to let adjust_pte() know this information. Link: https://lkml.kernel.org/r/20250217024924.57996-1-zhengqi.arch@bytedance.com Fixes: fc9c45b71f43 ("arm: adjust_pte() use pte_offset_map_rw_nolock()") Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reported-by: Ezra Buehler <ezra.buehler@husqvarnagroup.com> Closes: https://lore.kernel.org/lkml/CAM1KZSmZ2T_riHvay+7cKEFxoPgeVpHkVFTzVVEQ1BO0cLkHEQ@mail.gmail.com/ Acked-by: David Hildenbrand <david@redhat.com> Tested-by: Ezra Buehler <ezra.buehler@husqvarnagroup.com> Cc: Hugh Dickins <hughd@google.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Russel King <linux@armlinux.org.uk> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-05m68k: sun3: add check for __pgd_alloc()Haoxiang Li
Add check for the return value of __pgd_alloc() in pgd_alloc() to prevent null pointer dereference. Link: https://lkml.kernel.org/r/20250217160017.2375536-1-haoxiang_li2024@163.com Fixes: a9b3c355c2e6 ("asm-generic: pgalloc: provide generic __pgd_{alloc,free}") Signed-off-by: Haoxiang Li <haoxiang_li2024@163.com> Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Sam Creasey <sammy@sammy.net> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-05selftests/damon/damos_quota_goal: handle minimum quota that cannot be ↵SeongJae Park
further reduced damos_quota_goal.py selftest see if DAMOS quota goals tuning feature increases or reduces the effective size quota for given score as expected. The tuning feature sets the minimum quota size as one byte, so if the effective size quota is already one, we cannot expect it further be reduced. However the test is not aware of the edge case, and fails since it shown no expected change of the effective quota. Handle the case by updating the failure logic for no change to see if it was the case, and simply skips to next test input. Link: https://lkml.kernel.org/r/20250217182304.45215-1-sj@kernel.org Fixes: f1c07c0a1662 ("selftests/damon: add a test for DAMOS quota goal") Signed-off-by: SeongJae Park <sj@kernel.org> Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202502171423.b28a918d-lkp@intel.com Cc: Shuah Khan <shuah@kernel.org> Cc: <stable@vger.kernel.org> [6.10.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-05Revert "selftests/mm: remove local __NR_* definitions"John Hubbard
This reverts commit a5c6bc590094a1a73cf6fa3f505e1945d2bf2461. The general approach described in commit e076eaca5906 ("selftests: break the dependency upon local header files") was taken one step too far here: it should not have been extended to include the syscall numbers. This is because doing so would require per-arch support in tools/include/uapi, and no such support exists. This revert fixes two separate reports of test failures, from Dave Hansen[1], and Li Wang[2]. An excerpt of Dave's report: Before this commit (a5c6bc590094a1a73cf6fa3f505e1945d2bf2461) things are fine. But after, I get: running PKEY tests for unsupported CPU/OS An excerpt of Li's report: I just found that mlock2_() return a wrong value in mlock2-test [1] https://lore.kernel.org/dc585017-6740-4cab-a536-b12b37a7582d@intel.com [2] https://lore.kernel.org/CAEemH2eW=UMu9+turT2jRie7+6ewUazXmA6kL+VBo3cGDGU6RA@mail.gmail.com Link: https://lkml.kernel.org/r/20250214033850.235171-1-jhubbard@nvidia.com Fixes: a5c6bc590094 ("selftests/mm: remove local __NR_* definitions") Signed-off-by: John Hubbard <jhubbard@nvidia.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Li Wang <liwang@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jeff Xu <jeffxu@chromium.org> Cc: Andrei Vagin <avagin@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Kees Cook <kees@kernel.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Muhammad Usama Anjum <usama.anjum@collabora.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rich Felker <dalias@libc.org> Cc: Shuah Khan <shuah@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-02-17test_xarray: fix failure in check_pause when CONFIG_XARRAY_MULTI is not definedKemeng Shi
In case CONFIG_XARRAY_MULTI is not defined, xa_store_order can store a multi-index entry but xas_for_each can't tell sbiling entry from valid entry. So the check_pause failed when we store a multi-index entry and wish xas_for_each can handle it normally. Avoid to store multi-index entry when CONFIG_XARRAY_MULTI is disabled to fix the failure. Link: https://lkml.kernel.org/r/20250213163659.414309-1-shikemeng@huaweicloud.com Fixes: c9ba5249ef8b ("Xarray: move forward index correctly in xas_pause()") Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Closes: https://lore.kernel.org/r/CAMuHMdU_bfadUO=0OZ=AoQ9EAmQPA4wsLCBqohXR+QCeCKRn4A@mail.gmail.com Tested-by: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-02-17kasan: don't call find_vm_area() in a PREEMPT_RT kernelWaiman Long
The following bug report was found when running a PREEMPT_RT debug kernel. BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48 in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 140605, name: kunit_try_catch preempt_count: 1, expected: 0 Call trace: rt_spin_lock+0x70/0x140 find_vmap_area+0x84/0x168 find_vm_area+0x1c/0x50 print_address_description.constprop.0+0x2a0/0x320 print_report+0x108/0x1f8 kasan_report+0x90/0xc8 Since commit e30a0361b851 ("kasan: make report_lock a raw spinlock"), report_lock was changed to raw_spinlock_t to fix another similar PREEMPT_RT problem. That alone isn't enough to cover other corner cases. print_address_description() is always invoked under the report_lock. The context under this lock is always atomic even on PREEMPT_RT. find_vm_area() acquires vmap_node::busy.lock which is a spinlock_t, becoming a sleeping lock on PREEMPT_RT and must not be acquired in atomic context. Don't invoke find_vm_area() on PREEMPT_RT and just print the address. Non-PREEMPT_RT builds remain unchanged. Add a DEFINE_WAIT_OVERRIDE_MAP() macro to tell lockdep that this lock nesting is allowed because the PREEMPT_RT part (which is invalid) has been taken care of. This macro was first introduced in commit 0cce06ba859a ("debugobjects,locking: Annotate debug_object_fill_pool() wait type violation"). Link: https://lkml.kernel.org/r/20250217204402.60533-1-longman@redhat.com Fixes: e30a0361b851 ("kasan: make report_lock a raw spinlock") Signed-off-by: Waiman Long <longman@redhat.com> Suggested-by: Andrey Konovalov <andreyknvl@gmail.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitriy Vyukov <dvyukov@google.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Mariano Pache <npache@redhat.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-02-17MAINTAINERS: update Nick's contact infoNick Desaulniers
Updated .mailmap, but forgot these other places. Link: https://lkml.kernel.org/r/20250212173523.3979840-1-ndesaulniers@google.com Signed-off-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-02-17selftests/mm: fix check for running THP testsMark Brown
When testing if we should try to compact memory or drop caches before we run the THP or HugeTLB tests we use | as an or operator. This doesn't work since run_vmtests.sh is written in shell where this is used to pipe the output of the first argument into the second. Instead use the shell's -o operator. Link: https://lkml.kernel.org/r/20250212-kselftest-mm-no-hugepages-v1-1-44702f538522@kernel.org Fixes: b433ffa8dbac ("selftests: mm: perform some system cleanup before using hugepages") Signed-off-by: Mark Brown <broonie@kernel.org> Reviewed-by: Nico Pache <npache@redhat.com> Cc: Mariano Pache <npache@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-02-17mm: hugetlb: avoid fallback for specific node allocation of 1G pagesLuiz Capitulino
When using the HugeTLB kernel command-line to allocate 1G pages from a specific node, such as: default_hugepagesz=1G hugepages=1:1 If node 1 happens to not have enough memory for the requested number of 1G pages, the allocation falls back to other nodes. A quick way to reproduce this is by creating a KVM guest with a memory-less node and trying to allocate 1 1G page from it. Instead of failing, the allocation will fallback to other nodes. This defeats the purpose of node specific allocation. Also, specific node allocation for 2M pages don't have this behavior: the allocation will just fail for the pages it can't satisfy. This issue happens because HugeTLB calls memblock_alloc_try_nid_raw() for 1G boot-time allocation as this function falls back to other nodes if the allocation can't be satisfied. Use memblock_alloc_exact_nid_raw() instead, which ensures that the allocation will only be satisfied from the specified node. Link: https://lkml.kernel.org/r/20250211034856.629371-1-luizcap@redhat.com Fixes: b5389086ad7b ("hugetlbfs: extend the definition of hugepages parameter to support node allocation") Signed-off-by: Luiz Capitulino <luizcap@redhat.com> Acked-by: Oscar Salvador <osalvador@suse.de> Acked-by: David Hildenbrand <david@redhat.com> Cc: "Mike Rapoport (IBM)" <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Zhenguo Yao <yaozhenguo1@gmail.com> Cc: Frank van der Linden <fvdl@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-02-17memcg: avoid dead loop when setting memory.maxChen Ridong
A softlockup issue was found with stress test: watchdog: BUG: soft lockup - CPU#27 stuck for 26s! [migration/27:181] CPU: 27 UID: 0 PID: 181 Comm: migration/27 6.14.0-rc2-next-20250210 #1 Stopper: multi_cpu_stop <- stop_machine_from_inactive_cpu RIP: 0010:stop_machine_yield+0x2/0x10 RSP: 0000:ff4a0dcecd19be48 EFLAGS: 00000246 RAX: ffffffff89c0108f RBX: ff4a0dcec03afe44 RCX: 0000000000000000 RDX: ff1cdaaf6eba5808 RSI: 0000000000000282 RDI: ff1cda80c1775a40 RBP: 0000000000000001 R08: 00000011620096c6 R09: 7fffffffffffffff R10: 0000000000000001 R11: 0000000000000100 R12: ff1cda80c1775a40 R13: 0000000000000000 R14: 0000000000000001 R15: ff4a0dcec03afe20 FS: 0000000000000000(0000) GS:ff1cdaaf6eb80000(0000) CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 00000025e2c2a001 CR4: 0000000000773ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: multi_cpu_stop+0x8f/0x100 cpu_stopper_thread+0x90/0x140 smpboot_thread_fn+0xad/0x150 kthread+0xc2/0x100 ret_from_fork+0x2d/0x50 The stress test involves CPU hotplug operations and memory control group (memcg) operations. The scenario can be described as follows: echo xx > memory.max cache_ap_online oom_reaper (CPU23) (CPU50) xx < usage stop_machine_from_inactive_cpu for(;;) // all active cpus trigger OOM queue_stop_cpus_work // waiting oom_reaper multi_cpu_stop(migration/xx) // sync all active cpus ack // waiting cpu23 ack // CPU50 loops in multi_cpu_stop waiting cpu50 Detailed explanation: 1. When the usage is larger than xx, an OOM may be triggered. If the process does not handle with ths kill signal immediately, it will loop in the memory_max_write. 2. When cache_ap_online is triggered, the multi_cpu_stop is queued to the active cpus. Within the multi_cpu_stop function, it attempts to synchronize the CPU states. However, the CPU23 didn't acknowledge because it is stuck in a loop within the for(;;). 3. The oom_reaper process is blocked because CPU50 is in a loop, waiting for CPU23 to acknowledge the synchronization request. 4. Finally, it formed cyclic dependency and lead to softlockup and dead loop. To fix this issue, add cond_resched() in the memory_max_write, so that it will not block migration task. Link: https://lkml.kernel.org/r/20250211081819.33307-1-chenridong@huaweicloud.com Fixes: b6e6edcfa405 ("mm: memcontrol: reclaim and OOM kill when shrinking memory.max below usage") Signed-off-by: Chen Ridong <chenridong@huawei.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Wang Weiyang <wangweiyang2@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-02-17mailmap: update Nick's entryNick Desaulniers
Link: https://lkml.kernel.org/r/20250211212117.3195265-1-ndesaulniers@google.com Signed-off-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-02-17mm: pgtable: fix incorrect reclaim of non-empty PTE pagesQi Zheng
In zap_pte_range(), if the pte lock was released midway, the pte entries may be refilled with physical pages by another thread, which may cause a non-empty PTE page to be reclaimed and eventually cause the system to crash. To fix it, fall back to the slow path in this case to recheck if all pte entries are still none. Link: https://lkml.kernel.org/r/20250211072625.89188-1-zhengqi.arch@bytedance.com Fixes: 6375e95f381e ("mm: pgtable: reclaim empty PTE page in madvise(MADV_DONTNEED)") Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reported-by: Christian Brauner <brauner@kernel.org> Closes: https://lore.kernel.org/all/20250207-anbot-bankfilialen-acce9d79a2c7@brauner/ Reported-by: Qu Wenruo <quwenruo.btrfs@gmx.com> Closes: https://lore.kernel.org/all/152296f3-5c81-4a94-97f3-004108fba7be@gmx.com/ Tested-by: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-02-17taskstats: modify taskstats versionWang Yaxin
After adding "delay max" and "delay min" to the taskstats structure, the taskstats version needs to be updated. Link: https://lkml.kernel.org/r/20250208144901218Q5ptVpqsQkb2MOEmW4Ujn@zte.com.cn Fixes: f65c64f311ee ("delayacct: add delay min to record delay peak") Signed-off-by: Wang Yaxin <wang.yaxin@zte.com.cn> Signed-off-by: Kun Jiang <jiang.kun2@zte.com.cn> Reviewed-by: xu xin <xu.xin16@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-02-17getdelays: fix error format charactersWang Yaxin
getdelays had a compilation issue because the format string was not updated when the "delay min" was added. For example, after adding the "delay min" in printf, there were 7 strings but only 6 "%s" format specifiers. Similarly, after adding the 't->cpu_delay_total', there were 7 variables but only 6 format characters specifiers, causing compilation issues as follows. This commit fixes these issues to ensure that getdelays compiles correctly. root@xx:~/linux-next/tools/accounting$ make getdelays.c:199:9: warning: format `%llu' expects argument of type `long long unsigned int', but argument 8 has type `char *' [-Wformat=] 199 | printf("\n\nCPU %15s%15s%15s%15s%15s%15s\n" | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ..... 216 | "delay total", "delay average", "delay max", "delay min", | ~~~~~~~~~~~ | | | char * getdelays.c:200:21: note: format string is defined here 200 | " %15llu%15llu%15llu%15llu%15.3fms%13.6fms\n" | ~~~~~^ | | | long long unsigned int | %15s getdelays.c:199:9: warning: format `%f' expects argument of type `double', but argument 12 has type `long long unsigned int' [-Wformat=] 199 | printf("\n\nCPU %15s%15s%15s%15s%15s%15s\n" | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ..... 220 | (unsigned long long)t->cpu_delay_total, | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | | | long long unsigned int ..... Link: https://lkml.kernel.org/r/20250208144400544RduNRhwIpT3m2JyRBqskZ@zte.com.cn Fixes: f65c64f311ee ("delayacct: add delay min to record delay peak") Reviewed-by: xu xin <xu.xin16@zte.com.cn> Signed-off-by: Wang Yaxin <wang.yaxin@zte.com.cn> Signed-off-by: Kun Jiang <jiang.kun2@zte.com.cn> Cc: Balbir Singh <bsingharora@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Fan Yu <fan.yu9@zte.com.cn> Cc: Peilin He <he.peilin@zte.com.cn> Cc: Qiang Tu <tu.qiang35@zte.com.cn> Cc: wangyong <wang.yong12@zte.com.cn> Cc: ye xingchen <ye.xingchen@zte.com.cn> Cc: Yunkai Zhang <zhang.yunkai@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-02-17mm/migrate_device: don't add folio to be freed to LRU in ↵David Hildenbrand
migrate_device_finalize() If migration succeeded, we called folio_migrate_flags()->mem_cgroup_migrate() to migrate the memcg from the old to the new folio. This will set memcg_data of the old folio to 0. Similarly, if migration failed, memcg_data of the dst folio is left unset. If we call folio_putback_lru() on such folios (memcg_data == 0), we will add the folio to be freed to the LRU, making memcg code unhappy. Running the hmm selftests: # ./hmm-tests ... # RUN hmm.hmm_device_private.migrate ... [ 102.078007][T14893] page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x7ff27d200 pfn:0x13cc00 [ 102.079974][T14893] anon flags: 0x17ff00000020018(uptodate|dirty|swapbacked|node=0|zone=2|lastcpupid=0x7ff) [ 102.082037][T14893] raw: 017ff00000020018 dead000000000100 dead000000000122 ffff8881353896c9 [ 102.083687][T14893] raw: 00000007ff27d200 0000000000000000 00000001ffffffff 0000000000000000 [ 102.085331][T14893] page dumped because: VM_WARN_ON_ONCE_FOLIO(!memcg && !mem_cgroup_disabled()) [ 102.087230][T14893] ------------[ cut here ]------------ [ 102.088279][T14893] WARNING: CPU: 0 PID: 14893 at ./include/linux/memcontrol.h:726 folio_lruvec_lock_irqsave+0x10e/0x170 [ 102.090478][T14893] Modules linked in: [ 102.091244][T14893] CPU: 0 UID: 0 PID: 14893 Comm: hmm-tests Not tainted 6.13.0-09623-g6c216bc522fd #151 [ 102.093089][T14893] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-2.fc40 04/01/2014 [ 102.094848][T14893] RIP: 0010:folio_lruvec_lock_irqsave+0x10e/0x170 [ 102.096104][T14893] Code: ... [ 102.099908][T14893] RSP: 0018:ffffc900236c37b0 EFLAGS: 00010293 [ 102.101152][T14893] RAX: 0000000000000000 RBX: ffffea0004f30000 RCX: ffffffff8183f426 [ 102.102684][T14893] RDX: ffff8881063cb880 RSI: ffffffff81b8117f RDI: ffff8881063cb880 [ 102.104227][T14893] RBP: 0000000000000000 R08: 0000000000000005 R09: 0000000000000000 [ 102.105757][T14893] R10: 0000000000000001 R11: 0000000000000002 R12: ffffc900236c37d8 [ 102.107296][T14893] R13: ffff888277a2bcb0 R14: 000000000000001f R15: 0000000000000000 [ 102.108830][T14893] FS: 00007ff27dbdd740(0000) GS:ffff888277a00000(0000) knlGS:0000000000000000 [ 102.110643][T14893] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 102.111924][T14893] CR2: 00007ff27d400000 CR3: 000000010866e000 CR4: 0000000000750ef0 [ 102.113478][T14893] PKRU: 55555554 [ 102.114172][T14893] Call Trace: [ 102.114805][T14893] <TASK> [ 102.115397][T14893] ? folio_lruvec_lock_irqsave+0x10e/0x170 [ 102.116547][T14893] ? __warn.cold+0x110/0x210 [ 102.117461][T14893] ? folio_lruvec_lock_irqsave+0x10e/0x170 [ 102.118667][T14893] ? report_bug+0x1b9/0x320 [ 102.119571][T14893] ? handle_bug+0x54/0x90 [ 102.120494][T14893] ? exc_invalid_op+0x17/0x50 [ 102.121433][T14893] ? asm_exc_invalid_op+0x1a/0x20 [ 102.122435][T14893] ? __wake_up_klogd.part.0+0x76/0xd0 [ 102.123506][T14893] ? dump_page+0x4f/0x60 [ 102.124352][T14893] ? folio_lruvec_lock_irqsave+0x10e/0x170 [ 102.125500][T14893] folio_batch_move_lru+0xd4/0x200 [ 102.126577][T14893] ? __pfx_lru_add+0x10/0x10 [ 102.127505][T14893] __folio_batch_add_and_move+0x391/0x720 [ 102.128633][T14893] ? __pfx_lru_add+0x10/0x10 [ 102.129550][T14893] folio_putback_lru+0x16/0x80 [ 102.130564][T14893] migrate_device_finalize+0x9b/0x530 [ 102.131640][T14893] dmirror_migrate_to_device.constprop.0+0x7c5/0xad0 [ 102.133047][T14893] dmirror_fops_unlocked_ioctl+0x89b/0xc80 Likely, nothing else goes wrong: putting the last folio reference will remove the folio from the LRU again. So besides memcg complaining, adding the folio to be freed to the LRU is just an unnecessary step. The new flow resembles what we have in migrate_folio_move(): add the dst to the lru, remove migration ptes, unlock and unref dst. Link: https://lkml.kernel.org/r/20250210161317.717936-1-david@redhat.com Fixes: 8763cb45ab96 ("mm/migrate: new memory migration helper for use with device memory") Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-02-17tools/mm: fix build warnings with musl-libcFlorian Fainelli
musl-libc warns about the following: /home/florian/dev/buildroot/output/arm64/rpi4-b/host/aarch64-buildroot-linux-musl/sysroot/usr/include/sys/errno.h:1:2: attention: #warning redirecting incorrect #include <sys/errno.h> to <errno.h> [-Wcpp] 1 | #warning redirecting incorrect #include <sys/errno.h> to <errno.h> | ^~~~~~~ /home/florian/dev/buildroot/output/arm64/rpi4-b/host/aarch64-buildroot-linux-musl/sysroot/usr/include/sys/fcntl.h:1:2: attention: #warning redirecting incorrect #include <sys/fcntl.h> to <fcntl.h> [-Wcpp] 1 | #warning redirecting incorrect #include <sys/fcntl.h> to <fcntl.h> | ^~~~~~~ include errno.h and fcntl.h directly. Link: https://lkml.kernel.org/r/20250210200518.1137295-1-florian.fainelli@broadcom.com Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-02-17mailmap: add entry for Feng TangFeng Tang
Map my old business email to personal email. Link: https://lkml.kernel.org/r/20250205060457.53667-1-feng.tang@linux.alibaba.com Signed-off-by: Feng Tang <feng.tang@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-02-17.mailmap: add entries for Jeff JohnsonJeff Johnson
Map past iterations of my e-mail addresses to the current one. Link: https://lkml.kernel.org/r/20250205-jjohnson-mailmap-v1-1-269cb7b1710d@oss.qualcomm.com Signed-off-by: Jeff Johnson <jeff.johnson@oss.qualcomm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-02-17mm,madvise,hugetlb: check for 0-length range after end address adjustmentRicardo Cañuelo Navarro
Add a sanity check to madvise_dontneed_free() to address a corner case in madvise where a race condition causes the current vma being processed to be backed by a different page size. During a madvise(MADV_DONTNEED) call on a memory region registered with a userfaultfd, there's a period of time where the process mm lock is temporarily released in order to send a UFFD_EVENT_REMOVE and let userspace handle the event. During this time, the vma covering the current address range may change due to an explicit mmap done concurrently by another thread. If, after that change, the memory region, which was originally backed by 4KB pages, is now backed by hugepages, the end address is rounded down to a hugepage boundary to avoid data loss (see "Fixes" below). This rounding may cause the end address to be truncated to the same address as the start. Make this corner case follow the same semantics as in other similar cases where the requested region has zero length (ie. return 0). This will make madvise_walk_vmas() continue to the next vma in the range (this time holding the process mm lock) which, due to the prev pointer becoming stale because of the vma change, will be the same hugepage-backed vma that was just checked before. The next time madvise_dontneed_free() runs for this vma, if the start address isn't aligned to a hugepage boundary, it'll return -EINVAL, which is also in line with the madvise api. From userspace perspective, madvise() will return EINVAL because the start address isn't aligned according to the new vma alignment requirements (hugepage), even though it was correctly page-aligned when the call was issued. Link: https://lkml.kernel.org/r/20250203075206.1452208-1-rcn@igalia.com Fixes: 8ebe0a5eaaeb ("mm,madvise,hugetlb: fix unexpected data loss with MADV_DONTNEED on hugetlbfs") Signed-off-by: Ricardo Cañuelo Navarro <rcn@igalia.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Florent Revest <revest@google.com> Cc: Rik van Riel <riel@surriel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-02-17mm/zswap: fix inconsistency when zswap_store_page() failsHyeonggon Yoo
Commit b7c0ccdfbafd ("mm: zswap: support large folios in zswap_store()") skips charging any zswap entries when it failed to zswap the entire folio. However, when some base pages are zswapped but it failed to zswap the entire folio, the zswap operation is rolled back. When freeing zswap entries for those pages, zswap_entry_free() uncharges the zswap entries that were not previously charged, causing zswap charging to become inconsistent. This inconsistency triggers two warnings with following steps: # On a machine with 64GiB of RAM and 36GiB of zswap $ stress-ng --bigheap 2 # wait until the OOM-killer kills stress-ng $ sudo reboot The two warnings are: in mm/memcontrol.c:163, function obj_cgroup_release(): WARN_ON_ONCE(nr_bytes & (PAGE_SIZE - 1)); in mm/page_counter.c:60, function page_counter_cancel(): if (WARN_ONCE(new < 0, "page_counter underflow: %ld nr_pages=%lu\n", new, nr_pages)) zswap_stored_pages also becomes inconsistent in the same way. As suggested by Kanchana, increment zswap_stored_pages and charge zswap entries within zswap_store_page() when it succeeds. This way, zswap_entry_free() will decrement the counter and uncharge the entries when it failed to zswap the entire folio. While this could potentially be optimized by batching objcg charging and incrementing the counter, let's focus on fixing the bug this time and leave the optimization for later after some evaluation. After resolving the inconsistency, the warnings disappear. [42.hyeyoo@gmail.com: refactor zswap_store_page()] Link: https://lkml.kernel.org/r/20250131082037.2426-1-42.hyeyoo@gmail.com Link: https://lkml.kernel.org/r/20250129100844.2935-1-42.hyeyoo@gmail.com Fixes: b7c0ccdfbafd ("mm: zswap: support large folios in zswap_store()") Co-developed-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Acked-by: Yosry Ahmed <yosry.ahmed@linux.dev> Acked-by: Nhat Pham <nphamcs@gmail.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-02-17lib/iov_iter: fix import_iovec_ubuf iovec managementPavel Begunkov
import_iovec() says that it should always be fine to kfree the iovec returned in @iovp regardless of the error code. __import_iovec_ubuf() never reallocates it and thus should clear the pointer even in cases when copy_iovec_*() fail. Link: https://lkml.kernel.org/r/378ae26923ffc20fd5e41b4360d673bf47b1775b.1738332461.git.asml.silence@gmail.com Fixes: 3b2deb0e46da ("iov_iter: import single vector iovecs as ITER_UBUF") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-02-17procfs: fix a locking bug in a vmcore_add_device_dump() error pathBart Van Assche
Unlock vmcore_mutex when returning -EBUSY. Link: https://lkml.kernel.org/r/20250129222003.1495713-1-bvanassche@acm.org Fixes: 0f3b1c40c652 ("fs/proc/vmcore: disallow vmcore modifications while the vmcore is open") Signed-off-by: Bart Van Assche <bvanassche@acm.org> Acked-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Baoquan he <bhe@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-02-16Linux 6.14-rc3Linus Torvalds
2025-02-16Merge tag 'kbuild-fixes-v6.14-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull Kbuild fixes from Masahiro Yamada: - Fix annoying logs when building tools in parallel - Fix the Debian linux-headers package build again - Fix the target triple detection for userspace programs on Clang * tag 'kbuild-fixes-v6.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: modpost: Fix a few typos in a comment kbuild: userprogs: fix bitsize and target detection on clang kbuild: fix linux-headers package build when $(CC) cannot link userspace tools: fix annoying "mkdir -p ..." logs when building tools in parallel