summaryrefslogtreecommitdiff
path: root/include/linux
AgeCommit message (Collapse)Author
2025-07-09mm: change vm_get_page_prot() to accept vm_flags_t argumentLorenzo Stoakes
Patch series "use vm_flags_t consistently". The VMA flags field vma->vm_flags is of type vm_flags_t. Right now this is exactly equivalent to unsigned long, but it should not be assumed to be. Much code that references vma->vm_flags already correctly uses vm_flags_t, but a fairly large chunk of code simply uses unsigned long and assumes that the two are equivalent. This series corrects that and has us use vm_flags_t consistently. This series is motivated by the desire to, in a future series, adjust vm_flags_t to be a u64 regardless of whether the kernel is 32-bit or 64-bit in order to deal with the VMA flag exhaustion issue and avoid all the various problems that arise from it (being unable to use certain features in 32-bit, being unable to add new flags except for 64-bit, etc.) This is therefore a critical first step towards that goal. At any rate, using the correct type is of value regardless. We additionally take the opportunity to refer to VMA flags as vm_flags where possible to make clear what we're referring to. Overall, this series does not introduce any functional change. This patch (of 3): We abstract the type of the VMA flags to vm_flags_t, however in may places it is simply assumed this is unsigned long, which is simply incorrect. At the moment this is simply an incongruity, however in future we plan to change this type and therefore this change is a critical requirement for doing so. Overall, this patch does not introduce any functional change. [lorenzo.stoakes@oracle.com: add missing vm_get_page_prot() instance, remove include] Link: https://lkml.kernel.org/r/552f88e1-2df8-4e95-92b8-812f7c8db829@lucifer.local Link: https://lkml.kernel.org/r/cover.1750274467.git.lorenzo.stoakes@oracle.com Link: https://lkml.kernel.org/r/a12769720a2743f235643b158c4f4f0a9911daf0.1750274467.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Acked-by: Catalin Marinas <catalin.marinas@arm.com> [arm64] Acked-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Jann Horn <jannh@google.com> Cc: Kees Cook <kees@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jarkko Sakkinen <jarkko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09highmem: remove a use of folio->pageMatthew Wilcox (Oracle)
Call folio_address() instead of page_address(). Link: https://lkml.kernel.org/r/20250613194825.3175276-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm/huge_memory: don't mark refcounted folios special in vmf_insert_folio_pud()David Hildenbrand
Marking PUDs that map a "normal" refcounted folios as special is against our rules documented for vm_normal_page(). normal (refcounted) folios shall never have the page table mapping marked as special. Fortunately, there are not that many pud_special() check that can be mislead and are right now rather harmless: e.g., none so far bases decisions whether to grab a folio reference on that decision. Well, and GUP-fast will fallback to GUP-slow. All in all, so far no big implications as it seems. Getting this right will get more important as we introduce folio_normal_page_pud() and start using it in more place where we currently special-case based on other VMA flags. Fix it just like we fixed vmf_insert_folio_pmd(). Add folio_mk_pud() to mimic what we do with folio_mk_pmd(). Link: https://lkml.kernel.org/r/20250613092702.1943533-4-david@redhat.com Fixes: dbe54153296d ("mm/huge_memory: add vmf_insert_folio_pud()") Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Alistair Popple <apopple@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm: remove zero_user()Matthew Wilcox (Oracle)
All users have now been converted to either memzero_page() or folio_zero_range(). Link: https://lkml.kernel.org/r/20250612143443.2848197-6-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Alex Markuze <amarkuze@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Xiubo Li <xiubli@redhat.com> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm: madvise: use per_vma lock for MADV_FREEBarry Song
MADV_FREE is another option, besides MADV_DONTNEED, for dynamic memory freeing in user-space native or Java heap memory management. For example, jemalloc can be configured to use MADV_FREE, and recent versions of the Android Java heap have also increasingly adopted MADV_FREE. Supporting per-VMA locking for MADV_FREE thus appears increasingly necessary. We have replaced walk_page_range() with walk_page_range_vma(). Along with the proposed madvise_lock_mode by Lorenzo, the necessary infrastructure is now in place to begin exploring per-VMA locking support for MADV_FREE and potentially other madvise using walk_page_range_vma(). This patch adds support for the PGWALK_VMA_RDLOCK walk_lock mode in walk_page_range_vma(), and leverages madvise_lock_mode from madv_behavior to select the appropriate walk_lock—either mmap_lock or per-VMA lock—based on the context. Because we now dynamically update the walk_ops->walk_lock field, we must ensure this is thread-safe. The madvise_free_walk_ops is now defined as a stack variable instead of a global constant. Link: https://lkml.kernel.org/r/20250611104745.57405-1-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: SeongJae Park <sj@kernel.org> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Tangquan Zheng <zhengtangquan@oppo.com> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm/memory-tier: fix abstract distance calculation overflowLi Zhijian
In mt_perf_to_adistance(), the calculation of abstract distance (adist) involves multiplying several int values including MEMTIER_ADISTANCE_DRAM. *adist = MEMTIER_ADISTANCE_DRAM * (perf->read_latency + perf->write_latency) / (default_dram_perf.read_latency + default_dram_perf.write_latency) * (default_dram_perf.read_bandwidth + default_dram_perf.write_bandwidth) / (perf->read_bandwidth + perf->write_bandwidth); Since these values can be large, the multiplication may exceed the maximum value of an int (INT_MAX) and overflow (Our platform did), leading to an incorrect adist. User-visible impact: The memory tiering subsystem will misinterpret slow memory (like CXL) as faster than DRAM, causing inappropriate demotion of pages from CXL (slow memory) to DRAM (fast memory). For example, we will see the following demotion chains from the dmesg, where Node0,1 are DRAM, and Node2,3 are CXL node: Demotion targets for Node 0: null Demotion targets for Node 1: null Demotion targets for Node 2: preferred: 0-1, fallback: 0-1 Demotion targets for Node 3: preferred: 0-1, fallback: 0-1 Change MEMTIER_ADISTANCE_DRAM to be a long constant by writing it with the 'L' suffix. This prevents the overflow because the multiplication will then be done in the long type which has a larger range. Link: https://lkml.kernel.org/r/20250611023439.2845785-1-lizhijian@fujitsu.com Link: https://lkml.kernel.org/r/20250610062751.2365436-1-lizhijian@fujitsu.com Fixes: 3718c02dbd4c ("acpi, hmat: calculate abstract distance with HMAT") Signed-off-by: Li Zhijian <lizhijian@fujitsu.com> Reviewed-by: Huang Ying <ying.huang@linux.alibaba.com> Acked-by: Balbir Singh <balbirs@nvidia.com> Reviewed-by: Donet Tom <donettom@linux.ibm.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09alloc_tag: add sequence number for module and iteratorDavid Wang
Codetag iterator use <id,address> pair to guarantee the validness. But both id and address can be reused, there is theoretical possibility when module inserted right after another module removed, kmalloc returns an address same as the address kfree by previous module and IDR key reuses the key recently removed. Add a sequence number to codetag_module and code_iterator, the sequence number is strickly incremented whenever a module is loaded. An iterator is valid if and only if its sequence number match codetag_module's. Link: https://lkml.kernel.org/r/20250609064200.112639-1-00107082@163.com Signed-off-by: David Wang <00107082@163.com> Acked-by: Suren Baghdasaryan <surenb@google.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm/pagewalk: split walk_page_range_novma() into kernel/user partsLorenzo Stoakes
walk_page_range_novma() is rather confusing - it supports two modes, one used often, the other used only for debugging. The first mode is the common case of traversal of kernel page tables, which is what nearly all callers use this for. Secondly it provides an unusual debugging interface that allows for the traversal of page tables in a userland range of memory even for that memory which is not described by a VMA. It is far from certain that such page tables should even exist, but perhaps this is precisely why it is useful as a debugging mechanism. As a result, this is utilised by ptdump only. Historically, things were reversed - ptdump was the only user, and other parts of the kernel evolved to use the kernel page table walking here. Since we have some complicated and confusing locking rules for the novma case, it makes sense to separate the two usages into their own functions. Doing this also provide self-documentation as to the intent of the caller - are they doing something rather unusual or are they simply doing a standard kernel page table walk? We therefore establish two separate functions - walk_page_range_debug() for this single usage, and walk_kernel_page_table_range() for general kernel page table walking. The walk_page_range_debug() function is currently used to traverse both userland and kernel mappings, so we maintain this and in the case of kernel mappings being traversed, we have walk_page_range_debug() invoke walk_kernel_page_table_range() internally. We additionally make walk_page_range_debug() internal to mm. Link: https://lkml.kernel.org/r/20250605135104.90720-1-lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Hildenbrand <david@redhat.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Barry Song <baohua@kernel.org> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Jonas Bonn <jonas@southpole.se> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Stafford Horne <shorne@gmail.com> Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi> Cc: WANG Xuerui <kernel@xen0n.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm/filemap: allow arch to request folio size for exec memoryRyan Roberts
Change the readahead config so that if it is being requested for an executable mapping, do a synchronous read into a set of folios with an arch-specified order and in a naturally aligned manner. We no longer center the read on the faulting page but simply align it down to the previous natural boundary. Additionally, we don't bother with an asynchronous part. On arm64 if memory is physically contiguous and naturally aligned to the "contpte" size, we can use contpte mappings, which improves utilization of the TLB. When paired with the "multi-size THP" feature, this works well to reduce dTLB pressure. However iTLB pressure is still high due to executable mappings having a low likelihood of being in the required folio size and mapping alignment, even when the filesystem supports readahead into large folios (e.g. XFS). The reason for the low likelihood is that the current readahead algorithm starts with an order-0 folio and increases the folio order by 2 every time the readahead mark is hit. But most executable memory tends to be accessed randomly and so the readahead mark is rarely hit and most executable folios remain order-0. So let's special-case the read(ahead) logic for executable mappings. The trade-off is performance improvement (due to more efficient storage of the translations in iTLB) vs potential for making reclaim more difficult (due to the folios being larger so if a part of the folio is hot the whole thing is considered hot). But executable memory is a small portion of the overall system memory so I doubt this will even register from a reclaim perspective. I've chosen 64K folio size for arm64 which benefits both the 4K and 16K base page size configs. Crucially the same amount of data is still read (usually 128K) so I'm not expecting any read amplification issues. I don't anticipate any write amplification because text is always RO. Note that the text region of an ELF file could be populated into the page cache for other reasons than taking a fault in a mmapped area. The most common case is due to the loader read()ing the header which can be shared with the beginning of text. So some text will still remain in small folios, but this simple, best effort change provides good performance improvements as is. Confine this special-case approach to the bounds of the VMA. This prevents wasting memory for any padding that might exist in the file between sections. Previously the padding would have been contained in order-0 folios and would be easy to reclaim. But now it would be part of a larger folio so more difficult to reclaim. Solve this by simply not reading it into memory in the first place. Benchmarking ============ The below shows pgbench and redis benchmarks on Graviton3 arm64 system. First, confirmation that this patch causes more text to be contained in 64K folios: +----------------------+---------------+---------------+---------------+ | File-backed folios by| system boot | pgbench | redis | | size as percentage of+-------+-------+-------+-------+-------+-------+ | all mapped text mem |before | after |before | after |before | after | +======================+=======+=======+=======+=======+=======+=======+ | base-page-4kB | 78% | 30% | 78% | 11% | 73% | 14% | | thp-aligned-8kB | 1% | 0% | 0% | 0% | 1% | 0% | | thp-aligned-16kB | 17% | 4% | 17% | 3% | 20% | 4% | | thp-aligned-32kB | 1% | 1% | 1% | 2% | 1% | 1% | | thp-aligned-64kB | 3% | 63% | 3% | 81% | 4% | 77% | | thp-aligned-128kB | 0% | 1% | 1% | 1% | 1% | 2% | | thp-unaligned-64kB | 0% | 0% | 0% | 1% | 0% | 1% | | thp-unaligned-128kB | 0% | 1% | 0% | 0% | 0% | 0% | | thp-partial | 0% | 0% | 0% | 1% | 0% | 1% | +----------------------+-------+-------+-------+-------+-------+-------+ | cont-aligned-64kB | 4% | 65% | 4% | 83% | 6% | 79% | +----------------------+-------+-------+-------+-------+-------+-------+ The above shows that for both workloads (each isolated with cgroups) as well as the general system state after boot, the amount of text backed by 4K and 16K folios reduces and the amount backed by 64K folios increases significantly. And the amount of text that is contpte-mapped significantly increases (see last row). And this is reflected in performance improvement. "(I)" indicates a statistically significant improvement. Note TPS and Reqs/sec are rates so bigger is better, ms is time so smaller is better: +-------------+-------------------------------------------+------------+ | Benchmark | Result Class | Improvemnt | +=============+===========================================+============+ | pts/pgbench | Scale: 1 Clients: 1 RO (TPS) | (I) 3.47% | | | Scale: 1 Clients: 1 RO - Latency (ms) | -2.88% | | | Scale: 1 Clients: 250 RO (TPS) | (I) 5.02% | | | Scale: 1 Clients: 250 RO - Latency (ms) | (I) -4.79% | | | Scale: 1 Clients: 1000 RO (TPS) | (I) 6.16% | | | Scale: 1 Clients: 1000 RO - Latency (ms) | (I) -5.82% | | | Scale: 100 Clients: 1 RO (TPS) | 2.51% | | | Scale: 100 Clients: 1 RO - Latency (ms) | -3.51% | | | Scale: 100 Clients: 250 RO (TPS) | (I) 4.75% | | | Scale: 100 Clients: 250 RO - Latency (ms) | (I) -4.44% | | | Scale: 100 Clients: 1000 RO (TPS) | (I) 6.34% | | | Scale: 100 Clients: 1000 RO - Latency (ms)| (I) -5.95% | +-------------+-------------------------------------------+------------+ | pts/redis | Test: GET Connections: 50 (Reqs/sec) | (I) 3.20% | | | Test: GET Connections: 1000 (Reqs/sec) | (I) 2.55% | | | Test: LPOP Connections: 50 (Reqs/sec) | (I) 4.59% | | | Test: LPOP Connections: 1000 (Reqs/sec) | (I) 4.81% | | | Test: LPUSH Connections: 50 (Reqs/sec) | (I) 5.31% | | | Test: LPUSH Connections: 1000 (Reqs/sec) | (I) 4.36% | | | Test: SADD Connections: 50 (Reqs/sec) | (I) 2.64% | | | Test: SADD Connections: 1000 (Reqs/sec) | (I) 4.15% | | | Test: SET Connections: 50 (Reqs/sec) | (I) 3.11% | | | Test: SET Connections: 1000 (Reqs/sec) | (I) 3.36% | +-------------+-------------------------------------------+------------+ [ryan.roberts@arm.com: fix use-after-free] Link: https://lkml.kernel.org/r/ea7f9da7-9a9f-4b85-9d0a-35b320f5ed25@arm.com [ryan.roberts@arm.com: use the vma_pages() helper instead of open-coding] Link: https://lkml.kernel.org/r/0e0f674b-3b7e-494f-ae7a-fc9dbb98dad4@arm.com Link: https://lkml.kernel.org/r/20250609092729.274960-6-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Will Deacon <will@kernel.org> Cc: Chaitanya S Prakash <chaitanyas.prakash@arm.com> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm/readahead: store folio order in struct file_ra_stateRyan Roberts
Previously the folio order of the previous readahead request was inferred from the folio who's readahead marker was hit. But due to the way we have to round to non-natural boundaries sometimes, this first folio in the readahead block is often smaller than the preferred order for that request. This means that for cases where the initial sync readahead is poorly aligned, the folio order will ramp up much more slowly. So instead, let's store the order in struct file_ra_state so we are not affected by any required alignment. We previously made enough room in the struct for a 16 order field. This should be plenty big enough since we are limited to MAX_PAGECACHE_ORDER anyway, which is certainly never larger than ~20. Since we now pass order in struct file_ra_state, page_cache_ra_order() no longer needs it's new_order parameter, so let's remove that. Worked example: Here we are touching pages 17-256 sequentially just as we did in the previous commit, but now that we are remembering the preferred order explicitly, we no longer have the slow ramp up problem. Note specifically that we no longer have 2 rounds (2x ~128K) of order-2 folios: TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER RA ----- ---------- ---------- ---------- ------- ------- ----- ----- -- HOLE 0x00000000 0x00001000 4096 0 1 1 FOLIO 0x00001000 0x00002000 4096 1 2 1 0 FOLIO 0x00002000 0x00003000 4096 2 3 1 0 FOLIO 0x00003000 0x00004000 4096 3 4 1 0 FOLIO 0x00004000 0x00005000 4096 4 5 1 0 FOLIO 0x00005000 0x00006000 4096 5 6 1 0 FOLIO 0x00006000 0x00007000 4096 6 7 1 0 FOLIO 0x00007000 0x00008000 4096 7 8 1 0 FOLIO 0x00008000 0x00009000 4096 8 9 1 0 FOLIO 0x00009000 0x0000a000 4096 9 10 1 0 FOLIO 0x0000a000 0x0000b000 4096 10 11 1 0 FOLIO 0x0000b000 0x0000c000 4096 11 12 1 0 FOLIO 0x0000c000 0x0000d000 4096 12 13 1 0 FOLIO 0x0000d000 0x0000e000 4096 13 14 1 0 FOLIO 0x0000e000 0x0000f000 4096 14 15 1 0 FOLIO 0x0000f000 0x00010000 4096 15 16 1 0 FOLIO 0x00010000 0x00011000 4096 16 17 1 0 FOLIO 0x00011000 0x00012000 4096 17 18 1 0 FOLIO 0x00012000 0x00013000 4096 18 19 1 0 FOLIO 0x00013000 0x00014000 4096 19 20 1 0 FOLIO 0x00014000 0x00015000 4096 20 21 1 0 FOLIO 0x00015000 0x00016000 4096 21 22 1 0 FOLIO 0x00016000 0x00017000 4096 22 23 1 0 FOLIO 0x00017000 0x00018000 4096 23 24 1 0 FOLIO 0x00018000 0x00019000 4096 24 25 1 0 FOLIO 0x00019000 0x0001a000 4096 25 26 1 0 FOLIO 0x0001a000 0x0001b000 4096 26 27 1 0 FOLIO 0x0001b000 0x0001c000 4096 27 28 1 0 FOLIO 0x0001c000 0x0001d000 4096 28 29 1 0 FOLIO 0x0001d000 0x0001e000 4096 29 30 1 0 FOLIO 0x0001e000 0x0001f000 4096 30 31 1 0 FOLIO 0x0001f000 0x00020000 4096 31 32 1 0 FOLIO 0x00020000 0x00021000 4096 32 33 1 0 FOLIO 0x00021000 0x00022000 4096 33 34 1 0 FOLIO 0x00022000 0x00024000 8192 34 36 2 1 FOLIO 0x00024000 0x00028000 16384 36 40 4 2 FOLIO 0x00028000 0x0002c000 16384 40 44 4 2 FOLIO 0x0002c000 0x00030000 16384 44 48 4 2 FOLIO 0x00030000 0x00034000 16384 48 52 4 2 FOLIO 0x00034000 0x00038000 16384 52 56 4 2 FOLIO 0x00038000 0x0003c000 16384 56 60 4 2 FOLIO 0x0003c000 0x00040000 16384 60 64 4 2 FOLIO 0x00040000 0x00050000 65536 64 80 16 4 FOLIO 0x00050000 0x00060000 65536 80 96 16 4 FOLIO 0x00060000 0x00080000 131072 96 128 32 5 FOLIO 0x00080000 0x000a0000 131072 128 160 32 5 FOLIO 0x000a0000 0x000c0000 131072 160 192 32 5 FOLIO 0x000c0000 0x000e0000 131072 192 224 32 5 FOLIO 0x000e0000 0x00100000 131072 224 256 32 5 FOLIO 0x00100000 0x00120000 131072 256 288 32 5 FOLIO 0x00120000 0x00140000 131072 288 320 32 5 Y HOLE 0x00140000 0x00800000 7077888 320 2048 1728 Link: https://lkml.kernel.org/r/20250609092729.274960-5-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Chaitanya S Prakash <chaitanyas.prakash@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm/readahead: make space in struct file_ra_stateRyan Roberts
We need to be able to store the preferred folio order associated with a readahead request in the struct file_ra_state so that we can more accurately increase the order across subsequent readahead requests. But struct file_ra_state is per-struct file, so we don't really want to increase it's size. mmap_miss is currently 32 bits but it is only counted up to 10 * MMAP_LOTSAMISS, which is currently defined as 1000. So 16 bits should be plenty. Redefine it to unsigned short, making room for order as unsigned short in follow up commit. Link: https://lkml.kernel.org/r/20250609092729.274960-4-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Chaitanya S Prakash <chaitanyas.prakash@arm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09proc: use the same treatment to check proc_lseek as ones for proc_read_iter ↵wangzijie
et.al Check pde->proc_ops->proc_lseek directly may cause UAF in rmmod scenario. It's a gap in proc_reg_open() after commit 654b33ada4ab("proc: fix UAF in proc_get_inode()"). Followed by AI Viro's suggestion, fix it in same manner. Link: https://lkml.kernel.org/r/20250607021353.1127963-1-wangzijie1@honor.com Fixes: 3f61631d47f1 ("take care to handle NULL ->proc_lseek()") Signed-off-by: wangzijie <wangzijie1@honor.com> Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com> Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09userfaultfd: remove UFFD_CLOEXEC, UFFD_NONBLOCK, and UFFD_FLAGS_SETTal Zussman
UFFD_CLOEXEC, UFFD_NONBLOCK, and UFFD_FLAGS_SET have been unused since they were added in commit 932b18e0aec6 ("userfaultfd: linux/userfaultfd_k.h"). Remove them and the associated BUILD_BUG_ON() checks. Link: https://lkml.kernel.org/r/20250619-uffd-fixes-v3-4-a7274d3bd5e4@columbia.edu Signed-off-by: Tal Zussman <tz2294@columbia.edu> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Peter Xu <peterx@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09userfaultfd: correctly prevent registering VM_DROPPABLE regionsTal Zussman
Patch series "mm: userfaultfd: assorted fixes and cleanups", v3. Two fixes and two cleanups for userfaultfd. Note that the third patch yields a small change in the ABI, but we seem to have concluded that it is acceptable in this case. This patch (of 4): vma_can_userfault() masks off non-userfaultfd VM flags from vm_flags. The vm_flags & VM_DROPPABLE test will then always be false, incorrectly allowing VM_DROPPABLE regions to be registered with userfaultfd. Additionally, vm_flags is not guaranteed to correspond to the actual VMA's flags. Fix this test by checking the VMA's flags directly. Link: https://lkml.kernel.org/r/20250619-uffd-fixes-v3-0-a7274d3bd5e4@columbia.edu Link: https://lore.kernel.org/linux-mm/5a875a3a-2243-4eab-856f-bc53ccfec3ea@redhat.com/ Link: https://lkml.kernel.org/r/20250619-uffd-fixes-v3-1-a7274d3bd5e4@columbia.edu Fixes: 9651fcedf7b9 ("mm: add MAP_DROPPABLE for designating always lazily freeable mappings") Signed-off-by: Tal Zussman <tz2294@columbia.edu> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Peter Xu <peterx@redhat.com> Acked-by: Jason A. Donenfeld <Jason@zx2c4.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09drivers/base/node: rename __register_one_node() to register_one_node()Donet Tom
The register_one_node() function was a simple wrapper around __register_one_node(). To simplify the code, register_one_node() has been removed, and __register_one_node() has been renamed to register_one_node(). Link: https://lkml.kernel.org/r/8262cd0f44eeb048a1fcd3ac8382760d7f7dea60.1748452242.git.donettom@linux.ibm.com Signed-off-by: Donet Tom <donettom@linux.ibm.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09drivers/base/node: rename register_memory_blocks_under_node() and remove ↵Donet Tom
context argument The function register_memory_blocks_under_node() is now only called from the memory hotplug path, as register_memory_blocks_under_node_early() handles registration during early boot. Therefore, the context argument used to differentiate between early boot and hotplug is no longer needed and was removed. Since the function is only called from the hotplug path, we renamed register_memory_blocks_under_node() to register_memory_blocks_under_node_hotplug() Link: https://lkml.kernel.org/r/907c22292b0ee4975107876efc875c75c11badd9.1748452242.git.donettom@linux.ibm.com Signed-off-by: Donet Tom <donettom@linux.ibm.com> Acked-by: Oscar Salvador <osalvador@suse.de> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09drivers/base/node: remove register_memory_blocks_under_node() function call ↵Donet Tom
from register_one_node register_one_node() is now only called via cpu_up() → __try_online_node() during CPU hotplug operations to online a node. At this stage, the node has not yet had any memory added. As a result, there are no memory blocks to walk or register, so calling register_memory_blocks_under_node() is unnecessary. Therefore, the call to register_memory_blocks_under_node() has been removed from register_one_node(). Link: https://lkml.kernel.org/r/ecf07075b1a41015fcf58823997d5c2ed7b8c18f.1748452242.git.donettom@linux.ibm.com Signed-off-by: Donet Tom <donettom@linux.ibm.com> Acked-by: Oscar Salvador <osalvador@suse.de> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09drivers/base/node: optimize memory block registration to reduce boot timeDonet Tom
Patch series "drivers/base/node.c: optimization and cleanups", v7. This patch (of 7) During node device initialization, `memory blocks` are registered under each NUMA node. The `memory blocks` to be registered are identified using the node's start and end PFNs, which are obtained from the node's pg_data However, not all PFNs within this range necessarily belong to the same node—some may belong to other nodes. Additionally, due to the discontiguous nature of physical memory, certain sections within a `memory block` may be absent. As a result, `memory blocks` that fall between a node's start and end PFNs may span across multiple nodes, and some sections within those blocks may be missing. `Memory blocks` have a fixed size, which is architecture dependent. Due to these considerations, the memory block registration is currently performed as follows: for_each_online_node(nid): start_pfn = pgdat->node_start_pfn; end_pfn = pgdat->node_start_pfn + node_spanned_pages; for_each_memory_block_between(PFN_PHYS(start_pfn), PFN_PHYS(end_pfn)) mem_blk = memory_block_id(pfn_to_section_nr(pfn)); pfn_mb_start=section_nr_to_pfn(mem_blk->start_section_nr) pfn_mb_end = pfn_start + memory_block_pfns - 1 for (pfn = pfn_mb_start; pfn < pfn_mb_end; pfn++): if (get_nid_for_pfn(pfn) != nid): continue; else do_register_memory_block_under_node(nid, mem_blk, MEMINIT_EARLY); Here, we derive the start and end PFNs from the node's pg_data, then determine the memory blocks that may belong to the node. For each `memory block` in this range, we inspect all PFNs it contains and check their associated NUMA node ID. If a PFN within the block matches the current node, the memory block is registered under that node. If CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, get_nid_for_pfn() performs a binary search in the `memblock regions` to determine the NUMA node ID for a given PFN. If it is not enabled, the node ID is retrieved directly from the struct page. On large systems, this process can become time-consuming, especially since we iterate over each `memory block` and all PFNs within it until a match is found. When CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, the additional overhead of the binary search increases the execution time significantly, potentially leading to soft lockups during boot. In this patch, we iterate over `memblock region` to identify the `memory blocks` that belong to the current NUMA node. `memblock regions` are contiguous memory ranges, each associated with a single NUMA node, and they do not span across multiple nodes. for_each_memory_region(r): // r => region if (!node_online(r->nid)): continue; else for_each_memory_block_between(r->base, r->base + r->size - 1): do_register_memory_block_under_node(r->nid, mem_blk, MEMINIT_EARLY); We iterate over all memblock regions, and if the node associated with the region is online, we calculate the start and end memory blocks based on the region's start and end PFNs. We then register all the memory blocks within that range under the region node. Test Results on My system with 32TB RAM ======================================= 1. Boot time with CONFIG_DEFERRED_STRUCT_PAGE_INIT enabled. Without this patch ------------------ Startup finished in 1min 16.528s (kernel) With this patch --------------- Startup finished in 17.236s (kernel) - 78% Improvement 2. Boot time with CONFIG_DEFERRED_STRUCT_PAGE_INIT disabled. Without this patch ------------------ Startup finished in 28.320s (kernel) With this patch --------------- Startup finished in 15.621s (kernel) - 46% Improvement [donettom@linux.ibm.com: restore removed extra line] Link: https://lkml.kernel.org/r/20250609140354.467908-1-donettom@linux.ibm.com Link: https://lkml.kernel.org/r/2a0a05c2dffc62a742bf1dd030098be4ce99be28.1748452241.git.donettom@linux.ibm.com Link: https://lkml.kernel.org/r/2a0a05c2dffc62a742bf1dd030098be4ce99be28.1748452241.git.donettom@linux.ibm.com Signed-off-by: Donet Tom <donettom@linux.ibm.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Oscar Salvador <osalvador@suse.de> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm: remove the for_reclaim field from struct writeback_controlChristoph Hellwig
This field is now only set to one in the i915 gem code that only calls writeback_iter on it, which ignores the flag. All other checks are thuse dead code and the field can be removed. Link: https://lkml.kernel.org/r/20250610054959.2057526-7-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm: stop passing a writeback_control structure to swap_writeoutChristoph Hellwig
swap_writeout only needs the swap_iocb cookie from the writeback_control structure, so pass it explicitly. Link: https://lkml.kernel.org/r/20250610054959.2057526-6-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm: stop passing a writeback_control structure to shmem_writeoutChristoph Hellwig
shmem_writeout only needs the swap_iocb cookie and the split folio list. Pass those explicitly and remove the now unused list member from struct writeback_control. Link: https://lkml.kernel.org/r/20250610054959.2057526-3-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm: rename CONFIG_PAGE_BLOCK_ORDER to CONFIG_PAGE_BLOCK_MAX_ORDERZi Yan
The config is in fact an additional upper limit of pageblock_order, so rename it to avoid confusion. Link: https://lkml.kernel.org/r/20250604211427.1590859-1-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Juan Yescas <jyescas@google.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: "Isaac J. Manjarres" <isaacmanjarres@google.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: T.J. Mercier <tjmercier@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm/gup: remove (VM_)BUG_ONsDavid Hildenbrand
Especially once we hit one of the assertions in sanity_check_pinned_pages(), observing follow-up assertions failing in other code can give good clues about what went wrong, so use VM_WARN_ON_ONCE instead. While at it, let's just convert all VM_BUG_ON to VM_WARN_ON_ONCE as well. Add one comment for the pfn_valid() check. We have to introduce VM_WARN_ON_ONCE_VMA() to make that fly. Drop the BUG_ON after mmap_read_lock_killable(), if that ever returns something > 0 we're in bigger trouble. Convert the other BUG_ON's into VM_WARN_ON_ONCE as well, they are in a similar domain "should never happen", but more reasonable to check for during early testing. [david@redhat.com: use the _FOLIO variant where possible, per Lorenzo] Link: https://lkml.kernel.org/r/844bd929-a551-48e3-a12e-285cd65ba580@redhat.com Link: https://lkml.kernel.org/r/20250604140544.688711-1-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: SeongJae Park <sj@kernel.org> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm: prevent KSM from breaking VMA merging for new VMAsLorenzo Stoakes
If a user wishes to enable KSM mergeability for an entire process and all fork/exec'd processes that come after it, they use the prctl() PR_SET_MEMORY_MERGE operation. This defaults all newly mapped VMAs to have the VM_MERGEABLE VMA flag set (in order to indicate they are KSM mergeable), as well as setting this flag for all existing VMAs and propagating this across fork/exec. However it also breaks VMA merging for new VMAs, both in the process and all forked (and fork/exec'd) child processes. This is because when a new mapping is proposed, the flags specified will never have VM_MERGEABLE set. However all adjacent VMAs will already have VM_MERGEABLE set, rendering VMAs unmergeable by default. To work around this, we try to set the VM_MERGEABLE flag prior to attempting a merge. In the case of brk() this can always be done. However on mmap() things are more complicated - while KSM is not supported for MAP_SHARED file-backed mappings, it is supported for MAP_PRIVATE file-backed mappings. These mappings may have deprecated .mmap() callbacks specified which could, in theory, adjust flags and thus KSM eligibility. So we check to determine whether this is possible. If not, we set VM_MERGEABLE prior to the merge attempt on mmap(), otherwise we retain the previous behaviour. This fixes VMA merging for all new anonymous mappings, which covers the majority of real-world cases, so we should see a significant improvement in VMA mergeability. For MAP_PRIVATE file-backed mappings, those which implement the .mmap_prepare() hook and shmem are both known to be safe, so we allow these, disallowing all other cases. Also add stubs for newly introduced function invocations to VMA userland testing. [lorenzo.stoakes@oracle.com: correctly invoke late KSM check after mmap hook] Link: https://lkml.kernel.org/r/5861f8f6-cf5a-4d82-a062-139fb3f9cddb@lucifer.local Link: https://lkml.kernel.org/r/3ba660af716d87a18ca5b4e635f2101edeb56340.1748537921.git.lorenzo.stoakes@oracle.com Fixes: d7597f59d1d3 ("mm: add new api to enable ksm per process") # please no backport! Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Xu Xin <xu.xin16@zte.com.cn> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Stefan Roesch <shr@devkernel.io> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm: ksm: have KSM VMA checks not require a VMA pointerLorenzo Stoakes
Patch series "mm: ksm: prevent KSM from breaking merging of new VMAs", v3. When KSM-by-default is established using prctl(PR_SET_MEMORY_MERGE), this defaults all newly mapped VMAs to having VM_MERGEABLE set, and thus makes them available to KSM for samepage merging. It also sets VM_MERGEABLE in all existing VMAs. However this causes an issue upon mapping of new VMAs - the initial flags will never have VM_MERGEABLE set when attempting a merge with adjacent VMAs (this is set later in the mmap() logic), and adjacent VMAs will ALWAYS have VM_MERGEABLE set. This renders all newly mapped VMAs unmergeable. To avoid this, this series performs the check for PR_SET_MEMORY_MERGE far earlier in the mmap() logic, prior to the merge being attempted. However we run into complexity with the depreciated .mmap() callback - if a driver hooks this, it might change flags which adjust KSM merge eligibility. We have to worry about this because, while KSM is only applicable to private mappings, this includes both anonymous and MAP_PRIVATE-mapped file-backed mappings. This isn't a problem for brk(), where the VMA must be anonymous. However in mmap() we must be conservative - if the VMA is anonymous then we can always proceed, however if not, we permit only shmem mappings (whose .mmap hook does not affect KSM eligibility) and drivers which implement .mmap_prepare() (invoked prior to the KSM eligibility check). If we can't be sure of the driver changing things, then we maintain the same behaviour of performing the KSM check later in the mmap() logic (and thus losing new VMA mergeability). A great many use-cases for this logic will use anonymous mappings any rate, so this change should already cover the majority of actual KSM use-cases. This patch (of 4): In subsequent commits we are going to determine KSM eligibility prior to a VMA being constructed, at which point we will of course not yet have access to a VMA pointer. It is trivial to boil down the check logic to be parameterised on mm_struct, file and VMA flags, so do so. As a part of this change, additionally expose and use file_is_dax() to determine whether a file is being mapped under a DAX inode. Link: https://lkml.kernel.org/r/cover.1748537921.git.lorenzo.stoakes@oracle.com Link: https://lkml.kernel.org/r/36ad13eb50cdbd8aac6dcfba22c65d5031667295.1748537921.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Xu Xin <xu.xin16@zte.com.cn> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Stefan Roesch <shr@devkernel.io> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09Revert "sched/numa: add statistics of numa balance task"Chen Yu
This reverts commit ad6b26b6a0a79166b53209df2ca1cf8636296382. This commit introduces per-memcg/task NUMA balance statistics, but unfortunately it introduced a NULL pointer exception due to the following race condition: After a swap task candidate was chosen, its mm_struct pointer was set to NULL due to task exit. Later, when performing the actual task swapping, the p->mm caused the problem. CPU0 CPU1 : ... task_numa_migrate task_numa_find_cpu task_numa_compare # a normal task p is chosen env->best_task = p # p exit: exit_signals(p); p->flags |= PF_EXITING exit_mm p->mm = NULL; migrate_swap_stop __migrate_swap_task((arg->src_task, arg->dst_cpu) count_memcg_event_mm(p->mm, NUMA_TASK_SWAP)# p->mm is NULL task_lock() should be held and the PF_EXITING flag needs to be checked to prevent this from happening. After discussion, the conclusion was that adding a lock is not worthwhile for some statistics calculations. Revert the change and rely on the tracepoint for this purpose. Link: https://lkml.kernel.org/r/20250704135620.685752-1-yu.c.chen@intel.com Link: https://lkml.kernel.org/r/20250708064917.BBD13C4CEED@smtp.kernel.org Fixes: ad6b26b6a0a7 ("sched/numa: add statistics of numa balance task") Signed-off-by: Chen Yu <yu.c.chen@intel.com> Reported-by: Jirka Hladky <jhladky@redhat.com> Closes: https://lore.kernel.org/all/CAE4VaGBLJxpd=NeRJXpSCuw=REhC5LWJpC29kDy-Zh2ZDyzQZA@mail.gmail.com/ Reported-by: Srikanth Aithal <Srikanth.Aithal@amd.com> Reported-by: Suneeth D <Suneeth.D@amd.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Hladky <jhladky@redhat.com> Cc: Libo Chen <libo.chen@oracle.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm: fix the inaccurate memory statistics issue for usersBaolin Wang
On some large machines with a high number of CPUs running a 64K pagesize kernel, we found that the 'RES' field is always 0 displayed by the top command for some processes, which will cause a lot of confusion for users. PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 875525 root 20 0 12480 0 0 R 0.3 0.0 0:00.08 top 1 root 20 0 172800 0 0 S 0.0 0.0 0:04.52 systemd The main reason is that the batch size of the percpu counter is quite large on these machines, caching a significant percpu value, since converting mm's rss stats into percpu_counter by commit f1a7941243c1 ("mm: convert mm's rss stats into percpu_counter"). Intuitively, the batch number should be optimized, but on some paths, performance may take precedence over statistical accuracy. Therefore, introducing a new interface to add the percpu statistical count and display it to users, which can remove the confusion. In addition, this change is not expected to be on a performance-critical path, so the modification should be acceptable. In addition, the 'mm->rss_stat' is updated by using add_mm_counter() and dec/inc_mm_counter(), which are all wrappers around percpu_counter_add_batch(). In percpu_counter_add_batch(), there is percpu batch caching to avoid 'fbc->lock' contention. This patch changes task_mem() and task_statm() to get the accurate mm counters under the 'fbc->lock', but this should not exacerbate kernel 'mm->rss_stat' lock contention due to the percpu batch caching of the mm counters. The following test also confirm the theoretical analysis. I run the stress-ng that stresses anon page faults in 32 threads on my 32 cores machine, while simultaneously running a script that starts 32 threads to busy-loop pread each stress-ng thread's /proc/pid/status interface. From the following data, I did not observe any obvious impact of this patch on the stress-ng tests. w/o patch: stress-ng: info: [6848] 4,399,219,085,152 CPU Cycles 67.327 B/sec stress-ng: info: [6848] 1,616,524,844,832 Instructions 24.740 B/sec (0.367 instr. per cycle) stress-ng: info: [6848] 39,529,792 Page Faults Total 0.605 M/sec stress-ng: info: [6848] 39,529,792 Page Faults Minor 0.605 M/sec w/patch: stress-ng: info: [2485] 4,462,440,381,856 CPU Cycles 68.382 B/sec stress-ng: info: [2485] 1,615,101,503,296 Instructions 24.750 B/sec (0.362 instr. per cycle) stress-ng: info: [2485] 39,439,232 Page Faults Total 0.604 M/sec stress-ng: info: [2485] 39,439,232 Page Faults Minor 0.604 M/sec On comparing a very simple app which just allocates & touches some memory against v6.1 (which doesn't have f1a7941243c1) and latest Linus tree (4c06e63b9203) I can see that on latest Linus tree the values for VmRSS, RssAnon and RssFile from /proc/self/status are all zeroes while they do report values on v6.1 and a Linus tree with this patch. Link: https://lkml.kernel.org/r/f4586b17f66f97c174f7fd1f8647374fdb53de1c.1749119050.git.baolin.wang@linux.alibaba.com Fixes: f1a7941243c1 ("mm: convert mm's rss stats into percpu_counter") Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Aboorva Devarajan <aboorvad@linux.ibm.com> Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com> Tested-by Donet Tom <donettom@linux.ibm.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: SeongJae Park <sj@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09net: phy: bcm5481x: MII-Lite activationKamil Horák - 2N
Broadcom PHYs featuring the BroadR-Reach two-wire link mode are usually capable to operate in simplified MII mode, without TXER, RXER, CRS and COL signals as defined for the MII. The absence of COL signal makes half-duplex link modes impossible, however, the BroadR-Reach modes are all full-duplex only. Depending on the IC encapsulation, there exist MII-Lite-only PHYs such as bcm54811 in MLP. The PHY itself is hardware-strapped to select among multiple RGMII and MII-Lite modes, but the MII-Lite mode must be also activated by software. Add MII-Lite activation for bcm5481x PHYs. Signed-off-by: Kamil Horák - 2N <kamilh@axis.com> Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://patch.msgid.link/20250708090140.61355-4-kamilh@axis.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09net: phy: MII-Lite PHY interface modeKamil Horák - 2N
Some Broadcom PHYs are capable to operate in simplified MII mode, without TXER, RXER, CRS and COL signals as defined for the MII. The MII-Lite mode can be used on most Ethernet controllers with full MII interface by just leaving the input signals (RXER, CRS, COL) inactive. The absence of COL signal makes half-duplex link modes impossible but does not interfere with BroadR-Reach link modes on Broadcom PHYs, because they are all full-duplex only. Add MII-Lite interface mode, especially for Broadcom two-wire PHYs. Signed-off-by: Kamil Horák - 2N <kamilh@axis.com> Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://patch.msgid.link/20250708090140.61355-2-kamilh@axis.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09rv: Allow to configure the number of per-task monitorNam Cao
Now that there are 2 monitors for real-time applications, users may want to enable both of them simultaneously. Make the number of per-task monitor configurable. Default it to 2 for now. Cc: John Ogness <john.ogness@linutronix.de> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/93e83313fc4ba7f6e66f4abe80ca5f5494d658d0.1752088709.git.namcao@linutronix.de Reviewed-by: Gabriele Monaco <gmonaco@redhat.com> Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-09rv: Add support for LTL monitorsNam Cao
While attempting to implement DA monitors for some complex specifications, deterministic automaton is found to be inappropriate as the specification language. The automaton is complicated, hard to understand, and error-prone. For these cases, linear temporal logic is more suitable as the specification language. Add support for linear temporal logic runtime verification monitor. Cc: John Ogness <john.ogness@linutronix.de> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Gabriele Monaco <gmonaco@redhat.com> Link: https://lore.kernel.org/d366c1fed60ed4e8f6451f3c15a99755f2740b5f.1752088709.git.namcao@linutronix.de Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-09rv: Let the reactors take care of buffersNam Cao
Each RV monitor has one static buffer to send to the reactors. If multiple errors are detected simultaneously, the one buffer could be overwritten. Instead, leave it to the reactors to handle buffering. Reviewed-by: Gabriele Monaco <gmonaco@redhat.com> Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-09panic: Add vpanic()Nam Cao
vpanic() is useful for implementing runtime verification reactors. Add it. Signed-off-by: Nam Cao <namcao@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-09printk: Make vprintk_deferred() publicNam Cao
vprintk_deferred() is useful for implementing runtime verification reactors. Make it public. Signed-off-by: Nam Cao <namcao@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-09sched/fair: Fix NO_RUN_TO_PARITY caseVincent Guittot
EEVDF expects the scheduler to allocate a time quantum to the selected entity and then pick a new entity for next quantum. Although this notion of time quantum is not strictly doable in our case, we can ensure a minimum runtime for each task most of the time and pick a new entity after a minimum time has elapsed. Reuse the slice protection of run to parity to ensure such runtime quantum. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250708165630.1948751-3-vincent.guittot@linaro.org
2025-07-09sched/deadline: Less agressive dl_server handlingPeter Zijlstra
Chris reported that commit 5f6bd380c7bd ("sched/rt: Remove default bandwidth control") caused a significant dip in his favourite benchmark of the day. Simply disabling dl_server cured things. His workload hammers the 0->1, 1->0 transitions, and the dl_server_{start,stop}() overhead kills it -- fairly obviously a bad idea in hind sight and all that. Change things around to only disable the dl_server when there has not been a fair task around for a whole period. Since the default period is 1 second, this ensures the benchmark never trips this, overhead gone. Fixes: 557a6bfc662c ("sched/fair: Add trivial fair server") Reported-by: Chris Mason <clm@meta.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Acked-by: Juri Lelli <juri.lelli@redhat.com> Link: https://lkml.kernel.org/r/20250702121158.465086194@infradead.org
2025-07-09sched/psi: Optimize psi_group_change() cpu_clock() usagePeter Zijlstra
Dietmar reported that commit 3840cbe24cf0 ("sched: psi: fix bogus pressure spikes from aggregation race") caused a regression for him on a high context switch rate benchmark (schbench) due to the now repeating cpu_clock() calls. In particular the problem is that get_recent_times() will extrapolate the current state to 'now'. But if an update uses a timestamp from before the start of the update, it is possible to get two reads with inconsistent results. It is effectively back-dating an update. (note that this all hard-relies on the clock being synchronized across CPUs -- if this is not the case, all bets are off). Combine this problem with the fact that there are per-group-per-cpu seqcounts, the commit in question pushed the clock read into the group iteration, causing tree-depth cpu_clock() calls. On architectures where cpu_clock() has appreciable overhead, this hurts. Instead move to a per-cpu seqcount, which allows us to have a single clock read for all group updates, increasing internal consistency and lowering update overhead. This comes at the cost of a longer update side (proportional to the tree depth) which can cause the read side to retry more often. Fixes: 3840cbe24cf0 ("sched: psi: fix bogus pressure spikes from aggregation race") Reported-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>, Link: https://lkml.kernel.org/20250522084844.GC31726@noisy.programming.kicks-ass.net
2025-07-09debugfs_get_aux(): allow storing non-const void *Al Viro
typechecking is up to users, anyway... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/r/20250702212616.GI3406663@ZenIV Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-07-09debugfs: split short and full proxy wrappers, kill debugfs_real_fops()Al Viro
All users outside of fs/debugfs/file.c are gone, in there we can just fully split the wrappers for full and short cases and be done with that. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/r/20250702212419.GG3406663@ZenIV Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-07-09pmdomain: core: Leave powered-on genpds on until late_initcall_syncUlf Hansson
Powering-off a genpd that was on during boot, before all of its consumer devices have been probed, is certainly prone to problems. As a step to improve this situation, let's prevent these genpds from being powered-off until genpd_power_off_unused() gets called, which is a late_initcall_sync(). Note that, this still doesn't guarantee that all the consumer devices has been probed before we allow to power-off the genpds. Yet, this should be a step in the right direction. Suggested-by: Saravana Kannan <saravanak@google.com> Tested-by: Hiago De Franco <hiago.franco@toradex.com> # Colibri iMX8X Tested-by: Tomi Valkeinen <tomi.valkeinen@ideasonboard.com> # TI AM62A,Xilinx ZynqMP ZCU106 Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org> Link: https://lore.kernel.org/r/20250701114733.636510-22-ulf.hansson@linaro.org
2025-07-09driver core: Add dev_set_drv_sync_state()Saravana Kannan
This can be used by frameworks to set the sync_state() helper functions for drivers that don't already have them set. Signed-off-by: Saravana Kannan <saravanak@google.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Abel Vesa <abel.vesa@linaro.org> Tested-by: Hiago De Franco <hiago.franco@toradex.com> # Colibri iMX8X Tested-by: Tomi Valkeinen <tomi.valkeinen@ideasonboard.com> # TI AM62A,Xilinx ZynqMP ZCU106 Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org> Link: https://lore.kernel.org/r/20250701114733.636510-20-ulf.hansson@linaro.org
2025-07-09pmdomain: core: Add common ->sync_state() support for genpd providersUlf Hansson
If the genpd provider's fwnode doesn't have an associated struct device with it, we can make use of the generic genpd->dev and it corresponding driver internally in genpd to manage ->sync_state(). More precisely, while adding a genpd OF provider let's check if the fwnode has a device and if not, make the preparation to handle ->sync_state() internally through the genpd_provider_driver and the genpd_provider_bus. Note that, genpd providers may opt out from this behaviour by setting the GENPD_FLAG_NO_SYNC_STATE config options for the genpds in question. Suggested-by: Saravana Kannan <saravanak@google.com> Tested-by: Hiago De Franco <hiago.franco@toradex.com> # Colibri iMX8X Tested-by: Tomi Valkeinen <tomi.valkeinen@ideasonboard.com> # TI AM62A,Xilinx ZynqMP ZCU106 Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org> Link: https://lore.kernel.org/r/20250701114733.636510-19-ulf.hansson@linaro.org
2025-07-09driver core: Export get_dev_from_fwnode()Ulf Hansson
It has turned out get_dev_from_fwnode() is useful at a few other places outside of the driver core, as in gpiolib.c for example. Therefore let's make it available as a common helper function. Suggested-by: Saravana Kannan <saravanak@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Tested-by: Hiago De Franco <hiago.franco@toradex.com> # Colibri iMX8X Tested-by: Tomi Valkeinen <tomi.valkeinen@ideasonboard.com> # TI AM62A,Xilinx ZynqMP ZCU106 Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org> Link: https://lore.kernel.org/r/20250701114733.636510-18-ulf.hansson@linaro.org
2025-07-09firmware: xilinx: Don't share zynqmp_pm_init_finalize()Ulf Hansson
As there no longer any users outside the zynqmp firmware driver of zynqmp_pm_init_finalize(), let's turn into a local static function. Cc: Michal Simek <michal.simek@amd.com> Tested-by: Hiago De Franco <hiago.franco@toradex.com> # Colibri iMX8X Tested-by: Tomi Valkeinen <tomi.valkeinen@ideasonboard.com> # TI AM62A,Xilinx ZynqMP ZCU106 Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org> Link: https://lore.kernel.org/r/20250701114733.636510-16-ulf.hansson@linaro.org
2025-07-09pmdomain: core: Prepare to add the common ->sync_state() supportUlf Hansson
Before we can implement the common ->sync_state() support in genpd, we need to allow a few specific genpd providers to opt out from the new behaviour. Let's introduce GENPD_FLAG_NO_SYNC_STATE as a new genpd config option, to allow providers to opt out. Suggested-by: Saravana Kannan <saravanak@google.com> Tested-by: Hiago De Franco <hiago.franco@toradex.com> # Colibri iMX8X Tested-by: Tomi Valkeinen <tomi.valkeinen@ideasonboard.com> # TI AM62A,Xilinx ZynqMP ZCU106 Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org> Link: https://lore.kernel.org/r/20250701114733.636510-9-ulf.hansson@linaro.org
2025-07-09pmdomain: core: Export a common ->sync_state() helper for genpd providersUlf Hansson
In some cases the typical platform driver that act as genpd provider, may need its own ->sync_state() callback to manage various things. In this regards, the provider most likely wants to allow its corresponding genpds to be powered-off. For this reason, let's introduce a new genpd helper function, of_genpd_sync_state() that helps genpd provider drivers to achieve this. Suggested-by: Saravana Kannan <saravanak@google.com> Reviewed-by: Abel Vesa <abel.vesa@linaro.org> Tested-by: Hiago De Franco <hiago.franco@toradex.com> # Colibri iMX8X Tested-by: Tomi Valkeinen <tomi.valkeinen@ideasonboard.com> # TI AM62A,Xilinx ZynqMP ZCU106 Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org> Link: https://lore.kernel.org/r/20250701114733.636510-8-ulf.hansson@linaro.org
2025-07-09usb: core: add dma-noncoherent buffer alloc and free APIXu Yang
This will add usb_alloc_noncoherent() and usb_free_noncoherent() functions to support alloc and free buffer in a dma-noncoherent way. To explicit manage the memory ownership for the kernel and device, this will also add usb_dma_noncoherent_sync_for_cpu/device() functions and call it at proper time. The management requires the user save sg_table returned by usb_alloc_noncoherent() to urb->sgt. Signed-off-by: Xu Yang <xu.yang_2@nxp.com> Reviewed-by: Alan Stern <stern@rowland.harvard.edu> Link: https://lore.kernel.org/r/20250704095751.73765-2-xu.yang_2@nxp.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-07-09wifi: cfg80211/mac80211: implement dot11ExtendedRegInfoSupportSomashekhar Puttagangaiah
Implement dot11ExtendedRegInfoSupport to advertise non-AP station regulatory power capability as part of regulatory connectivity element in (Re)Association request frames so that AP can achieve maximum client connectivity. Control field which was interpreted using value of 3-bits B5 to B3, now uses value of 4-bits B6 to B3 to interpret the type of AP. Hence update IEEE80211_HE_6GHZ_OPER_CTRL_REG_INFO to parse 4-bits control field. If older AP still updates only 3-bits value of control field, station can still interpret the value as per section E.2.7 of IEEE 802.11 REVme D7.0 and support the appropriate AP type. Also update IEEE80211_6GHZ_CTRL_REG_INDOOR_SP_AP as the value of standard power AP is changed to 8 instead of 4 so that AP can support both LPI AP and SP AP to maximize the connectivity with stations. For backward compatibility, keeping value 4 as old AP by limiting it to SP AP only. Signed-off-by: Somashekhar Puttagangaiah <somashekhar.puttagangaiah@intel.com> Reviewed-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Link: https://patch.msgid.link/20250609213232.90cdef116aad.I85da390fbee59355e3855691933e6a5e55c47ac4@changeid [fix kernel-doc] Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-07-09wifi: cfg80211: only verify part of Extended MLD CapabilitiesBenjamin Berg
We verify that the Extended MLD Capabilities are matching between links. However, some bits are reserved and in particular the Recommended Max Links subfield may not necessarily match. So only verify the known subfields that can reliably be expected to be the same. More information can be found in Table 9-417o, in IEEE P802.11be/D7.0. Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Reviewed-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Link: https://patch.msgid.link/20250609213231.a2fad48dd3e6.Iae1740cd2ac833bc4a64fd2af718e1485158fd42@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-07-09Merge v6.16-rc2 into timers/ptpThomas Gleixner
to pick up the __GENMASK() fix, otherwise the AUX clock VDSO patches fail to compile for compat. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>