Age | Commit message | Author |
|
Patch series "use vm_flags_t consistently".
The VMA flags field vma->vm_flags is of type vm_flags_t. Right now this
is exactly equivalent to unsigned long, but it should not be assumed to
be.
Much code that references vma->vm_flags already correctly uses vm_flags_t,
but a fairly large chunk of code simply uses unsigned long and assumes
that the two are equivalent.
This series corrects that and has us use vm_flags_t consistently.
This series is motivated by the desire to, in a future series, adjust
vm_flags_t to be a u64 regardless of whether the kernel is 32-bit or
64-bit in order to deal with the VMA flag exhaustion issue and avoid all
the various problems that arise from it (being unable to use certain
features in 32-bit, being unable to add new flags except for 64-bit, etc.)
This is therefore a critical first step towards that goal. At any rate,
using the correct type is of value regardless.
We additionally take the opportunity to refer to VMA flags as vm_flags
where possible to make clear what we're referring to.
Overall, this series does not introduce any functional change.
This patch (of 3):
We abstract the type of the VMA flags to vm_flags_t, however in many
places it is simply assumed that this is unsigned long, which is
incorrect.
At the moment this is simply an incongruity, however in future we plan to
change this type and therefore this change is a critical requirement for
doing so.
Overall, this patch does not introduce any functional change.
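As a rough illustration of the kind of change involved (the helper below
is hypothetical, not one touched by the series), code passing VMA flags
around should be written against vm_flags_t rather than unsigned long:

#include <linux/mm.h>

/* Before: bakes in the current representation of the flags field. */
static bool is_locked_mapping_old(unsigned long flags)
{
	return flags & VM_LOCKED;
}

/* After: written against the abstract type, so a future switch of
 * vm_flags_t to u64 on 32-bit kernels needs no further change here. */
static bool is_locked_mapping(vm_flags_t flags)
{
	return flags & VM_LOCKED;
}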
[lorenzo.stoakes@oracle.com: add missing vm_get_page_prot() instance, remove include]
Link: https://lkml.kernel.org/r/552f88e1-2df8-4e95-92b8-812f7c8db829@lucifer.local
Link: https://lkml.kernel.org/r/cover.1750274467.git.lorenzo.stoakes@oracle.com
Link: https://lkml.kernel.org/r/a12769720a2743f235643b158c4f4f0a9911daf0.1750274467.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Acked-by: Catalin Marinas <catalin.marinas@arm.com> [arm64]
Acked-by: Zi Yan <ziy@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Jann Horn <jannh@google.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Call folio_address() instead of page_address().
Link: https://lkml.kernel.org/r/20250613194825.3175276-1-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Marking PUDs that map "normal" refcounted folios as special is against
our rules documented for vm_normal_page(): normal (refcounted) folios
shall never have their page table mappings marked as special.
Fortunately, there are not that many pud_special() checks that can be
misled, and they are right now rather harmless: e.g., none so far bases
the decision whether to grab a folio reference on it. Well, and
GUP-fast will fall back to GUP-slow. All in all, no big implications so
far, as it seems.
Getting this right will become more important as we introduce
folio_normal_page_pud() and start using it in more places where we
currently special-case based on other VMA flags.
Fix it just like we fixed vmf_insert_folio_pmd().
Add folio_mk_pud() to mimic what we do with folio_mk_pmd().
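For reference, folio_mk_pmd() builds a huge PMD from the folio's PFN; a
folio_mk_pud() counterpart would plausibly look like the sketch below
(illustrative only, the in-tree helper may differ in detail):

#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
/* Sketch: build a huge PUD for a normal folio, notably without marking
 * the entry special, mirroring folio_mk_pmd() at the PMD level. */
static inline pud_t folio_mk_pud(struct folio *folio, pgprot_t pgprot)
{
	return pud_mkhuge(pfn_pud(folio_pfn(folio), pgprot));
}
#endif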
Link: https://lkml.kernel.org/r/20250613092702.1943533-4-david@redhat.com
Fixes: dbe54153296d ("mm/huge_memory: add vmf_insert_folio_pud()")
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Mariano Pache <npache@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
All users have now been converted to either memzero_page() or
folio_zero_range().
Link: https://lkml.kernel.org/r/20250612143443.2848197-6-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Alex Markuze <amarkuze@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Xiubo Li <xiubli@redhat.com>
Cc: Dan Carpenter <dan.carpenter@linaro.org>
Cc: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
MADV_FREE is another option, besides MADV_DONTNEED, for dynamic memory
freeing in user-space native or Java heap memory management. For example,
jemalloc can be configured to use MADV_FREE, and recent versions of the
Android Java heap have also increasingly adopted MADV_FREE. Supporting
per-VMA locking for MADV_FREE thus appears increasingly necessary.
We have replaced walk_page_range() with walk_page_range_vma(). Along
with the madvise_lock_mode proposed by Lorenzo, the necessary
infrastructure is now in place to begin exploring per-VMA locking
support for MADV_FREE and potentially other madvise operations that use
walk_page_range_vma().
This patch adds support for the PGWALK_VMA_RDLOCK walk_lock mode in
walk_page_range_vma(), and leverages madvise_lock_mode from madv_behavior
to select the appropriate walk_lock—either mmap_lock or per-VMA
lock—based on the context.
Because we now dynamically update the walk_ops->walk_lock field, we must
ensure this is thread-safe. The madvise_free_walk_ops is now defined as a
stack variable instead of a global constant.
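A rough sketch of that shape (names follow mm/madvise.c but are
illustrative assumptions, not the exact hunk):

static int madvise_free_range_sketch(struct madvise_behavior *madv_behavior,
				     struct vm_area_struct *vma,
				     unsigned long start, unsigned long end,
				     void *private)
{
	/* Stack copy: walk_lock can be chosen per call without racing
	 * other walkers, which a writable global ops struct could not. */
	struct mm_walk_ops walk_ops = {
		.pmd_entry = madvise_free_pte_range,
		.walk_lock = madv_behavior->lock_mode == MADVISE_VMA_READ_LOCK ?
			     PGWALK_VMA_RDLOCK : PGWALK_RDLOCK,
	};

	return walk_page_range_vma(vma, start, end, &walk_ops, private);
}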
Link: https://lkml.kernel.org/r/20250611104745.57405-1-21cnbao@gmail.com
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: SeongJae Park <sj@kernel.org>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Tangquan Zheng <zhengtangquan@oppo.com>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
In mt_perf_to_adistance(), the calculation of abstract distance (adist)
involves multiplying several int values including
MEMTIER_ADISTANCE_DRAM.
  *adist = MEMTIER_ADISTANCE_DRAM *
           (perf->read_latency + perf->write_latency) /
           (default_dram_perf.read_latency + default_dram_perf.write_latency) *
           (default_dram_perf.read_bandwidth + default_dram_perf.write_bandwidth) /
           (perf->read_bandwidth + perf->write_bandwidth);
Since these values can be large, the multiplication may exceed the
maximum value of an int (INT_MAX) and overflow (as observed on our
platform), leading to an incorrect adist.
User-visible impact:
The memory tiering subsystem will misinterpret slow memory (like CXL)
as faster than DRAM, causing inappropriate demotion of pages from
CXL (slow memory) to DRAM (fast memory).
For example, we will see the following demotion chains in the dmesg,
where Nodes 0-1 are DRAM and Nodes 2-3 are CXL nodes:
Demotion targets for Node 0: null
Demotion targets for Node 1: null
Demotion targets for Node 2: preferred: 0-1, fallback: 0-1
Demotion targets for Node 3: preferred: 0-1, fallback: 0-1
Change MEMTIER_ADISTANCE_DRAM to be a long constant by writing it with
the 'L' suffix. This prevents the overflow because the multiplication
will then be done in the long type which has a larger range.
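A standalone demonstration of this overflow class (the perf figures are
hypothetical and 512 merely stands in for MEMTIER_ADISTANCE_DRAM; they
are chosen only so the intermediate product exceeds INT_MAX):

/* Build with: cc -o adist-demo adist-demo.c */
#include <limits.h>
#include <stdio.h>

#define ADISTANCE_DRAM 512	/* stand-in for MEMTIER_ADISTANCE_DRAM */

int main(void)
{
	long perf_lat = 2000, dram_lat = 200;		/* ns, made up   */
	long perf_bw = 100000, dram_bw = 800000;	/* MB/s, made up */

	/* Evaluated left to right: 512 * 2000 / 200 * 800000 is
	 * 4,096,000,000, which no longer fits in a 32-bit int.  With the
	 * 'L' suffix on the constant the whole chain stays in long. */
	long intermediate = (long)ADISTANCE_DRAM * perf_lat / dram_lat * dram_bw;
	long adist = intermediate / perf_bw;

	printf("intermediate %ld > INT_MAX %d? %s, adist = %ld\n",
	       intermediate, INT_MAX,
	       intermediate > INT_MAX ? "yes" : "no", adist);
	return 0;
}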
Link: https://lkml.kernel.org/r/20250611023439.2845785-1-lizhijian@fujitsu.com
Link: https://lkml.kernel.org/r/20250610062751.2365436-1-lizhijian@fujitsu.com
Fixes: 3718c02dbd4c ("acpi, hmat: calculate abstract distance with HMAT")
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
Reviewed-by: Huang Ying <ying.huang@linux.alibaba.com>
Acked-by: Balbir Singh <balbirs@nvidia.com>
Reviewed-by: Donet Tom <donettom@linux.ibm.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
A codetag iterator uses an <id, address> pair to guarantee its
validity. But both the id and the address can be reused: there is a
theoretical possibility that, when a module is inserted right after
another module is removed, kmalloc returns an address identical to the
one kfree'd by the previous module and the IDR reuses the recently
removed key.
Add a sequence number to codetag_module and codetag_iterator; the
sequence number is strictly incremented whenever a module is loaded.
An iterator is valid if and only if its sequence number matches the
codetag_module's.
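A minimal sketch of the scheme (field names here are assumptions, not
the exact in-tree ones):

/* Both sides remember a sequence number that is bumped on every module
 * load, so a stale iterator is detected even if the module address and
 * the IDR id both happen to be reused. */
struct codetag_module_sketch {
	struct module	*mod;
	unsigned long	mod_seq;	/* from a global counter at load time */
};

struct codetag_iterator_sketch {
	struct codetag_module_sketch	*cmod;
	unsigned long			mod_seq; /* snapshot at iteration start */
};

static bool codetag_iter_valid(const struct codetag_iterator_sketch *iter)
{
	/* Valid iff no module has been (re)loaded since the snapshot. */
	return iter->cmod && iter->cmod->mod_seq == iter->mod_seq;
}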
Link: https://lkml.kernel.org/r/20250609064200.112639-1-00107082@163.com
Signed-off-by: David Wang <00107082@163.com>
Acked-by: Suren Baghdasaryan <surenb@google.com>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
walk_page_range_novma() is rather confusing - it supports two modes, one
used often, the other used only for debugging.
The first mode is the common case of traversal of kernel page tables,
which is what nearly all callers use this for.
Secondly it provides an unusual debugging interface that allows for the
traversal of page tables in a userland range of memory even for that
memory which is not described by a VMA.
It is far from certain that such page tables should even exist, but
perhaps this is precisely why it is useful as a debugging mechanism.
As a result, this is utilised by ptdump only. Historically, things were
reversed - ptdump was the only user, and other parts of the kernel evolved
to use the kernel page table walking here.
Since we have some complicated and confusing locking rules for the novma
case, it makes sense to separate the two usages into their own functions.
Doing this also provides self-documentation as to the intent of the
caller - are they doing something rather unusual, or are they simply
doing a standard kernel page table walk?
We therefore establish two separate functions - walk_page_range_debug()
for this single usage, and walk_kernel_page_table_range() for general
kernel page table walking.
The walk_page_range_debug() function is currently used to traverse both
userland and kernel mappings, so we maintain this and in the case of
kernel mappings being traversed, we have walk_page_range_debug() invoke
walk_kernel_page_table_range() internally.
We additionally make walk_page_range_debug() internal to mm.
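A hedged sketch of the resulting call sites (the signature below is an
assumption, not quoted from the patch):

/* Kernel page table walk: the common case, now explicitly named. */
static int walk_some_kernel_range(const struct mm_walk_ops *ops, void *priv)
{
	/* The range is arbitrary for the sketch; init_mm's tables are walked. */
	return walk_kernel_page_table_range(VMALLOC_START, VMALLOC_END,
					    ops, NULL, priv);
}

/* The userland-without-VMA traversal stays debug-only: ptdump reaches it
 * via walk_page_range_debug(), now internal to mm/. */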
Link: https://lkml.kernel.org/r/20250605135104.90720-1-lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Barry Song <baohua@kernel.org>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Jann Horn <jannh@google.com>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: WANG Xuerui <kernel@xen0n.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Change the readahead config so that if it is being requested for an
executable mapping, do a synchronous read into a set of folios with an
arch-specified order and in a naturally aligned manner. We no longer
center the read on the faulting page but simply align it down to the
previous natural boundary. Additionally, we don't bother with an
asynchronous part.
On arm64 if memory is physically contiguous and naturally aligned to the
"contpte" size, we can use contpte mappings, which improves utilization of
the TLB. When paired with the "multi-size THP" feature, this works well
to reduce dTLB pressure. However iTLB pressure is still high due to
executable mappings having a low likelihood of being in the required folio
size and mapping alignment, even when the filesystem supports readahead
into large folios (e.g. XFS).
The reason for the low likelihood is that the current readahead algorithm
starts with an order-0 folio and increases the folio order by 2 every time
the readahead mark is hit. But most executable memory tends to be
accessed randomly and so the readahead mark is rarely hit and most
executable folios remain order-0.
So let's special-case the read(ahead) logic for executable mappings. The
trade-off is performance improvement (due to more efficient storage of the
translations in iTLB) vs potential for making reclaim more difficult (due
to the folios being larger so if a part of the folio is hot the whole
thing is considered hot). But executable memory is a small portion of the
overall system memory so I doubt this will even register from a reclaim
perspective.
I've chosen 64K folio size for arm64 which benefits both the 4K and 16K
base page size configs. Crucially the same amount of data is still read
(usually 128K) so I'm not expecting any read amplification issues. I
don't anticipate any write amplification because text is always RO.
Note that the text region of an ELF file could be populated into the page
cache for other reasons than taking a fault in a mmapped area. The most
common case is due to the loader read()ing the header which can be shared
with the beginning of text. So some text will still remain in small
folios, but this simple, best effort change provides good performance
improvements as is.
Confine this special-case approach to the bounds of the VMA. This
prevents wasting memory for any padding that might exist in the file
between sections. Previously the padding would have been contained in
order-0 folios and would be easy to reclaim. But now it would be part of
a larger folio so more difficult to reclaim. Solve this by simply not
reading it into memory in the first place.
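Sketch of the window arithmetic described above (helper and parameter
names are illustrative):

static void exec_ra_window(pgoff_t fault_index, unsigned int folio_order,
			   unsigned long ra_pages,
			   pgoff_t vma_start, pgoff_t vma_end,
			   pgoff_t *start, pgoff_t *end)
{
	/* Align down to the previous natural folio boundary instead of
	 * centering the window on the faulting page. */
	*start = fault_index & ~((1UL << folio_order) - 1);
	*end = *start + ra_pages;

	/* Confine to the VMA so padding between ELF sections is not
	 * pulled into large, hard-to-reclaim folios. */
	*start = max(*start, vma_start);
	*end = min(*end, vma_end);
}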
Benchmarking
============
The below shows pgbench and redis benchmarks on Graviton3 arm64 system.
First, confirmation that this patch causes more text to be contained in
64K folios:
+----------------------+---------------+---------------+---------------+
| File-backed folios by|  system boot  |    pgbench    |     redis     |
| size as percentage of+-------+-------+-------+-------+-------+-------+
| all mapped text mem  |before | after |before | after |before | after |
+======================+=======+=======+=======+=======+=======+=======+
| base-page-4kB        |  78%  |  30%  |  78%  |  11%  |  73%  |  14%  |
| thp-aligned-8kB      |   1%  |   0%  |   0%  |   0%  |   1%  |   0%  |
| thp-aligned-16kB     |  17%  |   4%  |  17%  |   3%  |  20%  |   4%  |
| thp-aligned-32kB     |   1%  |   1%  |   1%  |   2%  |   1%  |   1%  |
| thp-aligned-64kB     |   3%  |  63%  |   3%  |  81%  |   4%  |  77%  |
| thp-aligned-128kB    |   0%  |   1%  |   1%  |   1%  |   1%  |   2%  |
| thp-unaligned-64kB   |   0%  |   0%  |   0%  |   1%  |   0%  |   1%  |
| thp-unaligned-128kB  |   0%  |   1%  |   0%  |   0%  |   0%  |   0%  |
| thp-partial          |   0%  |   0%  |   0%  |   1%  |   0%  |   1%  |
+----------------------+-------+-------+-------+-------+-------+-------+
| cont-aligned-64kB    |   4%  |  65%  |   4%  |  83%  |   6%  |  79%  |
+----------------------+-------+-------+-------+-------+-------+-------+
The above shows that for both workloads (each isolated with cgroups) as
well as the general system state after boot, the amount of text backed by
4K and 16K folios reduces and the amount backed by 64K folios increases
significantly. And the amount of text that is contpte-mapped
significantly increases (see last row).
And this is reflected in performance improvement. "(I)" indicates a
statistically significant improvement. Note TPS and Reqs/sec are rates so
bigger is better, ms is time so smaller is better:
+-------------+-------------------------------------------+------------+
| Benchmark   | Result Class                              | Improvement|
+=============+===========================================+============+
| pts/pgbench | Scale: 1 Clients: 1 RO (TPS)              | (I) 3.47%  |
|             | Scale: 1 Clients: 1 RO - Latency (ms)     | -2.88%     |
|             | Scale: 1 Clients: 250 RO (TPS)            | (I) 5.02%  |
|             | Scale: 1 Clients: 250 RO - Latency (ms)   | (I) -4.79% |
|             | Scale: 1 Clients: 1000 RO (TPS)           | (I) 6.16%  |
|             | Scale: 1 Clients: 1000 RO - Latency (ms)  | (I) -5.82% |
|             | Scale: 100 Clients: 1 RO (TPS)            | 2.51%      |
|             | Scale: 100 Clients: 1 RO - Latency (ms)   | -3.51%     |
|             | Scale: 100 Clients: 250 RO (TPS)          | (I) 4.75%  |
|             | Scale: 100 Clients: 250 RO - Latency (ms) | (I) -4.44% |
|             | Scale: 100 Clients: 1000 RO (TPS)         | (I) 6.34%  |
|             | Scale: 100 Clients: 1000 RO - Latency (ms)| (I) -5.95% |
+-------------+-------------------------------------------+------------+
| pts/redis   | Test: GET Connections: 50 (Reqs/sec)      | (I) 3.20%  |
|             | Test: GET Connections: 1000 (Reqs/sec)    | (I) 2.55%  |
|             | Test: LPOP Connections: 50 (Reqs/sec)     | (I) 4.59%  |
|             | Test: LPOP Connections: 1000 (Reqs/sec)   | (I) 4.81%  |
|             | Test: LPUSH Connections: 50 (Reqs/sec)    | (I) 5.31%  |
|             | Test: LPUSH Connections: 1000 (Reqs/sec)  | (I) 4.36%  |
|             | Test: SADD Connections: 50 (Reqs/sec)     | (I) 2.64%  |
|             | Test: SADD Connections: 1000 (Reqs/sec)   | (I) 4.15%  |
|             | Test: SET Connections: 50 (Reqs/sec)      | (I) 3.11%  |
|             | Test: SET Connections: 1000 (Reqs/sec)    | (I) 3.36%  |
+-------------+-------------------------------------------+------------+
[ryan.roberts@arm.com: fix use-after-free]
Link: https://lkml.kernel.org/r/ea7f9da7-9a9f-4b85-9d0a-35b320f5ed25@arm.com
[ryan.roberts@arm.com: use the vma_pages() helper instead of open-coding]
Link: https://lkml.kernel.org/r/0e0f674b-3b7e-494f-ae7a-fc9dbb98dad4@arm.com
Link: https://lkml.kernel.org/r/20250609092729.274960-6-ryan.roberts@arm.com
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Will Deacon <will@kernel.org>
Cc: Chaitanya S Prakash <chaitanyas.prakash@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Previously the folio order of the previous readahead request was
inferred from the folio whose readahead marker was hit. But due to the
way we have
to round to non-natural boundaries sometimes, this first folio in the
readahead block is often smaller than the preferred order for that
request. This means that for cases where the initial sync readahead is
poorly aligned, the folio order will ramp up much more slowly.
So instead, let's store the order in struct file_ra_state so we are not
affected by any required alignment. We previously made enough room in
the struct for a 16-bit order field. This should be plenty big enough
since we
are limited to MAX_PAGECACHE_ORDER anyway, which is certainly never larger
than ~20.
Since we now pass the order in struct file_ra_state,
page_cache_ra_order() no longer needs its new_order parameter, so let's
remove that.
Worked example:
Here we are touching pages 17-256 sequentially just as we did in the
previous commit, but now that we are remembering the preferred order
explicitly, we no longer have the slow ramp up problem. Note specifically
that we no longer have 2 rounds (2x ~128K) of order-2 folios:
TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER RA
----- ---------- ---------- ---------- ------- ------- ----- ----- --
HOLE 0x00000000 0x00001000 4096 0 1 1
FOLIO 0x00001000 0x00002000 4096 1 2 1 0
FOLIO 0x00002000 0x00003000 4096 2 3 1 0
FOLIO 0x00003000 0x00004000 4096 3 4 1 0
FOLIO 0x00004000 0x00005000 4096 4 5 1 0
FOLIO 0x00005000 0x00006000 4096 5 6 1 0
FOLIO 0x00006000 0x00007000 4096 6 7 1 0
FOLIO 0x00007000 0x00008000 4096 7 8 1 0
FOLIO 0x00008000 0x00009000 4096 8 9 1 0
FOLIO 0x00009000 0x0000a000 4096 9 10 1 0
FOLIO 0x0000a000 0x0000b000 4096 10 11 1 0
FOLIO 0x0000b000 0x0000c000 4096 11 12 1 0
FOLIO 0x0000c000 0x0000d000 4096 12 13 1 0
FOLIO 0x0000d000 0x0000e000 4096 13 14 1 0
FOLIO 0x0000e000 0x0000f000 4096 14 15 1 0
FOLIO 0x0000f000 0x00010000 4096 15 16 1 0
FOLIO 0x00010000 0x00011000 4096 16 17 1 0
FOLIO 0x00011000 0x00012000 4096 17 18 1 0
FOLIO 0x00012000 0x00013000 4096 18 19 1 0
FOLIO 0x00013000 0x00014000 4096 19 20 1 0
FOLIO 0x00014000 0x00015000 4096 20 21 1 0
FOLIO 0x00015000 0x00016000 4096 21 22 1 0
FOLIO 0x00016000 0x00017000 4096 22 23 1 0
FOLIO 0x00017000 0x00018000 4096 23 24 1 0
FOLIO 0x00018000 0x00019000 4096 24 25 1 0
FOLIO 0x00019000 0x0001a000 4096 25 26 1 0
FOLIO 0x0001a000 0x0001b000 4096 26 27 1 0
FOLIO 0x0001b000 0x0001c000 4096 27 28 1 0
FOLIO 0x0001c000 0x0001d000 4096 28 29 1 0
FOLIO 0x0001d000 0x0001e000 4096 29 30 1 0
FOLIO 0x0001e000 0x0001f000 4096 30 31 1 0
FOLIO 0x0001f000 0x00020000 4096 31 32 1 0
FOLIO 0x00020000 0x00021000 4096 32 33 1 0
FOLIO 0x00021000 0x00022000 4096 33 34 1 0
FOLIO 0x00022000 0x00024000 8192 34 36 2 1
FOLIO 0x00024000 0x00028000 16384 36 40 4 2
FOLIO 0x00028000 0x0002c000 16384 40 44 4 2
FOLIO 0x0002c000 0x00030000 16384 44 48 4 2
FOLIO 0x00030000 0x00034000 16384 48 52 4 2
FOLIO 0x00034000 0x00038000 16384 52 56 4 2
FOLIO 0x00038000 0x0003c000 16384 56 60 4 2
FOLIO 0x0003c000 0x00040000 16384 60 64 4 2
FOLIO 0x00040000 0x00050000 65536 64 80 16 4
FOLIO 0x00050000 0x00060000 65536 80 96 16 4
FOLIO 0x00060000 0x00080000 131072 96 128 32 5
FOLIO 0x00080000 0x000a0000 131072 128 160 32 5
FOLIO 0x000a0000 0x000c0000 131072 160 192 32 5
FOLIO 0x000c0000 0x000e0000 131072 192 224 32 5
FOLIO 0x000e0000 0x00100000 131072 224 256 32 5
FOLIO 0x00100000 0x00120000 131072 256 288 32 5
FOLIO 0x00120000 0x00140000 131072 288 320 32 5 Y
HOLE 0x00140000 0x00800000 7077888 320 2048 1728
Link: https://lkml.kernel.org/r/20250609092729.274960-5-ryan.roberts@arm.com
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Chaitanya S Prakash <chaitanyas.prakash@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
We need to be able to store the preferred folio order associated with a
readahead request in the struct file_ra_state so that we can more
accurately increase the order across subsequent readahead requests. But
struct file_ra_state is per-struct file, so we don't really want to
increase its size.
mmap_miss is currently 32 bits but it is only counted up to 10 *
MMAP_LOTSAMISS, which is currently defined as 1000. So 16 bits should be
plenty. Redefine it to unsigned short, making room for the order as an
unsigned short in a follow-up commit.
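For orientation, a hedged sketch of the resulting layout (the real
struct lives in include/linux/fs.h and may differ in detail):

struct file_ra_state_sketch {
	pgoff_t		start;		/* where readahead started */
	unsigned int	size;		/* number of readahead pages */
	unsigned int	async_size;	/* async readahead trigger distance */
	unsigned int	ra_pages;	/* maximum readahead window */
	unsigned short	mmap_miss;	/* was unsigned int; 16 bits suffice */
	unsigned short	order;		/* preferred folio order (next patch) */
	loff_t		prev_pos;	/* position of the last read */
};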
Link: https://lkml.kernel.org/r/20250609092729.274960-4-ryan.roberts@arm.com
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Chaitanya S Prakash <chaitanyas.prakash@arm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Checking pde->proc_ops->proc_lseek directly may cause a UAF in the
rmmod scenario. It's a gap in proc_reg_open() after commit 654b33ada4ab
("proc: fix UAF in proc_get_inode()"). Following Al Viro's suggestion,
fix it in the same manner.
Link: https://lkml.kernel.org/r/20250607021353.1127963-1-wangzijie1@honor.com
Fixes: 3f61631d47f1 ("take care to handle NULL ->proc_lseek()")
Signed-off-by: wangzijie <wangzijie1@honor.com>
Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
UFFD_CLOEXEC, UFFD_NONBLOCK, and UFFD_FLAGS_SET have been unused since
they were added in commit 932b18e0aec6 ("userfaultfd:
linux/userfaultfd_k.h"). Remove them and the associated BUILD_BUG_ON()
checks.
Link: https://lkml.kernel.org/r/20250619-uffd-fixes-v3-4-a7274d3bd5e4@columbia.edu
Signed-off-by: Tal Zussman <tz2294@columbia.edu>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Peter Xu <peterx@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "mm: userfaultfd: assorted fixes and cleanups", v3.
Two fixes and two cleanups for userfaultfd.
Note that the third patch yields a small change in the ABI, but we seem to
have concluded that it is acceptable in this case.
This patch (of 4):
vma_can_userfault() masks off non-userfaultfd VM flags from vm_flags. The
vm_flags & VM_DROPPABLE test will then always be false, incorrectly
allowing VM_DROPPABLE regions to be registered with userfaultfd.
Additionally, vm_flags is not guaranteed to correspond to the actual VMA's
flags. Fix this test by checking the VMA's flags directly.
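A minimal sketch of the fix (not the verbatim hunk; the surrounding
checks are elided):

static bool vma_can_userfault_sketch(struct vm_area_struct *vma,
				     unsigned long vm_flags, bool wp_async)
{
	/* vm_flags has non-uffd bits masked off by the caller, so testing
	 * "vm_flags & VM_DROPPABLE" was always false.  Test the VMA's own
	 * flags instead. */
	if (vma->vm_flags & VM_DROPPABLE)
		return false;

	/* ... anonymous/shmem/hugetlb checks unchanged ... */
	return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) ||
	       vma_is_shmem(vma);
}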
Link: https://lkml.kernel.org/r/20250619-uffd-fixes-v3-0-a7274d3bd5e4@columbia.edu
Link: https://lore.kernel.org/linux-mm/5a875a3a-2243-4eab-856f-bc53ccfec3ea@redhat.com/
Link: https://lkml.kernel.org/r/20250619-uffd-fixes-v3-1-a7274d3bd5e4@columbia.edu
Fixes: 9651fcedf7b9 ("mm: add MAP_DROPPABLE for designating always lazily freeable mappings")
Signed-off-by: Tal Zussman <tz2294@columbia.edu>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Peter Xu <peterx@redhat.com>
Acked-by: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The register_one_node() function was a simple wrapper around
__register_one_node(). To simplify the code, register_one_node() has been
removed, and __register_one_node() has been renamed to
register_one_node().
Link: https://lkml.kernel.org/r/8262cd0f44eeb048a1fcd3ac8382760d7f7dea60.1748452242.git.donettom@linux.ibm.com
Signed-off-by: Donet Tom <donettom@linux.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The function register_memory_blocks_under_node() is now only called from
the memory hotplug path, as register_memory_blocks_under_node_early()
handles registration during early boot. Therefore, the context argument
used to differentiate between early boot and hotplug is no longer needed
and was removed.
Since the function is only called from the hotplug path, we renamed
register_memory_blocks_under_node() to
register_memory_blocks_under_node_hotplug().
Link: https://lkml.kernel.org/r/907c22292b0ee4975107876efc875c75c11badd9.1748452242.git.donettom@linux.ibm.com
Signed-off-by: Donet Tom <donettom@linux.ibm.com>
Acked-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
register_one_node() is now only called via cpu_up() →
__try_online_node() during CPU hotplug operations to online a node.
At this stage, the node has not yet had any memory added. As a result,
there are no memory blocks to walk or register, so calling
register_memory_blocks_under_node() is unnecessary.
Therefore, the call to register_memory_blocks_under_node() has been
removed from register_one_node().
Link: https://lkml.kernel.org/r/ecf07075b1a41015fcf58823997d5c2ed7b8c18f.1748452242.git.donettom@linux.ibm.com
Signed-off-by: Donet Tom <donettom@linux.ibm.com>
Acked-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "drivers/base/node.c: optimization and cleanups", v7.
This patch (of 7):
During node device initialization, `memory blocks` are registered under
each NUMA node. The `memory blocks` to be registered are identified using
the node's start and end PFNs, which are obtained from the node's
pg_data. However, not all PFNs within this range necessarily belong to
the same node—some may belong to other nodes. Additionally, due to the
discontiguous nature of physical memory, certain sections within a `memory
block` may be absent.
As a result, `memory blocks` that fall between a node's start and end PFNs
may span across multiple nodes, and some sections within those blocks may
be missing. `Memory blocks` have a fixed size, which is architecture
dependent.
Due to these considerations, the memory block registration is currently
performed as follows:
  for_each_online_node(nid):
      start_pfn = pgdat->node_start_pfn;
      end_pfn = pgdat->node_start_pfn + node_spanned_pages;
      for_each_memory_block_between(PFN_PHYS(start_pfn), PFN_PHYS(end_pfn))
          mem_blk = memory_block_id(pfn_to_section_nr(pfn));
          pfn_mb_start = section_nr_to_pfn(mem_blk->start_section_nr)
          pfn_mb_end = pfn_mb_start + memory_block_pfns - 1
          for (pfn = pfn_mb_start; pfn < pfn_mb_end; pfn++):
              if (get_nid_for_pfn(pfn) != nid):
                  continue;
              else
                  do_register_memory_block_under_node(nid, mem_blk,
                                                      MEMINIT_EARLY);
Here, we derive the start and end PFNs from the node's pg_data, then
determine the memory blocks that may belong to the node. For each `memory
block` in this range, we inspect all PFNs it contains and check their
associated NUMA node ID. If a PFN within the block matches the current
node, the memory block is registered under that node.
If CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, get_nid_for_pfn() performs
a binary search in the `memblock regions` to determine the NUMA node ID
for a given PFN. If it is not enabled, the node ID is retrieved directly
from the struct page.
On large systems, this process can become time-consuming, especially since
we iterate over each `memory block` and all PFNs within it until a match
is found. When CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, the
additional overhead of the binary search increases the execution time
significantly, potentially leading to soft lockups during boot.
In this patch, we iterate over `memblock region` to identify the `memory
blocks` that belong to the current NUMA node. `memblock regions` are
contiguous memory ranges, each associated with a single NUMA node, and
they do not span across multiple nodes.
  for_each_memory_region(r):  // r => region
      if (!node_online(r->nid)):
          continue;
      else
          for_each_memory_block_between(r->base, r->base + r->size - 1):
              do_register_memory_block_under_node(r->nid, mem_blk, MEMINIT_EARLY);
We iterate over all memblock regions, and if the node associated with the
region is online, we calculate the start and end memory blocks based on
the region's start and end PFNs. We then register all the memory blocks
within that range under the region node.
Test Results on My system with 32TB RAM
=======================================
1. Boot time with CONFIG_DEFERRED_STRUCT_PAGE_INIT enabled.
Without this patch
------------------
Startup finished in 1min 16.528s (kernel)
With this patch
---------------
Startup finished in 17.236s (kernel) - 78% Improvement
2. Boot time with CONFIG_DEFERRED_STRUCT_PAGE_INIT disabled.
Without this patch
------------------
Startup finished in 28.320s (kernel)
With this patch
---------------
Startup finished in 15.621s (kernel) - 46% Improvement
[donettom@linux.ibm.com: restore removed extra line]
Link: https://lkml.kernel.org/r/20250609140354.467908-1-donettom@linux.ibm.com
Link: https://lkml.kernel.org/r/2a0a05c2dffc62a742bf1dd030098be4ce99be28.1748452241.git.donettom@linux.ibm.com
Link: https://lkml.kernel.org/r/2a0a05c2dffc62a742bf1dd030098be4ce99be28.1748452241.git.donettom@linux.ibm.com
Signed-off-by: Donet Tom <donettom@linux.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
This field is now only set to 1 in the i915 gem code, which only calls
writeback_iter() on it, and writeback_iter() ignores the flag. All
other checks are thus dead code and the field can be removed.
Link: https://lkml.kernel.org/r/20250610054959.2057526-7-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
swap_writeout only needs the swap_iocb cookie from the writeback_control
structure, so pass it explicitly.
Link: https://lkml.kernel.org/r/20250610054959.2057526-6-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
shmem_writeout only needs the swap_iocb cookie and the split folio list.
Pass those explicitly and remove the now unused list member from struct
writeback_control.
Link: https://lkml.kernel.org/r/20250610054959.2057526-3-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The config is in fact an additional upper limit of pageblock_order, so
rename it to avoid confusion.
Link: https://lkml.kernel.org/r/20250604211427.1590859-1-ziy@nvidia.com
Signed-off-by: Zi Yan <ziy@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Juan Yescas <jyescas@google.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: "Isaac J. Manjarres" <isaacmanjarres@google.com>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: T.J. Mercier <tjmercier@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Especially once we hit one of the assertions in
sanity_check_pinned_pages(), observing follow-up assertions failing in
other code can give good clues about what went wrong, so use
VM_WARN_ON_ONCE instead.
While at it, let's just convert all VM_BUG_ON to VM_WARN_ON_ONCE as well.
Add one comment for the pfn_valid() check.
We have to introduce VM_WARN_ON_ONCE_VMA() to make that fly.
Drop the BUG_ON after mmap_read_lock_killable(); if that ever returns
something > 0 we're in bigger trouble. Convert the other BUG_ON's into
VM_WARN_ON_ONCE as well; they are in a similar "should never happen"
domain, but are more reasonable to check for during early testing.
[david@redhat.com: use the _FOLIO variant where possible, per Lorenzo]
Link: https://lkml.kernel.org/r/844bd929-a551-48e3-a12e-285cd65ba580@redhat.com
Link: https://lkml.kernel.org/r/20250604140544.688711-1-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
If a user wishes to enable KSM mergeability for an entire process and all
fork/exec'd processes that come after it, they use the prctl()
PR_SET_MEMORY_MERGE operation.
This defaults all newly mapped VMAs to have the VM_MERGEABLE VMA flag set
(in order to indicate they are KSM mergeable), as well as setting this
flag for all existing VMAs and propagating this across fork/exec.
However it also breaks VMA merging for new VMAs, both in the process and
all forked (and fork/exec'd) child processes.
This is because when a new mapping is proposed, the flags specified will
never have VM_MERGEABLE set. However all adjacent VMAs will already have
VM_MERGEABLE set, rendering VMAs unmergeable by default.
To work around this, we try to set the VM_MERGEABLE flag prior to
attempting a merge. In the case of brk() this can always be done.
However on mmap() things are more complicated - while KSM is not supported
for MAP_SHARED file-backed mappings, it is supported for MAP_PRIVATE
file-backed mappings.
These mappings may have deprecated .mmap() callbacks specified which
could, in theory, adjust flags and thus KSM eligibility.
So we check to determine whether this is possible. If not, we set
VM_MERGEABLE prior to the merge attempt on mmap(), otherwise we retain the
previous behaviour.
This fixes VMA merging for all new anonymous mappings, which covers the
majority of real-world cases, so we should see a significant improvement
in VMA mergeability.
For MAP_PRIVATE file-backed mappings, those which implement the
.mmap_prepare() hook and shmem are both known to be safe, so we allow
these, disallowing all other cases.
Also add stubs for newly introduced function invocations to VMA userland
testing.
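A hedged sketch of the shape of the early check (the helper names are
assumptions):

static vm_flags_t ksm_add_early_vm_flags(struct mm_struct *mm,
					 struct file *file,
					 vm_flags_t vm_flags)
{
	/* Only when the process opted in via PR_SET_MEMORY_MERGE and the
	 * mapping cannot have its KSM eligibility changed later by a
	 * legacy ->mmap() hook (anonymous, shmem, or ->mmap_prepare()). */
	if (test_bit(MMF_VM_MERGE_ANY, &mm->flags) &&
	    ksm_vma_flags_compatible(file, vm_flags))	/* hypothetical */
		vm_flags |= VM_MERGEABLE;

	return vm_flags;
}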
[lorenzo.stoakes@oracle.com: correctly invoke late KSM check after mmap hook]
Link: https://lkml.kernel.org/r/5861f8f6-cf5a-4d82-a062-139fb3f9cddb@lucifer.local
Link: https://lkml.kernel.org/r/3ba660af716d87a18ca5b4e635f2101edeb56340.1748537921.git.lorenzo.stoakes@oracle.com
Fixes: d7597f59d1d3 ("mm: add new api to enable ksm per process") # please no backport!
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Xu Xin <xu.xin16@zte.com.cn>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Stefan Roesch <shr@devkernel.io>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "mm: ksm: prevent KSM from breaking merging of new VMAs", v3.
When KSM-by-default is established using prctl(PR_SET_MEMORY_MERGE), this
defaults all newly mapped VMAs to having VM_MERGEABLE set, and thus makes
them available to KSM for samepage merging. It also sets VM_MERGEABLE in
all existing VMAs.
However this causes an issue upon mapping of new VMAs - the initial flags
will never have VM_MERGEABLE set when attempting a merge with adjacent
VMAs (this is set later in the mmap() logic), and adjacent VMAs will
ALWAYS have VM_MERGEABLE set.
This renders all newly mapped VMAs unmergeable.
To avoid this, this series performs the check for PR_SET_MEMORY_MERGE far
earlier in the mmap() logic, prior to the merge being attempted.
However we run into complexity with the deprecated .mmap() callback - if
a driver hooks this, it might change flags which adjust KSM merge
eligibility.
We have to worry about this because, while KSM is only applicable to
private mappings, this includes both anonymous and MAP_PRIVATE-mapped
file-backed mappings.
This isn't a problem for brk(), where the VMA must be anonymous. However
in mmap() we must be conservative - if the VMA is anonymous then we can
always proceed, however if not, we permit only shmem mappings (whose .mmap
hook does not affect KSM eligibility) and drivers which implement
.mmap_prepare() (invoked prior to the KSM eligibility check).
If we can't be sure of the driver changing things, then we maintain the
same behaviour of performing the KSM check later in the mmap() logic (and
thus losing new VMA mergeability).
A great many use-cases for this logic will use anonymous mappings at
any rate, so this change should already cover the majority of actual KSM
use-cases.
This patch (of 4):
In subsequent commits we are going to determine KSM eligibility prior to a
VMA being constructed, at which point we will of course not yet have
access to a VMA pointer.
It is trivial to boil down the check logic to be parameterised on
mm_struct, file and VMA flags, so do so.
As a part of this change, additionally expose and use file_is_dax() to
determine whether a file is being mapped under a DAX inode.
Link: https://lkml.kernel.org/r/cover.1748537921.git.lorenzo.stoakes@oracle.com
Link: https://lkml.kernel.org/r/36ad13eb50cdbd8aac6dcfba22c65d5031667295.1748537921.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Xu Xin <xu.xin16@zte.com.cn>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Stefan Roesch <shr@devkernel.io>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
This reverts commit ad6b26b6a0a79166b53209df2ca1cf8636296382.
This commit introduces per-memcg/task NUMA balance statistics, but
unfortunately it introduced a NULL pointer exception due to the following
race condition: after a swap task candidate was chosen, its mm_struct
pointer was set to NULL due to task exit. Later, when performing the
actual task swapping, the NULL p->mm caused the problem.
     CPU0                                     CPU1
      :
     ...
     task_numa_migrate
       task_numa_find_cpu
         task_numa_compare
           # a normal task p is chosen
           env->best_task = p
                                              # p exit:
                                              exit_signals(p);
                                                p->flags |= PF_EXITING
                                              exit_mm
                                                p->mm = NULL;
     migrate_swap_stop
       __migrate_swap_task(arg->src_task, arg->dst_cpu)
         count_memcg_event_mm(p->mm, NUMA_TASK_SWAP)  # p->mm is NULL
task_lock() should be held and the PF_EXITING flag needs to be checked to
prevent this from happening. After discussion, the conclusion was that
adding a lock is not worthwhile for some statistics calculations. Revert
the change and rely on the tracepoint for this purpose.
Link: https://lkml.kernel.org/r/20250704135620.685752-1-yu.c.chen@intel.com
Link: https://lkml.kernel.org/r/20250708064917.BBD13C4CEED@smtp.kernel.org
Fixes: ad6b26b6a0a7 ("sched/numa: add statistics of numa balance task")
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Reported-by: Jirka Hladky <jhladky@redhat.com>
Closes: https://lore.kernel.org/all/CAE4VaGBLJxpd=NeRJXpSCuw=REhC5LWJpC29kDy-Zh2ZDyzQZA@mail.gmail.com/
Reported-by: Srikanth Aithal <Srikanth.Aithal@amd.com>
Reported-by: Suneeth D <Suneeth.D@amd.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Hladky <jhladky@redhat.com>
Cc: Libo Chen <libo.chen@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
On some large machines with a high number of CPUs running a 64K pagesize
kernel, we found that the 'RES' field is always 0 displayed by the top
command for some processes, which will cause a lot of confusion for users.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
875525 root 20 0 12480 0 0 R 0.3 0.0 0:00.08 top
1 root 20 0 172800 0 0 S 0.0 0.0 0:04.52 systemd
The main reason is that the batch size of the percpu counter is quite
large on these machines, caching a significant percpu value, since
commit f1a7941243c1 ("mm: convert mm's rss stats into percpu_counter")
converted mm's rss stats into percpu counters. Intuitively, the batch
number should be optimized, but on some paths performance may take
precedence over statistical accuracy. Therefore, introduce a new
interface that adds up the percpu counts and displays the accurate
value to users, which removes the confusion. In addition, this change
is not expected to be on a performance-critical path, so the
modification should be acceptable.
In addition, the 'mm->rss_stat' is updated by using add_mm_counter() and
dec/inc_mm_counter(), which are all wrappers around
percpu_counter_add_batch(). In percpu_counter_add_batch(), there is
percpu batch caching to avoid 'fbc->lock' contention. This patch changes
task_mem() and task_statm() to get the accurate mm counters under the
'fbc->lock', but this should not exacerbate kernel 'mm->rss_stat' lock
contention due to the percpu batch caching of the mm counters. The
following test also confirms the theoretical analysis.
I run the stress-ng that stresses anon page faults in 32 threads on my 32
cores machine, while simultaneously running a script that starts 32
threads to busy-loop pread each stress-ng thread's /proc/pid/status
interface. From the following data, I did not observe any obvious impact
of this patch on the stress-ng tests.
w/o patch:
stress-ng: info: [6848] 4,399,219,085,152 CPU Cycles 67.327 B/sec
stress-ng: info: [6848] 1,616,524,844,832 Instructions 24.740 B/sec (0.367 instr. per cycle)
stress-ng: info: [6848] 39,529,792 Page Faults Total 0.605 M/sec
stress-ng: info: [6848] 39,529,792 Page Faults Minor 0.605 M/sec
w/patch:
stress-ng: info: [2485] 4,462,440,381,856 CPU Cycles 68.382 B/sec
stress-ng: info: [2485] 1,615,101,503,296 Instructions 24.750 B/sec (0.362 instr. per cycle)
stress-ng: info: [2485] 39,439,232 Page Faults Total 0.604 M/sec
stress-ng: info: [2485] 39,439,232 Page Faults Minor 0.604 M/sec
On comparing a very simple app which just allocates & touches some
memory against v6.1 (which doesn't have f1a7941243c1) and latest Linus
tree (4c06e63b9203) I can see that on latest Linus tree the values for
VmRSS, RssAnon and RssFile from /proc/self/status are all zeroes while
they do report values on v6.1 and a Linus tree with this patch.
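A sketch of the display-path change (helper names assumed):

/* Fast path (unchanged elsewhere): cached value, which may hide up to a
 * full percpu batch worth of pages per CPU. */
static unsigned long mm_counter_cached(struct mm_struct *mm, int member)
{
	return percpu_counter_read_positive(&mm->rss_stat[member]);
}

/* /proc display path: exact sum under fbc->lock, so small RSS values
 * are no longer reported as 0. */
static unsigned long mm_counter_exact(struct mm_struct *mm, int member)
{
	return percpu_counter_sum_positive(&mm->rss_stat[member]);
}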
Link: https://lkml.kernel.org/r/f4586b17f66f97c174f7fd1f8647374fdb53de1c.1749119050.git.baolin.wang@linux.alibaba.com
Fixes: f1a7941243c1 ("mm: convert mm's rss stats into percpu_counter")
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Tested-by: Donet Tom <donettom@linux.ibm.com>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: SeongJae Park <sj@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Broadcom PHYs featuring the BroadR-Reach two-wire link mode are usually
capable of operating in a simplified MII mode, without the TXER, RXER,
CRS and COL signals as defined for the MII. The absence of the COL
signal makes half-duplex link modes impossible; however, the
BroadR-Reach modes are all full-duplex only.
Depending on the IC encapsulation, there exist MII-Lite-only PHYs such
as the bcm54811 in MLP. The PHY itself is hardware-strapped to select
among multiple RGMII and MII-Lite modes, but the MII-Lite mode must
also be activated by software.
Add MII-Lite activation for bcm5481x PHYs.
Signed-off-by: Kamil Horák - 2N <kamilh@axis.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/20250708090140.61355-4-kamilh@axis.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Some Broadcom PHYs are capable of operating in a simplified MII mode,
without the TXER, RXER, CRS and COL signals as defined for the MII.
The MII-Lite mode can be used on most Ethernet controllers with full
MII interface by just leaving the input signals (RXER, CRS, COL)
inactive. The absence of COL signal makes half-duplex link modes
impossible but does not interfere with BroadR-Reach link modes on
Broadcom PHYs, because they are all full-duplex only.
Add MII-Lite interface mode, especially for Broadcom two-wire PHYs.
Signed-off-by: Kamil Horák - 2N <kamilh@axis.com>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/20250708090140.61355-2-kamilh@axis.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Now that there are 2 monitors for real-time applications, users may want to
enable both of them simultaneously. Make the number of per-task
monitors configurable. Default it to 2 for now.
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/93e83313fc4ba7f6e66f4abe80ca5f5494d658d0.1752088709.git.namcao@linutronix.de
Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
While attempting to implement DA monitors for some complex
specifications, the deterministic automaton was found to be
inappropriate as the specification language. The automaton becomes
complicated, hard to understand, and error-prone.
For these cases, linear temporal logic is more suitable as the
specification language.
Add support for linear temporal logic runtime verification monitor.
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Gabriele Monaco <gmonaco@redhat.com>
Link: https://lore.kernel.org/d366c1fed60ed4e8f6451f3c15a99755f2740b5f.1752088709.git.namcao@linutronix.de
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
Each RV monitor has one static buffer to send to the reactors. If multiple
errors are detected simultaneously, the one buffer could be overwritten.
Instead, leave it to the reactors to handle buffering.
Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
vpanic() is useful for implementing runtime verification reactors. Add it.
Signed-off-by: Nam Cao <namcao@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
vprintk_deferred() is useful for implementing runtime verification
reactors. Make it public.
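For instance, a reactor could forward a printf-style violation message
from any context roughly like this (the reactor name is illustrative):

#include <linux/printk.h>
#include <linux/stdarg.h>

static __printf(1, 2) void rv_react_sketch(const char *msg, ...)
{
	va_list args;

	va_start(args, msg);
	/* Deferred variant: safe even where taking console locks is not. */
	vprintk_deferred(msg, args);
	va_end(args);
}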
Signed-off-by: Nam Cao <namcao@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
EEVDF expects the scheduler to allocate a time quantum to the selected
entity and then pick a new entity for next quantum.
Although this notion of time quantum is not strictly doable in our case,
we can ensure a minimum runtime for each task most of the time and pick a
new entity after a minimum time has elapsed.
Reuse the slice protection of run-to-parity to ensure such a runtime
quantum.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250708165630.1948751-3-vincent.guittot@linaro.org
|
|
Chris reported that commit 5f6bd380c7bd ("sched/rt: Remove default
bandwidth control") caused a significant dip in his favourite
benchmark of the day. Simply disabling dl_server cured things.
His workload hammers the 0->1, 1->0 transitions, and the
dl_server_{start,stop}() overhead kills it -- fairly obviously a bad
idea in hindsight and all that.
Change things around to only disable the dl_server when there has not
been a fair task around for a whole period. Since the default period
is 1 second, this ensures the benchmark never trips this, overhead
gone.
Fixes: 557a6bfc662c ("sched/fair: Add trivial fair server")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lkml.kernel.org/r/20250702121158.465086194@infradead.org
|
|
Dietmar reported that commit 3840cbe24cf0 ("sched: psi: fix bogus
pressure spikes from aggregation race") caused a regression for him on
a high context switch rate benchmark (schbench) due to the now
repeating cpu_clock() calls.
In particular the problem is that get_recent_times() will extrapolate
the current state to 'now'. But if an update uses a timestamp from
before the start of the update, it is possible to get two reads
with inconsistent results. It is effectively back-dating an update.
(note that this all hard-relies on the clock being synchronized across
CPUs -- if this is not the case, all bets are off).
Combine this problem with the fact that there are per-group-per-cpu
seqcounts, the commit in question pushed the clock read into the group
iteration, causing tree-depth cpu_clock() calls. On architectures
where cpu_clock() has appreciable overhead, this hurts.
Instead move to a per-cpu seqcount, which allows us to have a single
clock read for all group updates, increasing internal consistency and
lowering update overhead. This comes at the cost of a longer update
side (proportional to the tree depth) which can cause the read side to
retry more often.
Fixes: 3840cbe24cf0 ("sched: psi: fix bogus pressure spikes from aggregation race")
Reported-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lkml.kernel.org/20250522084844.GC31726@noisy.programming.kicks-ass.net
|
|
typechecking is up to users, anyway...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lore.kernel.org/r/20250702212616.GI3406663@ZenIV
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
All users outside of fs/debugfs/file.c are gone; in there we can just
fully split the wrappers for the full and short cases and be done with
that.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lore.kernel.org/r/20250702212419.GG3406663@ZenIV
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Powering-off a genpd that was on during boot, before all of its consumer
devices have been probed, is certainly prone to problems.
As a step to improve this situation, let's prevent these genpds from being
powered-off until genpd_power_off_unused() gets called, which is a
late_initcall_sync().
Note that this still doesn't guarantee that all the consumer devices
have been probed before we allow the genpds to be powered off. Yet,
this should be a step in the right direction.
Suggested-by: Saravana Kannan <saravanak@google.com>
Tested-by: Hiago De Franco <hiago.franco@toradex.com> # Colibri iMX8X
Tested-by: Tomi Valkeinen <tomi.valkeinen@ideasonboard.com> # TI AM62A,Xilinx ZynqMP ZCU106
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Link: https://lore.kernel.org/r/20250701114733.636510-22-ulf.hansson@linaro.org
|
|
This can be used by frameworks to set the sync_state() helper functions
for drivers that don't already have them set.
Signed-off-by: Saravana Kannan <saravanak@google.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Abel Vesa <abel.vesa@linaro.org>
Tested-by: Hiago De Franco <hiago.franco@toradex.com> # Colibri iMX8X
Tested-by: Tomi Valkeinen <tomi.valkeinen@ideasonboard.com> # TI AM62A,Xilinx ZynqMP ZCU106
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Link: https://lore.kernel.org/r/20250701114733.636510-20-ulf.hansson@linaro.org
|
|
If the genpd provider's fwnode doesn't have an associated struct
device, we can make use of the generic genpd->dev and its corresponding
driver internally in genpd to manage ->sync_state().
More precisely, while adding a genpd OF provider let's check if the
fwnode has a device and, if not, prepare to handle ->sync_state()
internally through the genpd_provider_driver and the
genpd_provider_bus.
Note that genpd providers may opt out from this behaviour by setting
the GENPD_FLAG_NO_SYNC_STATE flag for the genpds in question.
Suggested-by: Saravana Kannan <saravanak@google.com>
Tested-by: Hiago De Franco <hiago.franco@toradex.com> # Colibri iMX8X
Tested-by: Tomi Valkeinen <tomi.valkeinen@ideasonboard.com> # TI AM62A,Xilinx ZynqMP ZCU106
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Link: https://lore.kernel.org/r/20250701114733.636510-19-ulf.hansson@linaro.org
|
|
It has turned out that get_dev_from_fwnode() is useful in a few other
places outside of the driver core, for example in gpiolib.c. Therefore,
let's make it available as a common helper function.
Suggested-by: Saravana Kannan <saravanak@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Tested-by: Hiago De Franco <hiago.franco@toradex.com> # Colibri iMX8X
Tested-by: Tomi Valkeinen <tomi.valkeinen@ideasonboard.com> # TI AM62A,Xilinx ZynqMP ZCU106
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Link: https://lore.kernel.org/r/20250701114733.636510-18-ulf.hansson@linaro.org
|
|
As there are no longer any users of zynqmp_pm_init_finalize() outside
the zynqmp firmware driver, let's turn it into a local static function.
Cc: Michal Simek <michal.simek@amd.com>
Tested-by: Hiago De Franco <hiago.franco@toradex.com> # Colibri iMX8X
Tested-by: Tomi Valkeinen <tomi.valkeinen@ideasonboard.com> # TI AM62A,Xilinx ZynqMP ZCU106
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Link: https://lore.kernel.org/r/20250701114733.636510-16-ulf.hansson@linaro.org
|
|
Before we can implement the common ->sync_state() support in genpd, we need
to allow a few specific genpd providers to opt out from the new behaviour.
Let's introduce GENPD_FLAG_NO_SYNC_STATE as a new genpd configuration
flag to allow providers to opt out.
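Usage would then look roughly like the following (the provider is made
up):

#include <linux/pm_domain.h>

static struct generic_pm_domain example_pd = {
	.name	= "example-pd",
	/* Opt out of the common ->sync_state() handling; this provider
	 * manages power-off ordering itself. */
	.flags	= GENPD_FLAG_NO_SYNC_STATE,
};

static int example_pd_setup(void)
{
	/* false: the domain is assumed to be powered on at init. */
	return pm_genpd_init(&example_pd, NULL, false);
}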
Suggested-by: Saravana Kannan <saravanak@google.com>
Tested-by: Hiago De Franco <hiago.franco@toradex.com> # Colibri iMX8X
Tested-by: Tomi Valkeinen <tomi.valkeinen@ideasonboard.com> # TI AM62A,Xilinx ZynqMP ZCU106
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Link: https://lore.kernel.org/r/20250701114733.636510-9-ulf.hansson@linaro.org
|
|
In some cases the typical platform driver that acts as a genpd provider
may need its own ->sync_state() callback to manage various things. In
this regard, the provider most likely wants to allow its corresponding
genpds to be powered off.
For this reason, let's introduce a new genpd helper function,
of_genpd_sync_state() that helps genpd provider drivers to achieve this.
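A provider driver might then wire it up roughly as follows (the argument
to of_genpd_sync_state() is assumed here to be the provider device; the
driver is made up):

#include <linux/platform_device.h>
#include <linux/pm_domain.h>

static void example_provider_sync_state(struct device *dev)
{
	/* Provider-specific bookkeeping first, then let genpd know all
	 * consumers are ready so unused domains may be powered off. */
	of_genpd_sync_state(dev);
}

static struct platform_driver example_provider_driver = {
	.driver = {
		.name		= "example-genpd-provider",
		.sync_state	= example_provider_sync_state,
	},
};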
Suggested-by: Saravana Kannan <saravanak@google.com>
Reviewed-by: Abel Vesa <abel.vesa@linaro.org>
Tested-by: Hiago De Franco <hiago.franco@toradex.com> # Colibri iMX8X
Tested-by: Tomi Valkeinen <tomi.valkeinen@ideasonboard.com> # TI AM62A,Xilinx ZynqMP ZCU106
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Link: https://lore.kernel.org/r/20250701114733.636510-8-ulf.hansson@linaro.org
|
|
This adds usb_alloc_noncoherent() and usb_free_noncoherent() functions
to support allocating and freeing buffers in a dma-noncoherent way.
To explicitly manage the memory ownership between the kernel and the
device, this also adds usb_dma_noncoherent_sync_for_cpu/device()
functions and calls them at the proper time. The management requires
the user to save the sg_table returned by usb_alloc_noncoherent() to
urb->sgt.
Signed-off-by: Xu Yang <xu.yang_2@nxp.com>
Reviewed-by: Alan Stern <stern@rowland.harvard.edu>
Link: https://lore.kernel.org/r/20250704095751.73765-2-xu.yang_2@nxp.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Implement dot11ExtendedRegInfoSupport to advertise the non-AP station
regulatory power capability as part of the regulatory connectivity
element in (Re)Association request frames so that the AP can achieve
maximum client connectivity. The control field, which was previously
interpreted using the value of 3 bits (B5 to B3), now uses the value of
4 bits (B6 to B3) to interpret the type of AP. Hence, update
IEEE80211_HE_6GHZ_OPER_CTRL_REG_INFO to parse the 4-bit control field.
If an older AP still sets only the 3-bit value of the control field,
the station can still interpret the value as per section E.2.7 of IEEE
802.11 REVme D7.0 and support the appropriate AP type.
Also update IEEE80211_6GHZ_CTRL_REG_INDOOR_SP_AP, as the value for a
standard power AP is changed to 8 instead of 4 so that the AP can
support both LPI AP and SP AP to maximize connectivity with stations.
For backward compatibility, value 4 is kept as the old AP type by
limiting it to SP AP only.
Signed-off-by: Somashekhar Puttagangaiah <somashekhar.puttagangaiah@intel.com>
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250609213232.90cdef116aad.I85da390fbee59355e3855691933e6a5e55c47ac4@changeid
[fix kernel-doc]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
We verify that the Extended MLD Capabilities match between links.
However, some bits are reserved and, in particular, the Recommended Max
Links subfield may not necessarily match. So only verify the known
subfields that can reliably be expected to be the same. More information
can be found in Table 9-417o, in IEEE P802.11be/D7.0.
Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250609213231.a2fad48dd3e6.Iae1740cd2ac833bc4a64fd2af718e1485158fd42@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
to pick up the __GENMASK() fix, otherwise the AUX clock VDSO patches fail
to compile for compat.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|