Age | Commit message (Collapse) | Author |
|
The register_one_node() function was a simple wrapper around
__register_one_node(). To simplify the code, register_one_node() has been
removed, and __register_one_node() has been renamed to
register_one_node().
Link: https://lkml.kernel.org/r/8262cd0f44eeb048a1fcd3ac8382760d7f7dea60.1748452242.git.donettom@linux.ibm.com
Signed-off-by: Donet Tom <donettom@linux.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
context argument
The function register_memory_blocks_under_node() is now only called from
the memory hotplug path, as register_memory_blocks_under_node_early()
handles registration during early boot. Therefore, the context argument
used to differentiate between early boot and hotplug is no longer needed
and was removed.
Since the function is only called from the hotplug path, we renamed
register_memory_blocks_under_node() to
register_memory_blocks_under_node_hotplug()
Link: https://lkml.kernel.org/r/907c22292b0ee4975107876efc875c75c11badd9.1748452242.git.donettom@linux.ibm.com
Signed-off-by: Donet Tom <donettom@linux.ibm.com>
Acked-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
from register_one_node
register_one_node() is now only called via cpu_up() →
__try_online_node() during CPU hotplug operations to online a node.
At this stage, the node has not yet had any memory added. As a result,
there are no memory blocks to walk or register, so calling
register_memory_blocks_under_node() is unnecessary.
Therefore, the call to register_memory_blocks_under_node() has been
removed from register_one_node().
Link: https://lkml.kernel.org/r/ecf07075b1a41015fcf58823997d5c2ed7b8c18f.1748452242.git.donettom@linux.ibm.com
Signed-off-by: Donet Tom <donettom@linux.ibm.com>
Acked-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "drivers/base/node.c: optimization and cleanups", v7.
This patch (of 7)
During node device initialization, `memory blocks` are registered under
each NUMA node. The `memory blocks` to be registered are identified using
the node's start and end PFNs, which are obtained from the node's pg_data
However, not all PFNs within this range necessarily belong to the same
node—some may belong to other nodes. Additionally, due to the
discontiguous nature of physical memory, certain sections within a `memory
block` may be absent.
As a result, `memory blocks` that fall between a node's start and end PFNs
may span across multiple nodes, and some sections within those blocks may
be missing. `Memory blocks` have a fixed size, which is architecture
dependent.
Due to these considerations, the memory block registration is currently
performed as follows:
for_each_online_node(nid):
start_pfn = pgdat->node_start_pfn;
end_pfn = pgdat->node_start_pfn + node_spanned_pages;
for_each_memory_block_between(PFN_PHYS(start_pfn), PFN_PHYS(end_pfn))
mem_blk = memory_block_id(pfn_to_section_nr(pfn));
pfn_mb_start=section_nr_to_pfn(mem_blk->start_section_nr)
pfn_mb_end = pfn_start + memory_block_pfns - 1
for (pfn = pfn_mb_start; pfn < pfn_mb_end; pfn++):
if (get_nid_for_pfn(pfn) != nid):
continue;
else
do_register_memory_block_under_node(nid, mem_blk,
MEMINIT_EARLY);
Here, we derive the start and end PFNs from the node's pg_data, then
determine the memory blocks that may belong to the node. For each `memory
block` in this range, we inspect all PFNs it contains and check their
associated NUMA node ID. If a PFN within the block matches the current
node, the memory block is registered under that node.
If CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, get_nid_for_pfn() performs
a binary search in the `memblock regions` to determine the NUMA node ID
for a given PFN. If it is not enabled, the node ID is retrieved directly
from the struct page.
On large systems, this process can become time-consuming, especially since
we iterate over each `memory block` and all PFNs within it until a match
is found. When CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, the
additional overhead of the binary search increases the execution time
significantly, potentially leading to soft lockups during boot.
In this patch, we iterate over `memblock region` to identify the `memory
blocks` that belong to the current NUMA node. `memblock regions` are
contiguous memory ranges, each associated with a single NUMA node, and
they do not span across multiple nodes.
for_each_memory_region(r): // r => region
if (!node_online(r->nid)):
continue;
else
for_each_memory_block_between(r->base, r->base + r->size - 1):
do_register_memory_block_under_node(r->nid, mem_blk, MEMINIT_EARLY);
We iterate over all memblock regions, and if the node associated with the
region is online, we calculate the start and end memory blocks based on
the region's start and end PFNs. We then register all the memory blocks
within that range under the region node.
Test Results on My system with 32TB RAM
=======================================
1. Boot time with CONFIG_DEFERRED_STRUCT_PAGE_INIT enabled.
Without this patch
------------------
Startup finished in 1min 16.528s (kernel)
With this patch
---------------
Startup finished in 17.236s (kernel) - 78% Improvement
2. Boot time with CONFIG_DEFERRED_STRUCT_PAGE_INIT disabled.
Without this patch
------------------
Startup finished in 28.320s (kernel)
With this patch
---------------
Startup finished in 15.621s (kernel) - 46% Improvement
[donettom@linux.ibm.com: restore removed extra line]
Link: https://lkml.kernel.org/r/20250609140354.467908-1-donettom@linux.ibm.com
Link: https://lkml.kernel.org/r/2a0a05c2dffc62a742bf1dd030098be4ce99be28.1748452241.git.donettom@linux.ibm.com
Link: https://lkml.kernel.org/r/2a0a05c2dffc62a742bf1dd030098be4ce99be28.1748452241.git.donettom@linux.ibm.com
Signed-off-by: Donet Tom <donettom@linux.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
This field is now only set to one in the i915 gem code that only calls
writeback_iter on it, which ignores the flag. All other checks are thuse
dead code and the field can be removed.
Link: https://lkml.kernel.org/r/20250610054959.2057526-7-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
swap_writeout only needs the swap_iocb cookie from the writeback_control
structure, so pass it explicitly.
Link: https://lkml.kernel.org/r/20250610054959.2057526-6-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
shmem_writeout only needs the swap_iocb cookie and the split folio list.
Pass those explicitly and remove the now unused list member from struct
writeback_control.
Link: https://lkml.kernel.org/r/20250610054959.2057526-3-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The config is in fact an additional upper limit of pageblock_order, so
rename it to avoid confusion.
Link: https://lkml.kernel.org/r/20250604211427.1590859-1-ziy@nvidia.com
Signed-off-by: Zi Yan <ziy@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Juan Yescas <jyescas@google.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: "Isaac J. Manjarres" <isaacmanjarres@google.com>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: T.J. Mercier <tjmercier@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Especially once we hit one of the assertions in
sanity_check_pinned_pages(), observing follow-up assertions failing in
other code can give good clues about what went wrong, so use
VM_WARN_ON_ONCE instead.
While at it, let's just convert all VM_BUG_ON to VM_WARN_ON_ONCE as well.
Add one comment for the pfn_valid() check.
We have to introduce VM_WARN_ON_ONCE_VMA() to make that fly.
Drop the BUG_ON after mmap_read_lock_killable(), if that ever returns
something > 0 we're in bigger trouble. Convert the other BUG_ON's into
VM_WARN_ON_ONCE as well, they are in a similar domain "should never
happen", but more reasonable to check for during early testing.
[david@redhat.com: use the _FOLIO variant where possible, per Lorenzo]
Link: https://lkml.kernel.org/r/844bd929-a551-48e3-a12e-285cd65ba580@redhat.com
Link: https://lkml.kernel.org/r/20250604140544.688711-1-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
If a user wishes to enable KSM mergeability for an entire process and all
fork/exec'd processes that come after it, they use the prctl()
PR_SET_MEMORY_MERGE operation.
This defaults all newly mapped VMAs to have the VM_MERGEABLE VMA flag set
(in order to indicate they are KSM mergeable), as well as setting this
flag for all existing VMAs and propagating this across fork/exec.
However it also breaks VMA merging for new VMAs, both in the process and
all forked (and fork/exec'd) child processes.
This is because when a new mapping is proposed, the flags specified will
never have VM_MERGEABLE set. However all adjacent VMAs will already have
VM_MERGEABLE set, rendering VMAs unmergeable by default.
To work around this, we try to set the VM_MERGEABLE flag prior to
attempting a merge. In the case of brk() this can always be done.
However on mmap() things are more complicated - while KSM is not supported
for MAP_SHARED file-backed mappings, it is supported for MAP_PRIVATE
file-backed mappings.
These mappings may have deprecated .mmap() callbacks specified which
could, in theory, adjust flags and thus KSM eligibility.
So we check to determine whether this is possible. If not, we set
VM_MERGEABLE prior to the merge attempt on mmap(), otherwise we retain the
previous behaviour.
This fixes VMA merging for all new anonymous mappings, which covers the
majority of real-world cases, so we should see a significant improvement
in VMA mergeability.
For MAP_PRIVATE file-backed mappings, those which implement the
.mmap_prepare() hook and shmem are both known to be safe, so we allow
these, disallowing all other cases.
Also add stubs for newly introduced function invocations to VMA userland
testing.
[lorenzo.stoakes@oracle.com: correctly invoke late KSM check after mmap hook]
Link: https://lkml.kernel.org/r/5861f8f6-cf5a-4d82-a062-139fb3f9cddb@lucifer.local
Link: https://lkml.kernel.org/r/3ba660af716d87a18ca5b4e635f2101edeb56340.1748537921.git.lorenzo.stoakes@oracle.com
Fixes: d7597f59d1d3 ("mm: add new api to enable ksm per process") # please no backport!
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Xu Xin <xu.xin16@zte.com.cn>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Stefan Roesch <shr@devkernel.io>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "mm: ksm: prevent KSM from breaking merging of new VMAs", v3.
When KSM-by-default is established using prctl(PR_SET_MEMORY_MERGE), this
defaults all newly mapped VMAs to having VM_MERGEABLE set, and thus makes
them available to KSM for samepage merging. It also sets VM_MERGEABLE in
all existing VMAs.
However this causes an issue upon mapping of new VMAs - the initial flags
will never have VM_MERGEABLE set when attempting a merge with adjacent
VMAs (this is set later in the mmap() logic), and adjacent VMAs will
ALWAYS have VM_MERGEABLE set.
This renders all newly mapped VMAs unmergeable.
To avoid this, this series performs the check for PR_SET_MEMORY_MERGE far
earlier in the mmap() logic, prior to the merge being attempted.
However we run into complexity with the depreciated .mmap() callback - if
a driver hooks this, it might change flags which adjust KSM merge
eligibility.
We have to worry about this because, while KSM is only applicable to
private mappings, this includes both anonymous and MAP_PRIVATE-mapped
file-backed mappings.
This isn't a problem for brk(), where the VMA must be anonymous. However
in mmap() we must be conservative - if the VMA is anonymous then we can
always proceed, however if not, we permit only shmem mappings (whose .mmap
hook does not affect KSM eligibility) and drivers which implement
.mmap_prepare() (invoked prior to the KSM eligibility check).
If we can't be sure of the driver changing things, then we maintain the
same behaviour of performing the KSM check later in the mmap() logic (and
thus losing new VMA mergeability).
A great many use-cases for this logic will use anonymous mappings any
rate, so this change should already cover the majority of actual KSM
use-cases.
This patch (of 4):
In subsequent commits we are going to determine KSM eligibility prior to a
VMA being constructed, at which point we will of course not yet have
access to a VMA pointer.
It is trivial to boil down the check logic to be parameterised on
mm_struct, file and VMA flags, so do so.
As a part of this change, additionally expose and use file_is_dax() to
determine whether a file is being mapped under a DAX inode.
Link: https://lkml.kernel.org/r/cover.1748537921.git.lorenzo.stoakes@oracle.com
Link: https://lkml.kernel.org/r/36ad13eb50cdbd8aac6dcfba22c65d5031667295.1748537921.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Xu Xin <xu.xin16@zte.com.cn>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Stefan Roesch <shr@devkernel.io>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
This reverts commit ad6b26b6a0a79166b53209df2ca1cf8636296382.
This commit introduces per-memcg/task NUMA balance statistics, but
unfortunately it introduced a NULL pointer exception due to the following
race condition: After a swap task candidate was chosen, its mm_struct
pointer was set to NULL due to task exit. Later, when performing the
actual task swapping, the p->mm caused the problem.
CPU0 CPU1
:
...
task_numa_migrate
task_numa_find_cpu
task_numa_compare
# a normal task p is chosen
env->best_task = p
# p exit:
exit_signals(p);
p->flags |= PF_EXITING
exit_mm
p->mm = NULL;
migrate_swap_stop
__migrate_swap_task((arg->src_task, arg->dst_cpu)
count_memcg_event_mm(p->mm, NUMA_TASK_SWAP)# p->mm is NULL
task_lock() should be held and the PF_EXITING flag needs to be checked to
prevent this from happening. After discussion, the conclusion was that
adding a lock is not worthwhile for some statistics calculations. Revert
the change and rely on the tracepoint for this purpose.
Link: https://lkml.kernel.org/r/20250704135620.685752-1-yu.c.chen@intel.com
Link: https://lkml.kernel.org/r/20250708064917.BBD13C4CEED@smtp.kernel.org
Fixes: ad6b26b6a0a7 ("sched/numa: add statistics of numa balance task")
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Reported-by: Jirka Hladky <jhladky@redhat.com>
Closes: https://lore.kernel.org/all/CAE4VaGBLJxpd=NeRJXpSCuw=REhC5LWJpC29kDy-Zh2ZDyzQZA@mail.gmail.com/
Reported-by: Srikanth Aithal <Srikanth.Aithal@amd.com>
Reported-by: Suneeth D <Suneeth.D@amd.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Hladky <jhladky@redhat.com>
Cc: Libo Chen <libo.chen@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
On some large machines with a high number of CPUs running a 64K pagesize
kernel, we found that the 'RES' field is always 0 displayed by the top
command for some processes, which will cause a lot of confusion for users.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
875525 root 20 0 12480 0 0 R 0.3 0.0 0:00.08 top
1 root 20 0 172800 0 0 S 0.0 0.0 0:04.52 systemd
The main reason is that the batch size of the percpu counter is quite
large on these machines, caching a significant percpu value, since
converting mm's rss stats into percpu_counter by commit f1a7941243c1 ("mm:
convert mm's rss stats into percpu_counter"). Intuitively, the batch
number should be optimized, but on some paths, performance may take
precedence over statistical accuracy. Therefore, introducing a new
interface to add the percpu statistical count and display it to users,
which can remove the confusion. In addition, this change is not expected
to be on a performance-critical path, so the modification should be
acceptable.
In addition, the 'mm->rss_stat' is updated by using add_mm_counter() and
dec/inc_mm_counter(), which are all wrappers around
percpu_counter_add_batch(). In percpu_counter_add_batch(), there is
percpu batch caching to avoid 'fbc->lock' contention. This patch changes
task_mem() and task_statm() to get the accurate mm counters under the
'fbc->lock', but this should not exacerbate kernel 'mm->rss_stat' lock
contention due to the percpu batch caching of the mm counters. The
following test also confirm the theoretical analysis.
I run the stress-ng that stresses anon page faults in 32 threads on my 32
cores machine, while simultaneously running a script that starts 32
threads to busy-loop pread each stress-ng thread's /proc/pid/status
interface. From the following data, I did not observe any obvious impact
of this patch on the stress-ng tests.
w/o patch:
stress-ng: info: [6848] 4,399,219,085,152 CPU Cycles 67.327 B/sec
stress-ng: info: [6848] 1,616,524,844,832 Instructions 24.740 B/sec (0.367 instr. per cycle)
stress-ng: info: [6848] 39,529,792 Page Faults Total 0.605 M/sec
stress-ng: info: [6848] 39,529,792 Page Faults Minor 0.605 M/sec
w/patch:
stress-ng: info: [2485] 4,462,440,381,856 CPU Cycles 68.382 B/sec
stress-ng: info: [2485] 1,615,101,503,296 Instructions 24.750 B/sec (0.362 instr. per cycle)
stress-ng: info: [2485] 39,439,232 Page Faults Total 0.604 M/sec
stress-ng: info: [2485] 39,439,232 Page Faults Minor 0.604 M/sec
On comparing a very simple app which just allocates & touches some
memory against v6.1 (which doesn't have f1a7941243c1) and latest Linus
tree (4c06e63b9203) I can see that on latest Linus tree the values for
VmRSS, RssAnon and RssFile from /proc/self/status are all zeroes while
they do report values on v6.1 and a Linus tree with this patch.
Link: https://lkml.kernel.org/r/f4586b17f66f97c174f7fd1f8647374fdb53de1c.1749119050.git.baolin.wang@linux.alibaba.com
Fixes: f1a7941243c1 ("mm: convert mm's rss stats into percpu_counter")
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Tested-by Donet Tom <donettom@linux.ibm.com>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: SeongJae Park <sj@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Broadcom PHYs featuring the BroadR-Reach two-wire link mode are usually
capable to operate in simplified MII mode, without TXER, RXER, CRS and
COL signals as defined for the MII. The absence of COL signal makes
half-duplex link modes impossible, however, the BroadR-Reach modes are
all full-duplex only.
Depending on the IC encapsulation, there exist MII-Lite-only PHYs such
as bcm54811 in MLP. The PHY itself is hardware-strapped to select among
multiple RGMII and MII-Lite modes, but the MII-Lite mode must be also
activated by software.
Add MII-Lite activation for bcm5481x PHYs.
Signed-off-by: Kamil Horák - 2N <kamilh@axis.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/20250708090140.61355-4-kamilh@axis.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Some Broadcom PHYs are capable to operate in simplified MII mode,
without TXER, RXER, CRS and COL signals as defined for the MII.
The MII-Lite mode can be used on most Ethernet controllers with full
MII interface by just leaving the input signals (RXER, CRS, COL)
inactive. The absence of COL signal makes half-duplex link modes
impossible but does not interfere with BroadR-Reach link modes on
Broadcom PHYs, because they are all full-duplex only.
Add MII-Lite interface mode, especially for Broadcom two-wire PHYs.
Signed-off-by: Kamil Horák - 2N <kamilh@axis.com>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/20250708090140.61355-2-kamilh@axis.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Now that there are 2 monitors for real-time applications, users may want to
enable both of them simultaneously. Make the number of per-task monitor
configurable. Default it to 2 for now.
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/93e83313fc4ba7f6e66f4abe80ca5f5494d658d0.1752088709.git.namcao@linutronix.de
Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
While attempting to implement DA monitors for some complex specifications,
deterministic automaton is found to be inappropriate as the specification
language. The automaton is complicated, hard to understand, and
error-prone.
For these cases, linear temporal logic is more suitable as the
specification language.
Add support for linear temporal logic runtime verification monitor.
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Gabriele Monaco <gmonaco@redhat.com>
Link: https://lore.kernel.org/d366c1fed60ed4e8f6451f3c15a99755f2740b5f.1752088709.git.namcao@linutronix.de
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
Each RV monitor has one static buffer to send to the reactors. If multiple
errors are detected simultaneously, the one buffer could be overwritten.
Instead, leave it to the reactors to handle buffering.
Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
vpanic() is useful for implementing runtime verification reactors. Add it.
Signed-off-by: Nam Cao <namcao@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
vprintk_deferred() is useful for implementing runtime verification
reactors. Make it public.
Signed-off-by: Nam Cao <namcao@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
EEVDF expects the scheduler to allocate a time quantum to the selected
entity and then pick a new entity for next quantum.
Although this notion of time quantum is not strictly doable in our case,
we can ensure a minimum runtime for each task most of the time and pick a
new entity after a minimum time has elapsed.
Reuse the slice protection of run to parity to ensure such runtime
quantum.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250708165630.1948751-3-vincent.guittot@linaro.org
|
|
Chris reported that commit 5f6bd380c7bd ("sched/rt: Remove default
bandwidth control") caused a significant dip in his favourite
benchmark of the day. Simply disabling dl_server cured things.
His workload hammers the 0->1, 1->0 transitions, and the
dl_server_{start,stop}() overhead kills it -- fairly obviously a bad
idea in hind sight and all that.
Change things around to only disable the dl_server when there has not
been a fair task around for a whole period. Since the default period
is 1 second, this ensures the benchmark never trips this, overhead
gone.
Fixes: 557a6bfc662c ("sched/fair: Add trivial fair server")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lkml.kernel.org/r/20250702121158.465086194@infradead.org
|
|
Dietmar reported that commit 3840cbe24cf0 ("sched: psi: fix bogus
pressure spikes from aggregation race") caused a regression for him on
a high context switch rate benchmark (schbench) due to the now
repeating cpu_clock() calls.
In particular the problem is that get_recent_times() will extrapolate
the current state to 'now'. But if an update uses a timestamp from
before the start of the update, it is possible to get two reads
with inconsistent results. It is effectively back-dating an update.
(note that this all hard-relies on the clock being synchronized across
CPUs -- if this is not the case, all bets are off).
Combine this problem with the fact that there are per-group-per-cpu
seqcounts, the commit in question pushed the clock read into the group
iteration, causing tree-depth cpu_clock() calls. On architectures
where cpu_clock() has appreciable overhead, this hurts.
Instead move to a per-cpu seqcount, which allows us to have a single
clock read for all group updates, increasing internal consistency and
lowering update overhead. This comes at the cost of a longer update
side (proportional to the tree depth) which can cause the read side to
retry more often.
Fixes: 3840cbe24cf0 ("sched: psi: fix bogus pressure spikes from aggregation race")
Reported-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>,
Link: https://lkml.kernel.org/20250522084844.GC31726@noisy.programming.kicks-ass.net
|
|
typechecking is up to users, anyway...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lore.kernel.org/r/20250702212616.GI3406663@ZenIV
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
All users outside of fs/debugfs/file.c are gone, in there we can just
fully split the wrappers for full and short cases and be done with that.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lore.kernel.org/r/20250702212419.GG3406663@ZenIV
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Powering-off a genpd that was on during boot, before all of its consumer
devices have been probed, is certainly prone to problems.
As a step to improve this situation, let's prevent these genpds from being
powered-off until genpd_power_off_unused() gets called, which is a
late_initcall_sync().
Note that, this still doesn't guarantee that all the consumer devices has
been probed before we allow to power-off the genpds. Yet, this should be a
step in the right direction.
Suggested-by: Saravana Kannan <saravanak@google.com>
Tested-by: Hiago De Franco <hiago.franco@toradex.com> # Colibri iMX8X
Tested-by: Tomi Valkeinen <tomi.valkeinen@ideasonboard.com> # TI AM62A,Xilinx ZynqMP ZCU106
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Link: https://lore.kernel.org/r/20250701114733.636510-22-ulf.hansson@linaro.org
|
|
This can be used by frameworks to set the sync_state() helper functions
for drivers that don't already have them set.
Signed-off-by: Saravana Kannan <saravanak@google.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Abel Vesa <abel.vesa@linaro.org>
Tested-by: Hiago De Franco <hiago.franco@toradex.com> # Colibri iMX8X
Tested-by: Tomi Valkeinen <tomi.valkeinen@ideasonboard.com> # TI AM62A,Xilinx ZynqMP ZCU106
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Link: https://lore.kernel.org/r/20250701114733.636510-20-ulf.hansson@linaro.org
|
|
If the genpd provider's fwnode doesn't have an associated struct device
with it, we can make use of the generic genpd->dev and it corresponding
driver internally in genpd to manage ->sync_state().
More precisely, while adding a genpd OF provider let's check if the fwnode
has a device and if not, make the preparation to handle ->sync_state()
internally through the genpd_provider_driver and the genpd_provider_bus.
Note that, genpd providers may opt out from this behaviour by setting the
GENPD_FLAG_NO_SYNC_STATE config options for the genpds in question.
Suggested-by: Saravana Kannan <saravanak@google.com>
Tested-by: Hiago De Franco <hiago.franco@toradex.com> # Colibri iMX8X
Tested-by: Tomi Valkeinen <tomi.valkeinen@ideasonboard.com> # TI AM62A,Xilinx ZynqMP ZCU106
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Link: https://lore.kernel.org/r/20250701114733.636510-19-ulf.hansson@linaro.org
|
|
It has turned out get_dev_from_fwnode() is useful at a few other places
outside of the driver core, as in gpiolib.c for example. Therefore let's
make it available as a common helper function.
Suggested-by: Saravana Kannan <saravanak@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Tested-by: Hiago De Franco <hiago.franco@toradex.com> # Colibri iMX8X
Tested-by: Tomi Valkeinen <tomi.valkeinen@ideasonboard.com> # TI AM62A,Xilinx ZynqMP ZCU106
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Link: https://lore.kernel.org/r/20250701114733.636510-18-ulf.hansson@linaro.org
|
|
As there no longer any users outside the zynqmp firmware driver of
zynqmp_pm_init_finalize(), let's turn into a local static function.
Cc: Michal Simek <michal.simek@amd.com>
Tested-by: Hiago De Franco <hiago.franco@toradex.com> # Colibri iMX8X
Tested-by: Tomi Valkeinen <tomi.valkeinen@ideasonboard.com> # TI AM62A,Xilinx ZynqMP ZCU106
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Link: https://lore.kernel.org/r/20250701114733.636510-16-ulf.hansson@linaro.org
|
|
Before we can implement the common ->sync_state() support in genpd, we need
to allow a few specific genpd providers to opt out from the new behaviour.
Let's introduce GENPD_FLAG_NO_SYNC_STATE as a new genpd config option, to
allow providers to opt out.
Suggested-by: Saravana Kannan <saravanak@google.com>
Tested-by: Hiago De Franco <hiago.franco@toradex.com> # Colibri iMX8X
Tested-by: Tomi Valkeinen <tomi.valkeinen@ideasonboard.com> # TI AM62A,Xilinx ZynqMP ZCU106
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Link: https://lore.kernel.org/r/20250701114733.636510-9-ulf.hansson@linaro.org
|
|
In some cases the typical platform driver that act as genpd provider, may
need its own ->sync_state() callback to manage various things. In this
regards, the provider most likely wants to allow its corresponding genpds
to be powered-off.
For this reason, let's introduce a new genpd helper function,
of_genpd_sync_state() that helps genpd provider drivers to achieve this.
Suggested-by: Saravana Kannan <saravanak@google.com>
Reviewed-by: Abel Vesa <abel.vesa@linaro.org>
Tested-by: Hiago De Franco <hiago.franco@toradex.com> # Colibri iMX8X
Tested-by: Tomi Valkeinen <tomi.valkeinen@ideasonboard.com> # TI AM62A,Xilinx ZynqMP ZCU106
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Link: https://lore.kernel.org/r/20250701114733.636510-8-ulf.hansson@linaro.org
|
|
This will add usb_alloc_noncoherent() and usb_free_noncoherent()
functions to support alloc and free buffer in a dma-noncoherent way.
To explicit manage the memory ownership for the kernel and device,
this will also add usb_dma_noncoherent_sync_for_cpu/device() functions
and call it at proper time. The management requires the user save
sg_table returned by usb_alloc_noncoherent() to urb->sgt.
Signed-off-by: Xu Yang <xu.yang_2@nxp.com>
Reviewed-by: Alan Stern <stern@rowland.harvard.edu>
Link: https://lore.kernel.org/r/20250704095751.73765-2-xu.yang_2@nxp.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Implement dot11ExtendedRegInfoSupport to advertise non-AP station
regulatory power capability as part of regulatory connectivity
element in (Re)Association request frames so that AP can achieve
maximum client connectivity. Control field which was interpreted
using value of 3-bits B5 to B3, now uses value of 4-bits B6 to B3 to
interpret the type of AP. Hence update IEEE80211_HE_6GHZ_OPER_CTRL_REG_INFO
to parse 4-bits control field. If older AP still updates only 3-bits
value of control field, station can still interpret the value as per
section E.2.7 of IEEE 802.11 REVme D7.0 and support the appropriate
AP type.
Also update IEEE80211_6GHZ_CTRL_REG_INDOOR_SP_AP as the value of
standard power AP is changed to 8 instead of 4 so that AP can support both
LPI AP and SP AP to maximize the connectivity with stations. For backward
compatibility, keeping value 4 as old AP by limiting it to SP AP only.
Signed-off-by: Somashekhar Puttagangaiah <somashekhar.puttagangaiah@intel.com>
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250609213232.90cdef116aad.I85da390fbee59355e3855691933e6a5e55c47ac4@changeid
[fix kernel-doc]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
We verify that the Extended MLD Capabilities are matching between links.
However, some bits are reserved and in particular the Recommended Max
Links subfield may not necessarily match. So only verify the known
subfields that can reliably be expected to be the same. More information
can be found in Table 9-417o, in IEEE P802.11be/D7.0.
Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20250609213231.a2fad48dd3e6.Iae1740cd2ac833bc4a64fd2af718e1485158fd42@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
to pick up the __GENMASK() fix, otherwise the AUX clock VDSO patches fail
to compile for compat.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Runtime PM updates related to autosuspend for 6.17
Make several autosuspend functions mark last busy stamp and update
the documentation accordingly (Sakari Ailus).
|
|
inet6_sk(sk)->ipv6_ac_list is protected by lock_sock().
In ipv6_sock_ac_join(), only __dev_get_by_index(), __dev_get_by_flags(),
and __in6_dev_get() require RTNL.
__dev_get_by_flags() is only used by ipv6_sock_ac_join() and can be
converted to RCU version.
Let's replace RCU version helper and drop RTNL from IPV6_JOIN_ANYCAST.
setsockopt_needs_rtnl() will be removed in the next patch.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250702230210.3115355-15-kuni1840@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux
Tariq Toukan says:
====================
mlx5-next updates 2025-07-08
The following pull-request contains common mlx5 updates
for your *net-next* tree.
v2: https://lore.kernel.org/1751574385-24672-1-git-send-email-tariqt@nvidia.com
* 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux:
net/mlx5: Check device memory pointer before usage
net/mlx5: fs, fix RDMA TRANSPORT init cleanup flow
net/mlx5: Add IFC bits for PCIe Congestion Event object
net/mlx5: Small refactor for general object capabilities
net/mlx5: fs, add multiple prios to RDMA TRANSPORT steering domain
====================
Link: https://patch.msgid.link/1752002102-11316-1-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Propagate find_random_bit() to cpumask API.
CC: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: "Yury Norov [NVIDIA]" <yury.norov@gmail.com>
|
|
Generalize node_random() and make it available to general bitmaps and
cpumasks users.
Notice, find_first_bit() is generally faster than find_nth_bit(), and we
employ it when there's a single set bit in the bitmap.
See commit 3e061d924fe9c7b4 ("lib/nodemask: optimize node_random for
nodemask with single NUMA node").
CC: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: "Yury Norov [NVIDIA]" <yury.norov@gmail.com>
|
|
Populate the gic_kvm_info struct based on support for
FEAT_GCIE_LEGACY. The struct is used by KVM to probe for a compatible
GIC.
Co-authored-by: Timothy Hayes <timothy.hayes@arm.com>
Signed-off-by: Timothy Hayes <timothy.hayes@arm.com>
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
Reviewed-by: Lorenzo Pieralisi <lpieralisi@kernel.org>
Link: https://lore.kernel.org/r/20250627100847.1022515-3-sascha.bischoff@arm.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
All drivers are now converted to dedicated _rxfh_context ops.
Remove the use of >set_rxfh() to manage additional contexts.
Reviewed-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Edward Cree <ecree.xilinx@gmail.com>
Link: https://patch.msgid.link/20250707184115.2285277-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The GICv5 architecture implements the Interrupt Wire Bridge (IWB) in
order to support wired interrupts that cannot be connected directly
to an IRS and instead uses the ITS to translate a wire event into
an IRQ signal.
Add the wired-to-MSI IWB driver to manage IWB wired interrupts.
An IWB is connected to an ITS and it has its own deviceID for all
interrupt wires that it manages; the IWB input wire number must be
exposed to the ITS as an eventID with a 1:1 mapping.
This eventID is not programmable and therefore requires a new
msi_alloc_info_t flag to make sure the ITS driver does not allocate
an eventid for the wire but rather it uses the msi_alloc_info_t.hwirq
number to gather the ITS eventID.
Co-developed-by: Sascha Bischoff <sascha.bischoff@arm.com>
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
Co-developed-by: Timothy Hayes <timothy.hayes@arm.com>
Signed-off-by: Timothy Hayes <timothy.hayes@arm.com>
Signed-off-by: Lorenzo Pieralisi <lpieralisi@kernel.org>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250703-gicv5-host-v7-29-12e71f1b3528@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
The GICv5 architecture implements Interrupt Translation Service
(ITS) components in order to translate events coming from peripherals
into interrupt events delivered to the connected IRSes.
Events (ie MSI memory writes to ITS translate frame), are translated
by the ITS using tables kept in memory.
ITS translation tables for peripherals is kept in memory storage
(device table [DT] and Interrupt Translation Table [ITT]) that
is allocated by the driver on boot.
Both tables can be 1- or 2-level; the structure is chosen by the
driver after probing the ITS HW parameters and checking the
allowed table splits and supported {device/event}_IDbits.
DT table entries are allocated on demand (ie when a device is
probed); the DT table is sized using the number of supported
deviceID bits in that that's a system design decision (ie the
number of deviceID bits implemented should reflect the number
of devices expected in a system) therefore it makes sense to
allocate a DT table that can cater for the maximum number of
devices.
DT and ITT tables are allocated using the kmalloc interface;
the allocation size may be smaller than a page or larger,
and must provide contiguous memory pages.
LPIs INTIDs backing the device events are allocated one-by-one
and only upon Linux IRQ allocation; this to avoid preallocating
a large number of LPIs to cover the HW device MSI vector
size whereas few MSI entries are actually enabled by a device.
ITS cacheability/shareability attributes are programmed
according to the provided firmware ITS description.
The GICv5 partially reuses the GICv3 ITS MSI parent infrastructure
and adds functions required to retrieve the ITS translate frame
addresses out of msi-map and msi-parent properties to implement
the GICv5 ITS MSI parent callbacks.
Co-developed-by: Sascha Bischoff <sascha.bischoff@arm.com>
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
Co-developed-by: Timothy Hayes <timothy.hayes@arm.com>
Signed-off-by: Timothy Hayes <timothy.hayes@arm.com>
Signed-off-by: Lorenzo Pieralisi <lpieralisi@kernel.org>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250703-gicv5-host-v7-28-12e71f1b3528@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
In some irqchip implementations the fwnode representing the IRQdomain
and the MSI controller fwnode do not match; in particular the IRQdomain
fwnode is the MSI controller fwnode parent.
To support selecting such IRQ domains, add a flag in core IRQ domain
code that explicitly tells the MSI lib to use the parent fwnode while
carrying out IRQ domain selection.
Update the msi-lib select callback with the resulting logic.
Signed-off-by: Lorenzo Pieralisi <lpieralisi@kernel.org>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250703-gicv5-host-v7-27-12e71f1b3528@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
IRQchip drivers need a PCI/MSI function to map a RID to a MSI
controller deviceID namespace and at the same time retrieve the
struct device_node pointer of the MSI controller the RID is mapped
to.
Add pci_msi_map_rid_ctlr_node() to achieve this purpose.
Cc Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Lorenzo Pieralisi <lpieralisi@kernel.org>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250703-gicv5-host-v7-25-12e71f1b3528@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
Add an of_msi_xlate() helper that maps a device ID and returns
the device node of the MSI controller the device ID is mapped to.
Required by core functions that need an MSI controller device node
pointer at the same time as a mapped device ID, of_msi_map_id() is not
sufficient for that purpose.
Signed-off-by: Lorenzo Pieralisi <lpieralisi@kernel.org>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Cc: Rob Herring <robh@kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Link: https://lore.kernel.org/r/20250703-gicv5-host-v7-24-12e71f1b3528@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
An IRS supports Logical Peripheral Interrupts (LPIs) and implement
Linux IPIs on top of it.
LPIs are used for interrupt signals that are translated by a
GICv5 ITS (Interrupt Translation Service) but also for software
generated IRQs - namely interrupts that are not driven by a HW
signal, ie IPIs.
LPIs rely on memory storage for interrupt routing and state.
LPIs state and routing information is kept in the Interrupt
State Table (IST).
IRSes provide support for 1- or 2-level IST tables configured
to support a maximum number of interrupts that depend on the
OS configuration and the HW capabilities.
On systems that provide 2-level IST support, always allow
the maximum number of LPIs; On systems with only 1-level
support, limit the number of LPIs to 2^12 to prevent
wasting memory (presumably a system that supports a 1-level
only IST is not expecting a large number of interrupts).
On a 2-level IST system, L2 entries are allocated on
demand.
The IST table memory is allocated using the kmalloc() interface;
the allocation required may be smaller than a page and must be
made up of contiguous physical pages if larger than a page.
On systems where the IRS is not cache-coherent with the CPUs,
cache mainteinance operations are executed to clean and
invalidate the allocated memory to the point of coherency
making it visible to the IRS components.
On GICv5 systems, IPIs are implemented using LPIs.
Add an LPI IRQ domain and implement an IPI-specific IRQ domain created
as a child/subdomain of the LPI domain to allocate the required number
of LPIs needed to implement the IPIs.
IPIs are backed by LPIs, add LPIs allocation/de-allocation
functions.
The LPI INTID namespace is managed using an IDA to alloc/free LPI INTIDs.
Associate an IPI irqchip with IPI IRQ descriptors to provide
core code with the irqchip.ipi_send_single() method required
to raise an IPI.
Co-developed-by: Sascha Bischoff <sascha.bischoff@arm.com>
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
Co-developed-by: Timothy Hayes <timothy.hayes@arm.com>
Signed-off-by: Timothy Hayes <timothy.hayes@arm.com>
Signed-off-by: Lorenzo Pieralisi <lpieralisi@kernel.org>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Cc: Will Deacon <will@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Link: https://lore.kernel.org/r/20250703-gicv5-host-v7-22-12e71f1b3528@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
The GICv5 Interrupt Routing Service (IRS) component implements
interrupt management and routing in the GICv5 architecture.
A GICv5 system comprises one or more IRSes, that together
handle the interrupt routing and state for the system.
An IRS supports Shared Peripheral Interrupts (SPIs), that are
interrupt sources directly connected to the IRS; they do not
rely on memory for storage. The number of supported SPIs is
fixed for a given implementation and can be probed through IRS
IDR registers.
SPI interrupt state and routing are managed through GICv5
instructions.
Each core (PE in GICv5 terms) in a GICv5 system is identified with
an Interrupt AFFinity ID (IAFFID).
An IRS manages a set of cores that are connected to it.
Firmware provides a topology description that the driver uses
to detect to which IRS a CPU (ie an IAFFID) is associated with.
Use probeable information and firmware description to initialize
the IRSes and implement GICv5 IRS SPIs support through an
SPI-specific IRQ domain.
The GICv5 IRS driver:
- Probes IRSes in the system to detect SPI ranges
- Associates an IRS with a set of cores connected to it
- Adds an IRQchip structure for SPI handling
SPIs priority is set to a value corresponding to the lowest
permissible priority in the system (taking into account the
implemented priority bits of the IRS and CPU interface).
Since all IRQs are set to the same priority value, the value
itself does not matter as long as it is a valid one.
Co-developed-by: Sascha Bischoff <sascha.bischoff@arm.com>
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
Co-developed-by: Timothy Hayes <timothy.hayes@arm.com>
Signed-off-by: Timothy Hayes <timothy.hayes@arm.com>
Signed-off-by: Lorenzo Pieralisi <lpieralisi@kernel.org>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Cc: Will Deacon <will@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Link: https://lore.kernel.org/r/20250703-gicv5-host-v7-21-12e71f1b3528@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
|