Age | Commit message (Collapse) | Author |
|
Test the basic functionality of the private hash:
- Upon start, with no threads there is no private hash.
- The first thread initializes the private hash.
- More than four threads will increase the size of the private hash if
the system has more than 16 CPUs online.
- Once the user sets the size of private hash, auto scaling is disabled.
- The user is only allowed to use numbers to the power of two.
- The user may request the global or make the hash immutable.
- Once the global hash has been set or the hash has been made immutable,
further changes are not allowed.
- Futex operations should work the whole time. It must be possible to
hold a lock, such a PI initialised mutex, during the resize operation.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-21-bigeasy@linutronix.de
|
|
Make it build without relying on recent headers.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
|
|
Add the -b/ --buckets argument to specify the number of hash buckets for
the private futex hash. This is directly passed to
prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_SET_SLOTS, buckets, immutable)
and must return without an error if specified. The `immutable' is 0 by
default and can be set to 1 via the -I/ --immutable argument.
The size of the private hash is verified with PR_FUTEX_HASH_GET_SLOTS.
If PR_FUTEX_HASH_GET_SLOTS failed then it is assumed that an older
kernel was used without the support and that the global hash is used.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-20-bigeasy@linutronix.de
|
|
Synchronize prctl.h with current uapi version after adding
PR_FUTEX_HASH.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-19-bigeasy@linutronix.de
|
|
Extend the futex2 interface to be aware of mempolicy.
When FUTEX2_MPOL is specified and there is a MPOL_PREFERRED or
home_node specified covering the futex address, use that hash-map.
Notably, in this case the futex will go to the global node hashtable,
even if it is a PRIVATE futex.
When FUTEX2_NUMA|FUTEX2_MPOL is specified and the user specified node
value is FUTEX_NO_NODE, the MPOL lookup (as described above) will be
tried first before reverting to setting node to the local node.
[bigeasy: add CONFIG_FUTEX_MPOL, add MPOL to FUTEX2_VALID_MASK, write
the node only to user if FUTEX_NO_NODE was supplied]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-18-bigeasy@linutronix.de
|
|
Extend the futex2 interface to be numa aware.
When FUTEX2_NUMA is specified for a futex, the user value is extended
to two words (of the same size). The first is the user value we all
know, the second one will be the node to place this futex on.
struct futex_numa_32 {
u32 val;
u32 node;
};
When node is set to ~0, WAIT will set it to the current node_id such
that WAKE knows where to find it. If userspace corrupts the node value
between WAIT and WAKE, the futex will not be found and no wakeup will
happen.
When FUTEX2_NUMA is not set, the node is simply an extension of the
hash, such that traditional futexes are still interleaved over the
nodes.
This is done to avoid having to have a separate !numa hash-table.
[bigeasy: ensure to have at least hashsize of 4 in futex_init(), add
pr_info() for size and allocation information. Cast the naddr math to
void*]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-17-bigeasy@linutronix.de
|
|
My initial testing showed that:
perf bench futex hash
reported less operations/sec with private hash. After using the same
amount of buckets in the private hash as used by the global hash then
the operations/sec were about the same.
This changed once the private hash became resizable. This feature added
an RCU section and reference counting via atomic inc+dec operation into
the hot path.
The reference counting can be avoided if the private hash is made
immutable.
Extend PR_FUTEX_HASH_SET_SLOTS by a fourth argument which denotes if the
private should be made immutable. Once set (to true) the a further
resize is not allowed (same if set to global hash).
Add PR_FUTEX_HASH_GET_IMMUTABLE which returns true if the hash can not
be changed.
Update "perf bench" suite.
For comparison, results of "perf bench futex hash -s":
- Xeon CPU E5-2650, 2 NUMA nodes, total 32 CPUs:
- Before the introducing task local hash
shared Averaged 1.487.148 operations/sec (+- 0,53%), total secs = 10
private Averaged 2.192.405 operations/sec (+- 0,07%), total secs = 10
- With the series
shared Averaged 1.326.342 operations/sec (+- 0,41%), total secs = 10
-b128 Averaged 141.394 operations/sec (+- 1,15%), total secs = 10
-Ib128 Averaged 851.490 operations/sec (+- 0,67%), total secs = 10
-b8192 Averaged 131.321 operations/sec (+- 2,13%), total secs = 10
-Ib8192 Averaged 1.923.077 operations/sec (+- 0,61%), total secs = 10
128 is the default allocation of hash buckets.
8192 was the previous amount of allocated hash buckets.
- Xeon(R) CPU E7-8890 v3, 4 NUMA nodes, total 144 CPUs:
- Before the introducing task local hash
shared Averaged 1.810.936 operations/sec (+- 0,26%), total secs = 20
private Averaged 2.505.801 operations/sec (+- 0,05%), total secs = 20
- With the series
shared Averaged 1.589.002 operations/sec (+- 0,25%), total secs = 20
-b1024 Averaged 42.410 operations/sec (+- 0,20%), total secs = 20
-Ib1024 Averaged 740.638 operations/sec (+- 1,51%), total secs = 20
-b65536 Averaged 48.811 operations/sec (+- 1,35%), total secs = 20
-Ib65536 Averaged 1.963.165 operations/sec (+- 0,18%), total secs = 20
1024 is the default allocation of hash buckets.
65536 was the previous amount of allocated hash buckets.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Link: https://lore.kernel.org/r/20250416162921.513656-16-bigeasy@linutronix.de
|
|
The mm_struct::futex_hash_lock guards the futex_hash_bucket assignment/
replacement. The futex_hash_allocate()/ PR_FUTEX_HASH_SET_SLOTS
operation can now be invoked at runtime and resize an already existing
internal private futex_hash_bucket to another size.
The reallocation is based on an idea by Thomas Gleixner: The initial
allocation of struct futex_private_hash sets the reference count
to one. Every user acquires a reference on the local hash before using
it and drops it after it enqueued itself on the hash bucket. There is no
reference held while the task is scheduled out while waiting for the
wake up.
The resize process allocates a new struct futex_private_hash and drops
the initial reference. Synchronized with mm_struct::futex_hash_lock it
is checked if the reference counter for the currently used
mm_struct::futex_phash is marked as DEAD. If so, then all users enqueued
on the current private hash are requeued on the new private hash and the
new private hash is set to mm_struct::futex_phash. Otherwise the newly
allocated private hash is saved as mm_struct::futex_phash_new and the
rehashing and reassigning is delayed to the futex_hash() caller once the
reference counter is marked DEAD.
The replacement is not performed at rcuref_put() time because certain
callers, such as futex_wait_queue(), drop their reference after changing
the task state. This change will be destroyed once the futex_hash_lock
is acquired.
The user can change the number slots with PR_FUTEX_HASH_SET_SLOTS
multiple times. An increase and decrease is allowed and request blocks
until the assignment is done.
The private hash allocated at thread creation is changed from 16 to
16 <= 4 * number_of_threads <= global_hash_size
where number_of_threads can not exceed the number of online CPUs. Should
the user PR_FUTEX_HASH_SET_SLOTS then the auto scaling is disabled.
[peterz: reorganize the code to avoid state tracking and simplify new
object handling, block the user until changes are in effect, allow
increase and decrease of the hash].
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-15-bigeasy@linutronix.de
|
|
Allocate a private futex hash with 16 slots if a task forks its first
thread.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-14-bigeasy@linutronix.de
|
|
The futex hash is system wide and shared by all tasks. Each slot
is hashed based on futex address and the VMA of the thread. Due to
randomized VMAs (and memory allocations) the same logical lock (pointer)
can end up in a different hash bucket on each invocation of the
application. This in turn means that different applications may share a
hash bucket on the first invocation but not on the second and it is not
always clear which applications will be involved. This can result in
high latency's to acquire the futex_hash_bucket::lock especially if the
lock owner is limited to a CPU and can not be effectively PI boosted.
Introduce basic infrastructure for process local hash which is shared by
all threads of process. This hash will only be used for a
PROCESS_PRIVATE FUTEX operation.
The hashmap can be allocated via:
prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_SET_SLOTS, num);
A `num' of 0 means that the global hash is used instead of a private
hash.
Other values for `num' specify the number of slots for the hash and the
number must be power of two, starting with two.
The prctl() returns zero on success. This function can only be used
before a thread is created.
The current status for the private hash can be queried via:
num = prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_GET_SLOTS);
which return the current number of slots. The value 0 means that the
global hash is used. Values greater than 0 indicate the number of slots
that are used. A negative number indicates an error.
For optimisation, for the private hash jhash2() uses only two arguments
the address and the offset. This omits the VMA which is always the same.
[peterz: Use 0 for global hash. A bit shuffling and renaming. ]
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-13-bigeasy@linutronix.de
|
|
Factor out the futex_hash_bucket initialisation into a helpr function.
The helper function will be used in a follow up patch implementing
process private hash buckets.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-12-bigeasy@linutronix.de
|
|
futex_lock_pi() and __fixup_pi_state_owner() acquire the
futex_q::lock_ptr without holding a reference assuming the previously
obtained hash bucket and the assigned lock_ptr are still valid. This
isn't the case once the private hash can be resized and becomes invalid
after the reference drop.
Introduce futex_q_lockptr_lock() to lock the hash bucket recorded in
futex_q::lock_ptr. The lock pointer is read in a RCU section to ensure
that it does not go away if the hash bucket has been replaced and the
old pointer has been observed. After locking the pointer needs to be
compared to check if it changed. If so then the hash bucket has been
replaced and the user has been moved to the new one and lock_ptr has
been updated. The lock operation needs to be redone in this case.
The locked hash bucket is not returned.
A special case is an early return in futex_lock_pi() (due to signal or
timeout) and a successful futex_wait_requeue_pi(). In both cases a valid
futex_q::lock_ptr is expected (and its matching hash bucket) but since
the waiter has been removed from the hash this can no longer be
guaranteed. Therefore before the waiter is removed and a reference is
acquired which is later dropped by the waiter to avoid a resize.
Add futex_q_lockptr_lock() and use it.
Acquire an additional reference in requeue_pi_wake_futex() and
futex_unlock_pi() while the futex_q is removed, denote this extra
reference in futex_q::drop_hb_ref and let the waiter drop the reference
in this case.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-11-bigeasy@linutronix.de
|
|
To support runtime resizing of the process private hash, it's required
to not use the obtained hash bucket once the reference count has been
dropped. The reference will be dropped after the unlock of the hash
bucket.
The amount of waiters is decremented after the unlock operation. There
is no requirement that this needs to happen after the unlock. The
increment happens before acquiring the lock to signal early that there
will be a waiter. The waiter can avoid blocking on the lock if it is
known that there will be no waiter.
There is no difference in terms of ordering if the decrement happens
before or after the unlock.
Decrease the waiter count before the unlock operation.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-10-bigeasy@linutronix.de
|
|
futex_wait_multiple_setup() changes task_struct::__state to
!TASK_RUNNING and then enqueues on multiple futexes. Every
futex_q_lock() acquires a reference on the global hash which is
dropped later.
If a rehash is in progress then the loop will block on
mm_struct::futex_hash_bucket for the rehash to complete and this will
lose the previously set task_struct::__state.
Acquire a reference on the local hash to avoiding blocking on
mm_struct::futex_hash_bucket.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-9-bigeasy@linutronix.de
|
|
This gets us:
fph = futex_private_hash(key) /* gets fph and inc users */
futex_private_hash_get(fph) /* inc users */
futex_private_hash_put(fph) /* dec users */
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-8-bigeasy@linutronix.de
|
|
This gets us:
hb = futex_hash(key) /* gets hb and inc users */
futex_hash_get(hb) /* inc users */
futex_hash_put(hb) /* dec users */
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-7-bigeasy@linutronix.de
|
|
Create explicit scopes for hb variables; almost pure re-indent.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-6-bigeasy@linutronix.de
|
|
Getting the hash bucket and queuing it are two distinct actions. In
light of wanting to add a put hash bucket function later, untangle
them.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-5-bigeasy@linutronix.de
|
|
futex_wait_setup() has a weird calling convention in order to return
hb to use as an argument to futex_queue().
Mostly such that requeue can have an extra test in between.
Reorder code a little to get rid of this and keep the hb usage inside
futex_wait_setup().
[bigeasy: fixes]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-4-bigeasy@linutronix.de
|
|
To enable node specific hash-tables using huge pages if possible.
[bigeasy: use __vmalloc_node_range_noprof(), add nommu bits, inline
vmalloc_huge]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250416162921.513656-3-bigeasy@linutronix.de
|
|
rcuref_read() returns the number of references that are currently held.
If 0 is returned then it is not safe to assume that the object ca be
scheduled for deconstruction because it is marked DEAD. This happens if
the return value of rcuref_put() is ignored and assumptions are made.
If 0 is returned then the counter transitioned from 0 to RCUREF_NOREF.
If rcuref_put() did not return to the caller then the counter did not
yet transition from RCUREF_NOREF to RCUREF_DEAD. This means that there
is still a chance that the counter will transition from RCUREF_NOREF to
0 meaning it is still valid and must not be deconstructed. In this brief
window rcuref_read() will return 0.
Provide rcuref_is_dead() to determine if the counter is marked as
RCUREF_DEAD.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-2-bigeasy@linutronix.de
|
|
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci
Pull PCI fixes from Bjorn Helgaas:
- When releasing a start-aligned resource, e.g., a bridge window, save
start/end/flags for the next assignment attempt; fixes a v6.15-rc1
regression (Ilpo Järvinen)
- Move set_pcie_speed.sh from TEST_PROGS to TEST_FILE; fixes a bwctrl
selftest v6.15-rc1 regression (Ilpo Järvinen)
- Add Manivannan Sadhasivam as maintainer of native host bridge and
endpoint drivers (Manivannan Sadhasivam)
- In endpoint test driver, defer IRQ allocation from .probe() until
ioctl() to fix a regression on platforms where the Vendor/Device ID
match doesn't include driver_data (Niklas Cassel)
* tag 'pci-v6.15-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci:
misc: pci_endpoint_test: Defer IRQ allocation until ioctl(PCITEST_SET_IRQTYPE)
MAINTAINERS: Move Manivannan Sadhasivam as PCI Native host bridge and endpoint maintainer
selftests/pcie_bwctrl: Fix test progs list
PCI: Restore assigned resources fully after release
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd fix from Chuck Lever:
- Revert a v6.15 patch due to a report of SELinux test failures
* tag 'nfsd-6.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
Revert "sunrpc: clean cache_detail immediately when flush is written frequently"
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc x86 fixes from Ingo Molnar:
- Fix 32-bit kernel boot crash if passed physical memory with more than
32 address bits
- Fix Xen PV crash
- Work around build bug in certain limited build environments
- Fix CTEST instruction decoding in insn_decoder_test
* tag 'x86-urgent-2025-04-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/insn: Fix CTEST instruction decoding
x86/boot: Work around broken busybox 'truncate' tool
x86/mm: Fix _pgd_alloc() for Xen PV mode
x86/e820: Discard high memory that can't be addressed by 32-bit systems
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Ingo Molnar:
"Fix sporadic crashes in dequeue_entities() due to ... bad math.
[ Arguably if pick_eevdf()/pick_next_entity() was less trusting of
complex math being correct it could have de-escalated a crash into
a warning, but that's for a different patch ]"
* tag 'sched-urgent-2025-04-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/eevdf: Fix se->slice being set to U64_MAX and resulting crash
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc perf events fixes from Ingo Molnar:
- Use POLLERR for events in error state, instead of the ambiguous
POLLHUP error value
- Fix non-sampling (counting) events on certain x86 platforms
* tag 'perf-urgent-2025-04-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86: Fix non-sampling (counting) events on certain x86 platforms
perf/core: Change to POLLERR for pinned events with error
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fix from Ingo Molnar:
"Fix crashes in the gic-v2m irqchip driver, caused by an incorrect
__init annotation"
* tag 'irq-urgent-2025-04-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/gic-v2m: Prevent use after free of gicv2m_get_fwnode()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson
Pull LoongArch fixes from Huacai Chen:
"Add a missing Kconfig option, fix some bugs in exception handlers,
memory management and KVM"
* tag 'loongarch-fixes-6.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson:
LoongArch: KVM: Fix PMU pass-through issue if VM exits to host finally
LoongArch: KVM: Fully clear some CSRs when VM reboot
LoongArch: KVM: Fix multiple typos of KVM code
LoongArch: Return NULL from huge_pte_offset() for invalid PMD
LoongArch: Remove a bogus reference to ZONE_DMA
LoongArch: Handle fp, lsx, lasx and lbt assembly symbols
LoongArch: Make do_xyz() exception handlers more robust
LoongArch: Make regs_irqs_disabled() more clear
LoongArch: Select ARCH_USE_MEMTEST
|
|
Pull OpenRISC updates from Stafford Horne:
- Support for cacheinfo API to expose OpenRISC cache info via sysfs,
this also translated to some cleanups to OpenRISC cache flush and
invalidate API's
- Documentation updates for new mailing list and toolchain binaries
* tag 'for-linus' of https://github.com/openrisc/linux:
Documentation: openrisc: Update toolchain binaries URL
Documentation: openrisc: Update mailing list
openrisc: Add cacheinfo support
openrisc: Introduce new utility functions to flush and invalidate caches
openrisc: Refactor struct cpuinfo_or1k to reduce duplication
|
|
Ondrej reports that certain SELinux tests are failing after commit
fc2a169c56de ("sunrpc: clean cache_detail immediately when flush is
written frequently"), merged during the v6.15 merge window.
Reported-by: Ondrej Mosnacek <omosnace@redhat.com>
Fixes: fc2a169c56de ("sunrpc: clean cache_detail immediately when flush is written frequently")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull kunit fix from Kees Cook:
"A single fix for the kunit lib/tests/ relocation:
- Ensure prime numbers tests are included in KUnit test runs (Mark Brown)"
* tag 'move-lib-kunit-v6.15-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
lib: Ensure prime numbers tests are included in KUnit test runs
|
|
Pull drm fixes from Dave Airlie:
"Weekly drm fixes, mostly amdgpu, with some exynos cleanups and a
couple of minor fixes, seems a bit quiet, but probably some lag from
Easter holidays.
amdgpu:
- P2P DMA fixes
- Display reset fixes
- DCN 3.5 fixes
- ACPI EDID fix
- LTTPR fix
- mode_valid() fix
exynos:
- fix spelling error
- remove redundant error handling in exynos_drm_vidi.c module
- marks struct decon_data as const in the exynos7_drm_decon driver
since it is only read
- Remove unnecessary checking in exynos_drm_drv.c module
meson:
- Fix VCLK calculation
panel:
- jd9365a: Fix reset polarity"
* tag 'drm-fixes-2025-04-26' of https://gitlab.freedesktop.org/drm/kernel:
drm/exynos: Fix spelling mistake "enqueu" -> "enqueue"
drm/exynos: exynos7_drm_decon: Consstify struct decon_data
drm/exynos: fixed a spelling error
drm/exynos/vidi: Remove redundant error handling in vidi_get_modes()
drm/exynos: Remove unnecessary checking
drm/amd/display: do not copy invalid CRTC timing info
drm/amd/display: Default IPS to RCG_IN_ACTIVE_IPS2_IN_OFF
drm/amd/display: Use 16ms AUX read interval for LTTPR with old sinks
drm/amd/display: Fix ACPI edid parsing on some Lenovo systems
drm/amdgpu: Allow P2P access through XGMI
drm/amd/display: Enable urgent latency adjustment on DCN35
drm/amd/display: Force full update in gpu reset
drm/amd/display: Fix gpu reset in multidisplay config
drm/amdgpu: Don't pin VRAM without DMABUF_MOVE_NOTIFY
drm/amdgpu: Use allowed_domains for pinning dmabufs
drm: panel: jd9365da: fix reset signal polarity in unprepare
drm/meson: use unsigned long long / Hz for frequency types
Revert "drm/meson: vclk: fix calculation of 59.94 fractional rates"
|
|
There is a code path in dequeue_entities() that can set the slice of a
sched_entity to U64_MAX, which sometimes results in a crash.
The offending case is when dequeue_entities() is called to dequeue a
delayed group entity, and then the entity's parent's dequeue is delayed.
In that case:
1. In the if (entity_is_task(se)) else block at the beginning of
dequeue_entities(), slice is set to
cfs_rq_min_slice(group_cfs_rq(se)). If the entity was delayed, then
it has no queued tasks, so cfs_rq_min_slice() returns U64_MAX.
2. The first for_each_sched_entity() loop dequeues the entity.
3. If the entity was its parent's only child, then the next iteration
tries to dequeue the parent.
4. If the parent's dequeue needs to be delayed, then it breaks from the
first for_each_sched_entity() loop _without updating slice_.
5. The second for_each_sched_entity() loop sets the parent's ->slice to
the saved slice, which is still U64_MAX.
This throws off subsequent calculations with potentially catastrophic
results. A manifestation we saw in production was:
6. In update_entity_lag(), se->slice is used to calculate limit, which
ends up as a huge negative number.
7. limit is used in se->vlag = clamp(vlag, -limit, limit). Because limit
is negative, vlag > limit, so se->vlag is set to the same huge
negative number.
8. In place_entity(), se->vlag is scaled, which overflows and results in
another huge (positive or negative) number.
9. The adjusted lag is subtracted from se->vruntime, which increases or
decreases se->vruntime by a huge number.
10. pick_eevdf() calls entity_eligible()/vruntime_eligible(), which
incorrectly returns false because the vruntime is so far from the
other vruntimes on the queue, causing the
(vruntime - cfs_rq->min_vruntime) * load calulation to overflow.
11. Nothing appears to be eligible, so pick_eevdf() returns NULL.
12. pick_next_entity() tries to dereference the return value of
pick_eevdf() and crashes.
Dumping the cfs_rq states from the core dumps with drgn showed tell-tale
huge vruntime ranges and bogus vlag values, and I also traced se->slice
being set to U64_MAX on live systems (which was usually "benign" since
the rest of the runqueue needed to be in a particular state to crash).
Fix it in dequeue_entities() by always setting slice from the first
non-empty cfs_rq.
Fixes: aef6987d8954 ("sched/eevdf: Propagate min_slice up the cgroup hierarchy")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/f0c2d1072be229e1bdddc73c0703919a8b00c652.1745570998.git.osandov@fb.com
|
|
With ACPI in place, gicv2m_get_fwnode() is registered with the pci
subsystem as pci_msi_get_fwnode_cb(), which may get invoked at runtime
during a PCI host bridge probe. But, the call back is wrongly marked as
__init, causing it to be freed, while being registered with the PCI
subsystem and could trigger:
Unable to handle kernel paging request at virtual address ffff8000816c0400
gicv2m_get_fwnode+0x0/0x58 (P)
pci_set_bus_msi_domain+0x74/0x88
pci_register_host_bridge+0x194/0x548
This is easily reproducible on a Juno board with ACPI boot.
Retain the function for later use.
Fixes: 0644b3daca28 ("irqchip/gic-v2m: acpi: Introducing GICv2m ACPI support")
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Cc: stable@vger.kernel.org
|
|
In function kvm_pre_enter_guest(), it prepares to enter guest and check
whether there are pending signals or events. And it will not enter guest
if there are, PMU pass-through preparation for guest should be cancelled
and host should own PMU hardware.
Cc: stable@vger.kernel.org
Fixes: f4e40ea9f78f ("LoongArch: KVM: Add PMU support for guest")
Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
|
|
Some registers such as LOONGARCH_CSR_ESTAT and LOONGARCH_CSR_GINTC are
partly cleared with function _kvm_setcsr(). This comes from the hardware
specification, some bits are read only in VM mode, and however they can
be written in host mode. So they are partly cleared in VM mode, and can
be fully cleared in host mode.
These read only bits show pending interrupt or exception status. When VM
reset, the read-only bits should be cleared, otherwise vCPU will receive
unknown interrupts in boot stage.
Here registers LOONGARCH_CSR_ESTAT/LOONGARCH_CSR_GINTC are fully cleared
in ioctl KVM_REG_LOONGARCH_VCPU_RESET vCPU reset path.
Cc: stable@vger.kernel.org
Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
|
|
Fix multiple typos inside arch/loongarch/kvm.
Cc: stable@vger.kernel.org
Reviewed-by: Yuli Wang <wangyuli@uniontech.com>
Reviewed-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Yulong Han <wheatfox17@icloud.com>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
|
|
LoongArch's huge_pte_offset() currently returns a pointer to a PMD slot
even if the underlying entry points to invalid_pte_table (indicating no
mapping). Callers like smaps_hugetlb_range() fetch this invalid entry
value (the address of invalid_pte_table) via this pointer.
The generic is_swap_pte() check then incorrectly identifies this address
as a swap entry on LoongArch, because it satisfies the "!pte_present()
&& !pte_none()" conditions. This misinterpretation, combined with a
coincidental match by is_migration_entry() on the address bits, leads to
kernel crashes in pfn_swap_entry_to_page().
Fix this at the architecture level by modifying huge_pte_offset() to
check the PMD entry's content using pmd_none() before returning. If the
entry is invalid (i.e., it points to invalid_pte_table), return NULL
instead of the pointer to the slot.
Cc: stable@vger.kernel.org
Acked-by: Peter Xu <peterx@redhat.com>
Co-developed-by: Hongchen Zhang <zhanghongchen@loongson.cn>
Signed-off-by: Hongchen Zhang <zhanghongchen@loongson.cn>
Signed-off-by: Ming Wang <wangming01@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
|
|
Remove dead code. LoongArch does not have a DMA memory zone (24bit DMA).
The architecture does not even define MAX_DMA_PFN.
Cc: stable@vger.kernel.org
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Petr Tesarik <ptesarik@suse.com>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
|
|
Like the other relevant symbols, export some fp, lsx, lasx and lbt
assembly symbols and put the function declarations in header files
rather than source files.
While at it, use "asmlinkage" for the other existing C prototypes
of assembly functions and also do not use the "extern" keyword with
function declarations according to the document coding-style.rst.
Cc: stable@vger.kernel.org # 6.6+
Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
|
|
Currently, interrupts need to be disabled before single-step mode is
set, it requires that CSR_PRMD_PIE be cleared in save_local_irqflag()
which is called by setup_singlestep(), this is reasonable.
But in the first kprobe breakpoint exception, if the irq is enabled at
the beginning of do_bp(), it will not be disabled at the end of do_bp()
due to the CSR_PRMD_PIE has been cleared in save_local_irqflag(). So for
this case, it may corrupt exception context when restoring the exception
after do_bp() in handle_bp(), this is not reasonable.
In order to restore exception safely in handle_bp(), it needs to ensure
the irq is disabled at the end of do_bp(), so just add a local variable
to record the original interrupt status in the parent context, then use
it as the check condition to enable and disable irq in do_bp().
While at it, do the similar thing for other do_xyz() exception handlers
to make them more robust.
Fixes: 6d4cc40fb5f5 ("LoongArch: Add kprobes support")
Suggested-by: Jinyang He <hejinyang@loongson.cn>
Suggested-by: Huacai Chen <chenhuacai@loongson.cn>
Co-developed-by: Tianyang Zhang <zhangtianyang@loongson.cn>
Signed-off-by: Tianyang Zhang <zhangtianyang@loongson.cn>
Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
|
|
In the current code, the definition of regs_irqs_disabled() is actually
"!(regs->csr_prmd & CSR_CRMD_IE)" because arch_irqs_disabled_flags() is
defined as "!(flags & CSR_CRMD_IE)", it looks a little strange.
Define regs_irqs_disabled() as !(regs->csr_prmd & CSR_PRMD_PIE) directly
to make it more clear, no functional change.
While at it, the return value of regs_irqs_disabled() is true or false,
so change its type to reflect that and also make it always inline.
Fixes: 803b0fc5c3f2 ("LoongArch: Add process management")
Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
|
|
As of commit dce44566192e ("mm/memtest: add ARCH_USE_MEMTEST"),
architectures must select ARCH_USE_MEMTESET to enable CONFIG_MEMTEST.
Commit 628c3bb40e9a ("LoongArch: Add boot and setup routines") added
support for early_memtest but did not select ARCH_USE_MEMTESET.
Fixes: 628c3bb40e9a ("LoongArch: Add boot and setup routines")
Tested-by: Erpeng Xu <xuerpeng@uniontech.com>
Tested-by: Yuli Wang <wangyuli@uniontech.com>
Signed-off-by: Yuli Wang <wangyuli@uniontech.com>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
|
|
Pull bpf fixes from Alexei Starovoitov:
- Add namespace to BPF internal symbols (Alexei Starovoitov)
- Fix possible endless loop in BPF map iteration (Brandon Kammerdiener)
- Fix compilation failure for samples/bpf on LoongArch (Haoran Jiang)
- Disable a part of sockmap_ktls test (Ihor Solodrai)
- Correct typo in __clang_major__ macro (Peilin Ye)
* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
selftests/bpf: Correct typo in __clang_major__ macro
samples/bpf: Fix compilation failure for samples/bpf on LoongArch Fedora
bpf: Add namespace to BPF internal symbols
selftests/bpf: add test for softlock when modifying hashmap while iterating
bpf: fix possible endless loop in BPF map iteration
selftests/bpf: Mitigate sockmap_ktls disconnect_after_delete failure
|
|
Make sure that CAN_USE_BPF_ST test (compute_live_registers/store) is
enabled when __clang_major__ >= 18.
Fixes: 2ea8f6a1cda7 ("selftests/bpf: test cases for compute_live_registers()")
Signed-off-by: Peilin Ye <yepeilin@google.com>
Link: https://lore.kernel.org/r/20250425213712.1542077-1-yepeilin@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux
Pull ata fixes from Damien Le Moal:
- Fix the incorrect return type of ata_mselect_control_ata_feature()
- Several fixes for the control of the Command Duration Limits feature
to avoid unnecessary enable and disable actions. Avoiding the
unnecessary enable action also avoids unwanted resets of the CDL
statistics log page as that is implied for any enable action.
- Fix the translation for sensing the control mode page to correctly
return the last enable or disable action performed, as defined in
SAT-6. This correct mode sense information is used to fix the
behavior of the scsi layer to avoid unnecessary mode select command
issuing.
* tag 'ata-6.15-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux:
scsi: Improve CDL control
ata: libata-scsi: Improve CDL control
ata: libata-scsi: Fix ata_msense_control_ata_feature()
ata: libata-scsi: Fix ata_mselect_control_ata_feature() return type
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
- For some reason we went from zero to three maintainers for HFS/HFS+
in a matter of days. The lesson to learn from this might just be that
we need to threaten code removal more often!?
- Fix a regression introduced by enabling large folios for lage logical
block sizes. This has caused issues for noref migration with large
folios due to sleeping while in an atomic context.
New sleeping variants of pagecache lookup helpers are introduced.
These helpers take the folio lock instead of the mapping's private
spinlock. The problematic users are converted to the sleeping
variants and serialize against noref migration. Atomic users will
bail on seeing the new BH_Migrate flag.
This also shrinks the critical region of the mapping's private lock
and the new blocking callers reduce contention on the spinlock for
bdev mappings.
- Fix two bugs in do_move_mount() when with MOVE_MOUNT_BENEATH. The
first bug is using a mountpoint that is located on a mount we're not
holding a reference to. The second bug is putting the mountpoint
after we've called namespace_unlock() as it's no longer guaranteed
that it does stay a mountpoint.
- Remove a pointless call to vfs_getattr_nosec() in the devtmpfs code
just to query i_mode instead of simply querying the inode directly.
This also avoids lifetime issues for the dm code by an earlier bugfix
this cycle that moved bdev_statx() handling into vfs_getattr_nosec().
- Fix AT_FDCWD handling with getname_maybe_null() in the xattr code.
- Fix a performance regression for files when multiple callers issue a
close when it's not the last reference.
- Remove a duplicate noinline annotation from pipe_clear_nowait().
* tag 'vfs-6.15-rc4.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs/xattr: Fix handling of AT_FDCWD in setxattrat(2) and getxattrat(2)
MAINTAINERS: hfs/hfsplus: add myself as maintainer
splice: remove duplicate noinline from pipe_clear_nowait
devtmpfs: don't use vfs_getattr_nosec to query i_mode
fix a couple of races in MNT_TREE_BENEATH handling by do_move_mount()
fs: fall back to file_ref_put() for non-last reference
mm/migrate: fix sleep in atomic for large folios and buffer heads
fs/ext4: use sleeping version of sb_find_get_block()
fs/jbd2: use sleeping version of __find_get_block()
fs/ocfs2: use sleeping version of __find_get_block()
fs/buffer: use sleeping version of __find_get_block()
fs/buffer: introduce sleeping flavors for pagecache lookups
MAINTAINERS: add HFS/HFS+ maintainers
fs/buffer: split locking for pagecache lookups
|
|
Pull ceph fixes from Ilya Dryomov:
"A small CephFS encryption-related fix and a dead code cleanup"
* tag 'ceph-for-6.15-rc4' of https://github.com/ceph/ceph-client:
ceph: Fix incorrect flush end position calculation
ceph: Remove osd_client deadcode
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl
Pull cxl fixes from Dave Jiang:
"The fixes address global persistent flush (GPF) changes and CXL
Features support changes that went in the 6.15 merge window. And also
a fix to an issue observed on CXL 1.1 platform during device
enumeration.
Summary:
- Fix using the wrong GPF DVSEC location:
- Fix caching of dport GPF DVSEC from the first endpoint
- Ensure that the GPF phase timeout is only updated once by first
endpoint
- Drop is_port parameter for cxl_gpf_get_dvsec()
- Fix the devm_* call host device for CXL fwctl setup
- Set the out_len in Set Features failure case
- Fix RCD initialization by skipping unneeded mem_en check"
* tag 'cxl-fixes-6.15-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl:
cxl/core/regs.c: Skip Memory Space Enable check for RCD and RCH Ports
cxl/feature: Update out_len in set feature failure case
cxl: Fix devm host device for CXL fwctl initialization
cxl/pci: Drop the parameter is_port of cxl_gpf_get_dvsec()
cxl/pci: Update Port GPF timeout only when the first EP attaching
cxl/core: Fix caching dport GPF DVSEC issue
|