The write() callback of legacy consoles usually makes use of
spinlocks. This is not permitted with PREEMPT_RT in atomic
contexts.
For PREEMPT_RT, create a new kthread to handle printing of all
the legacy consoles (and nbcon consoles if boot consoles are
registered). This allows legacy consoles to work on PREEMPT_RT
without requiring modification. (However they will not have
the reliability properties guaranteed by nbcon atomic
consoles.)
Use the existing printk_kthreads_check_locked() to start/stop
the legacy kthread as needed.
Introduce the macro force_legacy_kthread() to query if the
forced threading of legacy consoles is in effect. Although
currently only enabled for PREEMPT_RT, this acts as a simple
mechanism for the future to allow other preemption models to
easily take advantage of the non-interference property provided
by the legacy kthread.
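As a minimal sketch, such a gate can be a trivial config-dependent
macro (shown for illustration; the exact definition lives in the
printk internals):

  /* Illustrative sketch of the PREEMPT_RT-only gate. */
  #ifdef CONFIG_PREEMPT_RT
  # define force_legacy_kthread()	(true)
  #else
  # define force_legacy_kthread()	(false)
  #endif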
When force_legacy_kthread() is true, the legacy kthread
fulfills the role of the console_flush_type @legacy_offload by
waking the legacy kthread instead of printing via the
console_lock in the irq_work. If the legacy kthread is not
yet available, no legacy printing takes place (unless in
panic).
If for some reason the legacy kthread fails to create, any
legacy consoles are unregistered. With force_legacy_kthread(),
the legacy kthread is a critical component for legacy consoles.
These changes only affect CONFIG_PREEMPT_RT.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240904120536.115780-16-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
|
|
An emergency or panic context can take over console ownership
while the current owner was printing a printk message. The
atomic printer will re-print the message that the previous
owner was printing. However, this can look confusing to the
user and may even seem as though a message was lost.
  [3430014.1
  [3430014.181123] usb 1-2: Product: USB Audio
Add a new field @nbcon_prev_seq to struct console to track
the sequence number to print that was assigned to the previous
console owner. If this matches the sequence number to print
that the current owner is assigned, then a takeover must have
occurred. In this case, print an additional message to inform
the user that the previous message is being printed again.
  [3430014.1
  ** replaying previous printk message **
  [3430014.181123] usb 1-2: Product: USB Audio
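For illustration, the takeover check can be sketched as follows; the
@nbcon_prev_seq field is from this patch, while the helper name and
surrounding locals are hypothetical:

  /* Sketch only: prepend_replay_notice() is a hypothetical helper. */
  if (con->nbcon_prev_seq == ctxt->seq)
          prepend_replay_notice(&pmsg);  /* takeover: same record again */
  con->nbcon_prev_seq = ctxt->seq;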
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240904120536.115780-12-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
|
|
In order to support prepending different texts to printk
messages, split out the prepending code into a helper
function.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240904120536.115780-11-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
|
|
Once the kthread is running and available
(i.e. @printk_kthreads_running is set), the kthread becomes
responsible for flushing any pending messages which are added
in NBCON_PRIO_NORMAL context. Namely the legacy
console_flush_all() and device_release() no longer flush the
console. And nbcon_atomic_flush_pending() used by
nbcon_cpu_emergency_exit() no longer flushes messages added
after the emergency messages.
The console context is safe when used by the kthread only when one
of the following conditions is true:
1. Another caller acquires the console context with
NBCON_PRIO_NORMAL with preemption disabled. It will
release the context before rescheduling.
2. Another caller acquires the console context with
NBCON_PRIO_NORMAL under the device_lock.
3. The kthread is the only context which acquires the console
with NBCON_PRIO_NORMAL.
This is satisfied for all atomic printing call sites:
nbcon_legacy_emit_next_record() (#1)
nbcon_atomic_flush_pending_con() (#1)
nbcon_device_release() (#2)
It is even doubly guaranteed when @printk_kthreads_running
is set because then _only_ the kthread will print for
NBCON_PRIO_NORMAL. (#3)
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240904120536.115780-10-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
|
|
When printing via console_lock, the write_atomic() callback is
used for nbcon consoles. However, if it is known that the
current context is a task context, the write_thread() callback
can be used instead.
Using write_thread() instead of write_atomic() helps to reduce
large disabled preemption regions when the device_lock does not
disable preemption.
This is mainly a preparatory change to allow avoiding
write_atomic() completely during normal operation if boot
consoles are registered.
As a side-effect, it also allows consolidating the printing
code for legacy printing and the kthread printer.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240904120536.115780-9-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
|
|
Move nbcon_atomic_emit_one() so that it can be used by
nbcon_kthread_func() in a follow-up commit.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240904120536.115780-8-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
|
|
Provide the main implementation for running a printer kthread
per nbcon console that is takeover/handover aware. This
includes:
- new mandatory write_thread() callback
- kthread creation
- kthread main printing loop
- kthread wakeup mechanism
- kthread shutdown
kthread creation is a bit tricky because consoles may register
before kthreads can be created. In such cases, registration
will succeed, even though no kthread exists. Once kthreads can
be created, an early_initcall will set @printk_kthreads_ready.
If there are no registered boot consoles, the early_initcall
creates the kthreads for all registered nbcon consoles. If
kthread creation fails, the related console is unregistered.
If there are registered boot consoles when
@printk_kthreads_ready is set, no kthreads are created until
the final boot console unregisters.
Once kthread creation finally occurs, @printk_kthreads_running
is set so that the system knows kthreads are available for all
registered nbcon consoles.
If @printk_kthreads_running is already set when the console
is registering, the kthread is created during registration. If
kthread creation fails, the registration will fail.
Until @printk_kthreads_running is set, console printing occurs
directly via the console_lock.
kthread shutdown on system shutdown/reboot is necessary to
ensure the printer kthreads finish their printing so that the
system can cleanly transition back to direct printing via the
console_lock in order to reliably push out the final
shutdown/reboot messages. @printk_kthreads_running is cleared
before shutting down the individual kthreads.
The kthread uses a new mandatory write_thread() callback that
is called with both device_lock() and the console context
acquired.
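A hedged sketch of how the kthread invokes the callback (internal
names are illustrative, not the exact upstream code):

  /* One print iteration, serialized by device_lock(). */
  con->device_lock(con, &flags);
  if (nbcon_context_try_acquire(ctxt)) {
          con->write_thread(con, &wctxt);   /* mandatory callback */
          nbcon_context_release(ctxt);
  }
  con->device_unlock(con, flags);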
The console ownership handling is necessary for synchronization
against write_atomic() which is synchronized only via the
console context ownership.
The device_lock() serializes acquiring the console context with
NBCON_PRIO_NORMAL. It is needed in case the device_lock() does
not disable preemption. It prevents the following race:
CPU0                                 CPU1

[ task A ]

nbcon_context_try_acquire()
 # success with NORMAL prio
 # .unsafe == false;  // safe for takeover

[ schedule: task A -> B ]

                                     WARN_ON()
                                       nbcon_atomic_flush_pending()
                                         nbcon_context_try_acquire()
                                          # success with EMERGENCY prio
                                          # flushing
                                          nbcon_context_release()

                                          # HERE: con->nbcon_state is free
                                          #       to take by anyone !!!

nbcon_context_try_acquire()
 # success with NORMAL prio [ task B ]

[ schedule: task B -> A ]

nbcon_enter_unsafe()
 nbcon_context_can_proceed()

BUG: nbcon_context_can_proceed() returns "true" because
     the console is owned by a context on CPU0 with
     NBCON_PRIO_NORMAL.

     But it should return "false". The console is owned
     by a context from task B and we do the check
     in a context from task A.
Note that with these changes, the printer kthreads do not yet
take over full responsibility for nbcon printing during normal
operation. These changes only focus on the lifecycle of the
kthreads.
Co-developed-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Thomas Gleixner (Intel) <tglx@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240904120536.115780-7-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
|
|
When initializing an nbcon console, have nbcon_alloc() set
@nbcon_seq to the highest possible sequence number. For all
practical purposes, this will guarantee that the console
will have nothing to print until later when @nbcon_seq is
set to the proper initial printing value.
This will be particularly important once kthread printing is
introduced because nbcon_alloc() can create/start the kthread
before the desired initial sequence number is known.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240904120536.115780-6-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
|
|
The nbcon consoles will have two callbacks to be used for
different contexts. In order to determine if an nbcon console
is usable, console_is_usable() must know if it is a context
that will need to use the optional write_atomic() callback.
Also, nbcon_emit_next_record() must know which callback it
needs to call.
Add an extra parameter @use_atomic to console_is_usable() and
nbcon_emit_next_record() to specify this.
Since so far only the write_atomic() callback exists,
@use_atomic is set to true for all call sites.
For legacy consoles, @use_atomic is not used.
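For illustration, the dispatch inside nbcon_emit_next_record() can be
sketched like this (simplified; write_thread() arrives in a later
patch):

  /* Sketch: pick the callback that matches the calling context. */
  if (use_atomic)
          con->write_atomic(con, &wctxt);
  else
          con->write_thread(con, &wctxt);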
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240904120536.115780-5-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
|
|
Ensure consoles have flushed pending records before
unregistering. The console should print up to at least its
related "console disabled" record.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240904120536.115780-4-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
|
|
A follow-up change adds pr_flush() to console unregistration.
However, with boot consoles, unregistration can happen very
early if regular consoles are registering as well.
In this case the pr_flush() is not important because all
consoles are flushed when checking the initial console sequence
number.
Allow pr_flush() to fail if @system_state has not yet reached
SYSTEM_SCHEDULING. This avoids might_sleep() and msleep()
explosions that would otherwise occur:
[ 0.436739][ T0] printk: legacy console [ttyS0] enabled
[ 0.439820][ T0] printk: legacy bootconsole [earlyser0] disabled
[ 0.446822][ T0] BUG: scheduling while atomic: swapper/0/0/0x00000002
[ 0.450491][ T0] 1 lock held by swapper/0/0:
[ 0.457897][ T0] #0: ffffffff82ae5f88 (console_mutex){+.+.}-{4:4}, at: console_list_lock+0x20/0x70
[ 0.463141][ T0] Modules linked in:
[ 0.465307][ T0] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.10.0-rc1+ #372
[ 0.469394][ T0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
[ 0.474402][ T0] Call Trace:
[ 0.476246][ T0] <TASK>
[ 0.481473][ T0] dump_stack_lvl+0x93/0xb0
[ 0.483949][ T0] dump_stack+0x10/0x20
[ 0.486256][ T0] __schedule_bug+0x68/0x90
[ 0.488753][ T0] __schedule+0xb9b/0xd80
[ 0.491179][ T0] ? lock_release+0xb5/0x270
[ 0.493732][ T0] schedule+0x43/0x170
[ 0.495998][ T0] schedule_timeout+0xc5/0x1e0
[ 0.498634][ T0] ? __pfx_process_timeout+0x10/0x10
[ 0.501522][ T0] ? msleep+0x13/0x50
[ 0.503728][ T0] msleep+0x3c/0x50
[ 0.505847][ T0] __pr_flush.constprop.0.isra.0+0x56/0x500
[ 0.509050][ T0] ? _printk+0x58/0x80
[ 0.511332][ T0] ? lock_is_held_type+0x9c/0x110
[ 0.514106][ T0] unregister_console_locked+0xe1/0x450
[ 0.517144][ T0] register_console+0x509/0x620
[ 0.519827][ T0] ? __pfx_univ8250_console_init+0x10/0x10
[ 0.523042][ T0] univ8250_console_init+0x24/0x40
[ 0.525845][ T0] console_init+0x43/0x210
[ 0.528280][ T0] start_kernel+0x493/0x980
[ 0.530773][ T0] x86_64_start_reservations+0x18/0x30
[ 0.533755][ T0] x86_64_start_kernel+0xae/0xc0
[ 0.536473][ T0] common_startup_64+0x12c/0x138
[ 0.539210][ T0] </TASK>
And then the kernel goes into an infinite loop complaining about:
1. releasing a pinned lock
2. unpinning an unpinned lock
3. bad: scheduling from the idle thread!
4. goto 1
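The guard itself can be as small as the following sketch (placement
and exact form in __pr_flush() may differ):

  /* Too early to sleep: the scheduler is not yet running. */
  if (system_state < SYSTEM_SCHEDULING)
          return false;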
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240904120536.115780-3-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
|
|
Since ownership can be lost at any time due to handover or
takeover, a printing context _must_ be prepared to back out
immediately and carefully. However, there are scenarios where
the printing context must reacquire ownership in order to
finalize or revert hardware changes.
One such example is when interrupts are disabled during
printing. No other context will automagically re-enable the
interrupts. For this case, the disabling context _must_
reacquire nbcon ownership so that it can re-enable the
interrupts.
Provide nbcon_reacquire_nobuf() for exactly this purpose. It
allows a printing context to reacquire ownership using the same
priority as its previous ownership.
Note that after a successful reacquire the printing context
will have no output buffer because that has been lost. This
function cannot be used to resume printing.
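A hedged usage sketch from a driver's perspective; the device
operations are hypothetical:

  my_uart_irq_disable(port);           /* hypothetical device op */
  if (!nbcon_enter_unsafe(wctxt)) {
          /* Ownership lost: reacquire only to fix up the hardware. */
          nbcon_reacquire_nobuf(wctxt);
          my_uart_irq_enable(port);    /* revert; no buffer, no printing */
          return;
  }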
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240904120536.115780-2-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
|
|
There is no need to open code a non-migration-checking
this_cpu_ptr(). That is exactly what raw_cpu_ptr() is.
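For illustration (the variable name is just an example):

  /* Open coded, non-migration-checking: */
  ptr = per_cpu_ptr(&nbcon_pcpu_emergency_nesting, smp_processor_id());
  /* Equivalent and clearer: */
  ptr = raw_cpu_ptr(&nbcon_pcpu_emergency_nesting);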
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/87plpum4jw.fsf@jogness.linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
|
|
Building the kernel with W=1 generates the following warning:
kernel/cpu.c:2693: warning: This comment starts with '/**',
but isn't a kernel-doc comment.
The function topology_is_core_online() is a simple helper function and
doesn't need a kernel-doc comment.
Use a normal comment instead.
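The change boils down to the comment opener (body abbreviated):

  -/**
  - * topology_is_core_online - ...
  +/*
  + * topology_is_core_online - ...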
Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20240825221152.71951-2-thorsten.blum@toblux.com
|
|
Pull in upstream changes so further patches don't conflict.
|
|
Global timers could be expired remotely when the target CPU is idle. After
a remote timer expiry, the remote timer_base->next_expiry value is updated
while holding the timer_base->lock. When the formerly idle CPU becomes
active at the same time and checks whether timers need to expire, this
check is done lockless as it is on the local CPU. This could lead to a data
race, which was reported by syzbot:
https://lore.kernel.org/r/000000000000916e55061f969e14@google.com
When the value is read locklessly but changed by the remote CPU, only
two non-critical scenarios could happen:
1) The already updated value is read -> everything is perfect
2) The old value is read -> a superfluous timer soft interrupt is raised
The same situation could happen when a remote CPU enqueues a new first
pinned timer, also with non-critical scenarios:
1) The already updated value is read -> everything is perfect
2) The old value is read -> when the CPU is idle, an IPI is executed
nevertheless, and when the CPU isn't idle, the updated value will be
visible on the next tick and the timer might be one jiffie late.
As this is very unlikely to happen, doing the check under the lock is
way more overhead than a superfluous timer soft interrupt or a
possible one-jiffie delay of the timer.
Document and annotate this non critical behavior in the code by using
READ/WRITE_ONCE() pair when accessing timer_base->next_expiry.
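A sketch of the resulting pattern (simplified):

  /* Writer, holding timer_base->lock: */
  WRITE_ONCE(base->next_expiry, expiry);

  /* Lockless reader on the local CPU; a stale value is harmless: */
  if (time_after_eq(jiffies, READ_ONCE(base->next_expiry)))
          raise_softirq(TIMER_SOFTIRQ);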
Reported-by: syzbot+bf285fcc0a048e028118@syzkaller.appspotmail.com
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20240829154305.19259-1-anna-maria@linutronix.de
Closes: https://lore.kernel.org/lkml/000000000000916e55061f969e14@google.com
|
|
sizeof(unsigned long) * 8 is the number of bits in an unsigned long
variable. Replace it with the BITS_PER_LONG macro to make it simpler.
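For example (context illustrative):

  - if (bit >= sizeof(unsigned long) * 8)
  + if (bit >= BITS_PER_LONG)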
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Reviewed-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240903035358.308482-1-ruanjinjie@huawei.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
|
|
Use the str_enabled_disabled() helper instead of open coding
the same.
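For example (call site illustrative):

  - pr_info("feature %s\n", enabled ? "enabled" : "disabled");
  + pr_info("feature %s\n", str_enabled_disabled(enabled));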
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
|
|
To prevent uninitialized members, use kzalloc() to allocate
the xol area.
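The fix is a one-liner of this shape:

  - area = kmalloc(sizeof(*area), GFP_KERNEL);
  + area = kzalloc(sizeof(*area), GFP_KERNEL);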
Fixes: b059a453b1cf1 ("x86/vdso: Add mremap hook to vm_special_mapping")
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20240903102313.3402529-1-svens@linux.ibm.com
|
|
Fix the condition to exclude the elfcorehdr segment from the SHA digest
calculation.
The j iterator is an index into the output sha_regions[] array, not into
the input image->segment[] array. Once it reaches
image->elfcorehdr_index, all subsequent segments are excluded. Besides,
if the purgatory segment precedes the elfcorehdr segment, the elfcorehdr
may be wrongly included in the calculation.
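The corrected check compares the segment index, not the sha_regions[]
index (sketch):

  - if (j == image->elfcorehdr_index)
  + if (i == image->elfcorehdr_index)
            continue;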
Link: https://lkml.kernel.org/r/20240805150750.170739-1-petr.tesarik@suse.com
Fixes: f7cc804a9fd4 ("kexec: exclude elfcorehdr from the segment digest")
Signed-off-by: Petr Tesarik <ptesarik@suse.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Hari Bathini <hbathini@linux.ibm.com>
Cc: Sourabh Jain <sourabhjain@linux.ibm.com>
Cc: Eric DeVolder <eric_devolder@yahoo.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
- x2apic_disable() clears x2apic_state and x2apic_mode unconditionally,
even when the state is X2APIC_ON_LOCKED, which prevents the kernel
from disabling it, thereby creating inconsistent state.
Reorder the logic so it actually works correctly
- The XSTATE logic for handling LBR is incorrect as it assumes that
XSAVES supports LBR when the CPU supports LBR. In fact both
conditions need to be true. Otherwise the enablement of LBR in the
IA32_XSS MSR fails and subsequently the machine crashes on the next
XRSTORS operation because IA32_XSS is not initialized.
Cache the XSTATE support bit during init and make the related
functions use this cached information and the LBR CPU feature bit to
cure this.
- Cure a long standing bug in KASLR
KASLR uses the full address space between PAGE_OFFSET and vaddr_end
to randomize the starting points of the direct map, vmalloc and
vmemmap regions. It thereby limits the size of the direct map by
using the installed memory size plus an extra configurable margin for
hot-plug memory. This limitation is done to gain more randomization
space because otherwise only the holes between the direct map,
vmalloc, vmemmap and vaddr_end would be usable for randomizing.
The limited direct map size is not exposed to the rest of the kernel,
so the memory hot-plug and resource management related code paths
still operate under the assumption that the available address space
can be determined with MAX_PHYSMEM_BITS.
request_free_mem_region() allocates from (1 << MAX_PHYSMEM_BITS) - 1
downwards. That means the first allocation happens past the end of
the direct map and if unlucky this address is in the vmalloc space,
which causes high_memory to become greater than VMALLOC_START and
consequently causes iounmap() to fail for valid ioremap addresses.
Cure this by exposing the end of the direct map via PHYSMEM_END and
use that for the memory hot-plug and resource management related
places instead of relying on MAX_PHYSMEM_BITS. In the KASLR case
PHYSMEM_END maps to a variable which is initialized by the KASLR
initialization and otherwise it is based on MAX_PHYSMEM_BITS as
before.
- Prevent a data leak in mmio_read(). The TDVMCALL exposes the value of
an uninitialized variable on the stack to the VMM. The variable is
only required as an output value, so it does not have to be exposed to
the VMM in the first place.
- Prevent an array overrun in the resource control code on systems with
Sub-NUMA Clustering enabled because the code failed to adjust the
index by the number of SNC nodes per L3 cache.
* tag 'x86-urgent-2024-09-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/resctrl: Fix arch_mbm_* array overrun on SNC
x86/tdx: Fix data leak in mmio_read()
x86/kaslr: Expose and use the end of the physical memory address space
x86/fpu: Avoid writing LBR bit to IA32_XSS unless supported
x86/apic: Make x2apic_disable() work correctly
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fix from Thomas Gleixner:
"A single fix for rt_mutex.
The deadlock detection code drops into an infinite scheduling loop
while still holding rt_mutex::wait_lock, which rightfully triggers a
'scheduling in atomic' warning.
Unlock it before that"
* tag 'locking-urgent-2024-08-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
rtmutex: Drop rt_mutex::wait_lock before scheduling
|
|
This patch introduces CONFIG_CPUSETS_V1 and guards the cpuset-v1 code
under CONFIG_CPUSETS_V1. The default value of CONFIG_CPUSETS_V1 is N,
so that users who have adopted v2 don't have to 'pay' for cpuset v1.
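A rough sketch of the Kconfig entry (help text omitted):

  config CPUSETS_V1
          bool "Legacy cgroup v1 cpusets controller"
          depends on CPUSETS
          default n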
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Some function names declared in cpuset-internal.h are generic. To avoid
conflicts with other variables of the same name, rename these
functions with a cpuset_/cpuset1_ prefix to make them unique to cpuset.
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Move the legacy cpuset controller interface files and corresponding code
into cpuset-v1.c. 'update_flag', 'cpuset_write_resmask' and
'cpuset_common_seq_show' are also used for v1, so declare them in
cpuset-internal.h.
'cpuset_write_s64', 'cpuset_read_s64' and 'fmeter_getrate' are only used
in cpuset-v1.c now, so make them static.
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
The validate_change_legacy() function is used for v1; move it to
cpuset-v1.c. The two macros 'cpuset_for_each_child' and
'cpuset_for_each_descendant_pre' are common to v1 and v2; move them to
cpuset-internal.h.
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
There are some differences in hotplug update handling between cpuset v1
and cpuset v2. Move the legacy code to cpuset-v1.c.
'update_tasks_cpumask' and 'update_tasks_nodemask' are used by both cpuset
v1 and cpuset v2, so declare them in cpuset-internal.h.
The change from the original code is to use the callback_lock helpers
to lock/unlock callback_lock.
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
To modify a cpuset, both cpuset_mutex and callback_lock are needed. Add
helpers for cpuset-v1 to take callback_lock.
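A hedged sketch of such helpers (exact names may differ):

  void cpuset_callback_lock_irq(void)
  {
          spin_lock_irq(&callback_lock);
  }

  void cpuset_callback_unlock_irq(void)
  {
          spin_unlock_irq(&callback_lock);
  }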
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
'memory_spread' is only set in cpuset v1. Move the corresponding code
into cpuset-v1.c.
For now, 'cpuset_update_task_spread_flags' and 'update_tasks_flags' are
exposed to cpuset.c.
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Setting the domain level is not supported in cpuset v2, so move the
corresponding code into cpuset-v1.c.
'cpuset_write_s64' and 'cpuset_read_s64' are only used for setting the
domain level; move them to cpuset-v1.c. For now they are exposed to
cpuset.c; once the cpuset legacy interface files are moved to
cpuset-v1.c, they can become static. 'rebuild_sched_domains_locked' is
exposed to cpuset-v1.c.
The change from the original code is to use the 'cpuset_lock' and
'cpuset_unlock' functions to lock/unlock cpuset_mutex.
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Collection of memory_pressure can be enabled by writing 1 to the cpuset
file 'memory_pressure_enabled', which exists only in cpuset-v1.
Therefore, move the corresponding code to cpuset-v1.c.
Currently, the 'fmeter_init' and 'fmeter_getrate' functions are called
from cpuset.c, so expose them to cpuset.c.
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Move some declarations that will be used by both cpuset v1 and v2,
including 'struct cpuset', 'cpuset_flagbits_t', 'cpuset_filetype_t', etc.
No logical change.
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
This patch introduces the cgroup/cpuset-v1.c source file which will be
used for all legacy (cgroup v1) cpuset cgroup code. It also introduces
cgroup/cpuset-internal.h to keep declarations shared between
cgroup/cpuset.c and cgroup/cpuset-v1.c.
As of now, let's compile it if CONFIG_CPUSETS is set. Later on it can be
switched to use a separate config option, so that the legacy code won't be
compiled if not required.
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
With the "isolcpus" boot command line parameter, we are able to
create isolated CPUs at boot time. These isolated CPUs aren't fully
accounted for in the cpuset code. For instance, the root cgroup's
"cpuset.cpus.isolated" control file does not include the boot time
isolated CPUs. Fix that by looking for pre-isolated CPUs at init time.
The prstate_housekeeping_conflict() function does check the
HK_TYPE_DOMAIN housekeeping cpumask to make sure that CPUs outside of it
can only be used in an isolated partition. Given the fact that we are going
to make housekeeping cpumasks dynamic, the current check may not be right
anymore. Save the boot time HK_TYPE_DOMAIN cpumask and check against
it instead of the upcoming dynamic HK_TYPE_DOMAIN housekeeping cpumask.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
The pointer returned by btf_parse_base could be an error pointer.
IS_ERR() check is needed before calling btf_free(base_btf).
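The fix amounts to guarding the free (sketch):

  if (!IS_ERR(base_btf))
          btf_free(base_btf);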
Fixes: 8646db238997 ("libbpf,bpf: Share BTF relocate-related code with kernel")
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Alan Maguire <alan.maguire@oracle.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20240830012214.1646005-1-martin.lau@linux.dev
|
|
Use perf_allow_kernel() for 'pa_enable' (physical addresses),
'pct_enable' (physical timestamps) and context IDs. This means that
perf_event_paranoid is now taken into account and LSM hooks can be used,
which is more consistent with other perf_event_open calls. For example
PERF_SAMPLE_PHYS_ADDR uses perf_allow_kernel() rather than just
perfmon_capable().
This also indirectly fixes the following error message which is
misleading because perf_event_paranoid is not taken into account by
perfmon_capable():
$ perf record -e arm_spe/pa_enable/
Error:
Access to performance monitoring and observability operations is
limited. Consider adjusting /proc/sys/kernel/perf_event_paranoid
setting ...
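The change is roughly of this shape (sketch; the surrounding checks in
the SPE driver are abbreviated):

  - if (!perfmon_capable())
  -         return -EACCES;
  + return perf_allow_kernel(&event->attr);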
Suggested-by: Al Grant <al.grant@arm.com>
Signed-off-by: James Clark <james.clark@linaro.org>
Link: https://lore.kernel.org/r/20240827145113.1224604-1-james.clark@linaro.org
Link: https://lore.kernel.org/all/20240807120039.GD37996@noisy.programming.kicks-ass.net/
Signed-off-by: Will Deacon <will@kernel.org>
|
|
In the case where we are forcing the ps.chunk_size to be at least 1,
we are ignoring the caller's alignment.
Move the forcing of ps.chunk_size to be at least 1 before rounding it
up to caller's alignment, so that caller's alignment is honored.
While at it, use max() to force the ps.chunk_size to be at least 1 to
improve readability.
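The resulting order (sketch):

  /* Force at least 1 first, then honor the caller's alignment. */
  ps.chunk_size = max(ps.chunk_size, 1ul);
  ps.chunk_size = roundup(ps.chunk_size, job->align);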
Fixes: 6d45e1c948a8 ("padata: Fix possible divide-by-0 panic in padata_mt_helper()")
Signed-off-by: Kamlesh Gurudasani <kamlesh@ti.com>
Acked-by: Waiman Long <longman@redhat.com>
Acked-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
Use IS_ERR_OR_NULL() instead of open coding a NULL and an error pointer
check.
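For example:

  - if (!ptr || IS_ERR(ptr))
  + if (IS_ERR_OR_NULL(ptr))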
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20240828122724.3697447-1-lihongbo22@huawei.com
|
|
Let kmemdup_array() take care of sizing instead of doing it open coded.
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20240828072219.1249250-1-ruanjinjie@huawei.com
|
|
Currently, when the affinity of an irq cannot be set due to lack of
permission, write_irq_affinity() returns the error code -EIO.
Change the return value to -EPERM as that reflects the cause of the
error correctly.
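The change in write_irq_affinity() is a one-liner of this shape:

  if (!irq_can_set_affinity_usr(irq) || no_irq_affinity)
  -        return -EIO;
  +        return -EPERM;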
Signed-off-by: Jeff Xie <jeff.xie@linux.dev>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20240826145805.5938-1-jeff.xie@linux.dev
|
|
irq_move_pending() encapsulates irqd_is_setaffinity_pending() depending on
CONFIG_GENERIC_PENDING_IRQ.
Replace the open coded #ifdeffery with it.
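For example (sketch):

  /* Open coded: */
  #ifdef CONFIG_GENERIC_PENDING_IRQ
          pending = irqd_is_setaffinity_pending(&desc->irq_data);
  #else
          pending = false;
  #endif

  /* With the helper: */
  pending = irq_move_pending(&desc->irq_data);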
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20240829111522.230595-1-ruanjinjie@huawei.com
|
|
The kernel already knows at the time of interrupt allocation whether
affinity of an interrupt can be controlled by userspace or not.
It still creates all related procfs control files with read/write
permissions. That's inconsistent and non-intuitive for system
administrators and tools.
Therefore set the file permissions to read-only for such interrupts.
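A hedged sketch of the registration change (names approximate):

  /* Read-only unless userspace may actually change the affinity. */
  umode_t mode = irq_can_set_affinity_usr(irq) ? 0644 : 0444;
  proc_create_data("smp_affinity", mode, desc->dir,
                   &irq_affinity_proc_ops, irqp);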
[ tglx: Massage change log, fixed UP build ]
Signed-off-by: Jeff Xie <jeff.xie@linux.dev>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20240825131911.107119-1-jeff.xie@linux.dev
|
|
msleep() and msleep_interruptible() add a jiffie to the requested timeout.
This extra jiffie was introduced to ensure that the timeout will not happen
earlier than specified.
Since the rework of the timer wheel, the enqueue path already takes care of
this. So the extra jiffie added by msleep*() is pointless now.
Remove this extra jiffie in msleep() and msleep_interruptible().
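The removal is of this shape:

  - unsigned long timeout = msecs_to_jiffies(msecs) + 1;
  + unsigned long timeout = msecs_to_jiffies(msecs);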
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lore.kernel.org/all/20240829074133.4547-1-anna-maria@linutronix.de
|
|
In a few audit records, PIDs were being recorded with task_pid_nr()
instead of task_tgid_nr().
$ grep "task_pid_nr" kernel/audit*.c
audit.c: task_pid_nr(current),
auditfilter.c: pid = task_pid_nr(current);
auditsc.c: audit_log_format(ab, " pid=%u", task_pid_nr(current));
For single-thread applications, the process id (pid) and the thread
group id (tgid) are the same. However, on multi-thread applications,
task_pid_nr() returns the current thread id (user-space's TID), while
task_tgid_nr() returns the main thread id (user-space's PID). Since
the users are more interested in the process id (pid), rather than the
thread id (tid), this patch converts these callers to the correct method.
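The conversion is mechanical, e.g.:

  - audit_log_format(ab, " pid=%u", task_pid_nr(current));
  + audit_log_format(ab, " pid=%u", task_tgid_nr(current));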
Link: https://github.com/linux-audit/audit-kernel/issues/126
Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Ricardo Robaina <rrobaina@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
|
|
Kunkun Jiang reports that for a workload involving the simultaneous startup
of a large number of VMs (for a total of about 200 vcpus), a lot of CPU
time gets spent on spinning on the tmp_mask_lock that exists as a static
raw spinlock in irq_do_set_affinity(). This lock protects a global cpumask
(tmp_mask) that is used as a temporary variable to compute the resulting
affinity.
While this is triggered by KVM issuing an irq_set_affinity() call each
time a vcpu is about to execute, it is obvious that having a single
global resource is not very scalable.
Since a cpumask can be a fairly large structure on systems with a high core
count, a stack allocation is not really appropriate. Instead, turn the
global cpumask into a per-CPU variable, removing the need for locking
altogether as the code is executed with preemption and interrupts disabled.
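A sketch of the conversion (declaration placed outside the function, as
noted below):

  static DEFINE_PER_CPU(struct cpumask, __tmp_mask);

  /* In irq_do_set_affinity(), with preemption and interrupts disabled: */
  struct cpumask *tmp_mask = this_cpu_ptr(&__tmp_mask);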
[ tglx: Moved the per CPU variable declaration outside of the function ]
Reported-by: Kunkun Jiang <jiangkunkun@huawei.com>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Kunkun Jiang <jiangkunkun@huawei.com>
Link: https://lore.kernel.org/all/20240826080618.3886694-1-maz@kernel.org
Link: https://lore.kernel.org/all/a7fc58e4-64c2-77fc-c1dc-f5eb78dbbb01@huawei.com
|
|
gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
"VFS:
- Ensure that backing files uses file->f_ops->splice_write() for
splice
netfs:
- Revert the removal of PG_private_2 from netfs_release_folio() as
cephfs still relies on this
- When AS_RELEASE_ALWAYS is set on a mapping the folio needs to
always be invalidated during truncation
- Fix losing untruncated data in a folio by letting
netfs_release_folio() return false if the folio is dirty
- Fix trimming of streaming-write folios in netfs_inval_folio()
- Reset iterator before retrying a short read
- Fix interaction of streaming writes with zero-point tracker
afs:
- During truncation afs currently calls truncate_setsize() which sets
i_size, expands the pagecache and truncates it. The first two
operations aren't needed because they will have already been done.
So call truncate_pagecache() instead and skip the redundant parts
overlayfs:
- Fix checking of the number of allowed lower layers so 500 layers
can actually be used instead of just 499
- Add missing '\n' to pr_err() output
- Pass string to ovl_parse_layer() and thus allow it to be used for
Opt_lowerdir as well
pidfd:
- Revert blocking the creation of pidfds for kthread as apparently
userspace relies on this. Specifically, it breaks systemd during
shutdown
romfs:
- Fix romfs_read_folio() to use the correct offset with
folio_zero_tail()"
* tag 'vfs-6.11-rc6.fixes' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs:
netfs: Fix interaction of streaming writes with zero-point tracker
netfs: Fix missing iterator reset on retry of short read
netfs: Fix trimming of streaming-write folios in netfs_inval_folio()
netfs: Fix netfs_release_folio() to say no if folio dirty
afs: Fix post-setattr file edit to do truncation correctly
mm: Fix missing folio invalidation calls during truncation
ovl: ovl_parse_param_lowerdir: Add missed '\n' for pr_err
ovl: fix wrong lowerdir number check for parameter Opt_lowerdir
ovl: pass string to ovl_parse_layer()
backing-file: convert to using fops->splice_write
Revert "pidfd: prevent creation of pidfds for kthreads"
romfs: fix romfs_read_folio()
netfs, ceph: Partially revert "netfs: Replace PG_fscache by setting folio->private and marking dirty"
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
"Three patches addressing cpuset corner cases"
* tag 'cgroup-for-6.11-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup/cpuset: Eliminate unncessary sched domains rebuilds in hotplug
cgroup/cpuset: Clear effective_xcpus on cpus_allowed clearing only if cpus.exclusive not set
cgroup/cpuset: fix panic caused by partcmd_update
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue fixes from Tejun Heo:
"Nothing too interesting. One patch to remove spurious warning and
others to address static checker warnings"
* tag 'wq-for-6.11-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: Correct declaration of cpu_pwq in struct workqueue_struct
workqueue: Fix spruious data race in __flush_work()
workqueue: Remove incorrect "WARN_ON_ONCE(!list_empty(&worker->entry));" from dying worker
workqueue: Fix UBSAN 'subtraction overflow' error in shift_and_mask()
workqueue: doc: Fix function name, remove markers
|
|
The timerslack_ns setting is used to specify how much the hardware
timers should be delayed, to potentially dispatch multiple timers in a
single interrupt. This is a performance optimization. Timers of
realtime tasks (having a realtime scheduling policy) should not be
delayed.
This logic was inconsistently applied to the hrtimers, leading to delays
of realtime tasks which used timed waits for events (e.g. condition
variables). Due to the downstream override of the slack for rt tasks,
the procfs reported incorrect (non-zero) timerslack_ns values.
This is changed by setting the timer_slack_ns task attribute to 0 for
all tasks with a rt policy. By that, downstream users do not need to
specially handle rt tasks (w.r.t. the slack), and the procfs entry
shows the correct value of "0". Setting non-zero slack values (either
via procfs or PR_SET_TIMERSLACK) on tasks with an rt policy is ignored,
as stated in "man 2 PR_SET_TIMERSLACK":
Timer slack is not applied to threads that are scheduled under a
real-time scheduling policy (see sched_setscheduler(2)).
The special handling of timerslack on rt tasks in downstream users
is removed as well.
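A hedged sketch of the core of the change (placement approximate):

  /* Realtime tasks always run with zero timer slack. */
  if (task_is_realtime(tsk))
          tsk->timer_slack_ns = 0;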
Signed-off-by: Felix Moessbauer <felix.moessbauer@siemens.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20240814121032.368444-2-felix.moessbauer@siemens.com
|
|
This reverts commit 3b5bbe798b2451820e74243b738268f51901e7d0.
Eric reported that systemd-shutdown gets broken by blocking the creation
of pidfds for kthreads, as older versions seem to rely on being able to
create a pidfd for any process in /proc.
Reported-by: Eric Biggers <ebiggers@kernel.org>
Link: https://lore.kernel.org/r/20240818035818.GA1929@sol.localdomain
Signed-off-by: Christian Brauner <brauner@kernel.org>
|