summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2024-09-04printk: Implement legacy printer kthread for PREEMPT_RTJohn Ogness
The write() callback of legacy consoles usually makes use of spinlocks. This is not permitted with PREEMPT_RT in atomic contexts. For PREEMPT_RT, create a new kthread to handle printing of all the legacy consoles (and nbcon consoles if boot consoles are registered). This allows legacy consoles to work on PREEMPT_RT without requiring modification. (However they will not have the reliability properties guaranteed by nbcon atomic consoles.) Use the existing printk_kthreads_check_locked() to start/stop the legacy kthread as needed. Introduce the macro force_legacy_kthread() to query if the forced threading of legacy consoles is in effect. Although currently only enabled for PREEMPT_RT, this acts as a simple mechanism for the future to allow other preemption models to easily take advantage of the non-interference property provided by the legacy kthread. When force_legacy_kthread() is true, the legacy kthread fulfills the role of the console_flush_type @legacy_offload by waking the legacy kthread instead of printing via the console_lock in the irq_work. If the legacy kthread is not yet available, no legacy printing takes place (unless in panic). If for some reason the legacy kthread fails to create, any legacy consoles are unregistered. With force_legacy_kthread(), the legacy kthread is a critical component for legacy consoles. These changes only affect CONFIG_PREEMPT_RT. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240904120536.115780-16-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-09-04printk: nbcon: Show replay message on takeoverJohn Ogness
An emergency or panic context can takeover console ownership while the current owner was printing a printk message. The atomic printer will re-print the message that the previous owner was printing. However, this can look confusing to the user and may even seem as though a message was lost. [3430014.1 [3430014.181123] usb 1-2: Product: USB Audio Add a new field @nbcon_prev_seq to struct console to track the sequence number to print that was assigned to the previous console owner. If this matches the sequence number to print that the current owner is assigned, then a takeover must have occurred. In this case, print an additional message to inform the user that the previous message is being printed again. [3430014.1 ** replaying previous printk message ** [3430014.181123] usb 1-2: Product: USB Audio Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240904120536.115780-12-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-09-04printk: Provide helper for message prependingJohn Ogness
In order to support prepending different texts to printk messages, split out the prepending code into a helper function. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240904120536.115780-11-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-09-04printk: nbcon: Rely on kthreads for normal operationJohn Ogness
Once the kthread is running and available (i.e. @printk_kthreads_running is set), the kthread becomes responsible for flushing any pending messages which are added in NBCON_PRIO_NORMAL context. Namely the legacy console_flush_all() and device_release() no longer flush the console. And nbcon_atomic_flush_pending() used by nbcon_cpu_emergency_exit() no longer flushes messages added after the emergency messages. The console context is safe when used by the kthread only when one of the following conditions are true: 1. Other caller acquires the console context with NBCON_PRIO_NORMAL with preemption disabled. It will release the context before rescheduling. 2. Other caller acquires the console context with NBCON_PRIO_NORMAL under the device_lock. 3. The kthread is the only context which acquires the console with NBCON_PRIO_NORMAL. This is satisfied for all atomic printing call sites: nbcon_legacy_emit_next_record() (#1) nbcon_atomic_flush_pending_con() (#1) nbcon_device_release() (#2) It is even double guaranteed when @printk_kthreads_running is set because then _only_ the kthread will print for NBCON_PRIO_NORMAL. (#3) Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240904120536.115780-10-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-09-04printk: nbcon: Use thread callback if in task context for legacyJohn Ogness
When printing via console_lock, the write_atomic() callback is used for nbcon consoles. However, if it is known that the current context is a task context, the write_thread() callback can be used instead. Using write_thread() instead of write_atomic() helps to reduce large disabled preemption regions when the device_lock does not disable preemption. This is mainly a preparatory change to allow avoiding write_atomic() completely during normal operation if boot consoles are registered. As a side-effect, it also allows consolidating the printing code for legacy printing and the kthread printer. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240904120536.115780-9-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-09-04printk: nbcon: Relocate nbcon_atomic_emit_one()John Ogness
Move nbcon_atomic_emit_one() so that it can be used by nbcon_kthread_func() in a follow-up commit. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240904120536.115780-8-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-09-04printk: nbcon: Introduce printer kthreadsThomas Gleixner
Provide the main implementation for running a printer kthread per nbcon console that is takeover/handover aware. This includes: - new mandatory write_thread() callback - kthread creation - kthread main printing loop - kthread wakeup mechanism - kthread shutdown kthread creation is a bit tricky because consoles may register before kthreads can be created. In such cases, registration will succeed, even though no kthread exists. Once kthreads can be created, an early_initcall will set @printk_kthreads_ready. If there are no registered boot consoles, the early_initcall creates the kthreads for all registered nbcon consoles. If kthread creation fails, the related console is unregistered. If there are registered boot consoles when @printk_kthreads_ready is set, no kthreads are created until the final boot console unregisters. Once kthread creation finally occurs, @printk_kthreads_running is set so that the system knows kthreads are available for all registered nbcon consoles. If @printk_kthreads_running is already set when the console is registering, the kthread is created during registration. If kthread creation fails, the registration will fail. Until @printk_kthreads_running is set, console printing occurs directly via the console_lock. kthread shutdown on system shutdown/reboot is necessary to ensure the printer kthreads finish their printing so that the system can cleanly transition back to direct printing via the console_lock in order to reliably push out the final shutdown/reboot messages. @printk_kthreads_running is cleared before shutting down the individual kthreads. The kthread uses a new mandatory write_thread() callback that is called with both device_lock() and the console context acquired. The console ownership handling is necessary for synchronization against write_atomic() which is synchronized only via the console context ownership. The device_lock() serializes acquiring the console context with NBCON_PRIO_NORMAL. It is needed in case the device_lock() does not disable preemption. It prevents the following race: CPU0 CPU1 [ task A ] nbcon_context_try_acquire() # success with NORMAL prio # .unsafe == false; // safe for takeover [ schedule: task A -> B ] WARN_ON() nbcon_atomic_flush_pending() nbcon_context_try_acquire() # success with EMERGENCY prio # flushing nbcon_context_release() # HERE: con->nbcon_state is free # to take by anyone !!! nbcon_context_try_acquire() # success with NORMAL prio [ task B ] [ schedule: task B -> A ] nbcon_enter_unsafe() nbcon_context_can_proceed() BUG: nbcon_context_can_proceed() returns "true" because the console is owned by a context on CPU0 with NBCON_PRIO_NORMAL. But it should return "false". The console is owned by a context from task B and we do the check in a context from task A. Note that with these changes, the printer kthreads do not yet take over full responsibility for nbcon printing during normal operation. These changes only focus on the lifecycle of the kthreads. Co-developed-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Thomas Gleixner (Intel) <tglx@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240904120536.115780-7-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-09-04printk: nbcon: Init @nbcon_seq to highest possibleJohn Ogness
When initializing an nbcon console, have nbcon_alloc() set @nbcon_seq to the highest possible sequence number. For all practical purposes, this will guarantee that the console will have nothing to print until later when @nbcon_seq is set to the proper initial printing value. This will be particularly important once kthread printing is introduced because nbcon_alloc() can create/start the kthread before the desired initial sequence number is known. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240904120536.115780-6-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-09-04printk: nbcon: Add context to usable() and emit()John Ogness
The nbcon consoles will have two callbacks to be used for different contexts. In order to determine if an nbcon console is usable, console_is_usable() must know if it is a context that will need to use the optional write_atomic() callback. Also, nbcon_emit_next_record() must know which callback it needs to call. Add an extra parameter @use_atomic to console_is_usable() and nbcon_emit_next_record() to specify this. Since so far only the write_atomic() callback exists, @use_atomic is set to true for all call sites. For legacy consoles, @use_atomic is not used. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240904120536.115780-5-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-09-04printk: Flush console on unregister_console()John Ogness
Ensure consoles have flushed pending records before unregistering. The console should print up to at least its related "console disabled" record. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240904120536.115780-4-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-09-04printk: Fail pr_flush() if before SYSTEM_SCHEDULINGJohn Ogness
A follow-up change adds pr_flush() to console unregistration. However, with boot consoles unregistration can happen very early if there are also regular consoles registering as well. In this case the pr_flush() is not important because all consoles are flushed when checking the initial console sequence number. Allow pr_flush() to fail if @system_state has not yet reached SYSTEM_SCHEDULING. This avoids might_sleep() and msleep() explosions that would otherwise occur: [ 0.436739][ T0] printk: legacy console [ttyS0] enabled [ 0.439820][ T0] printk: legacy bootconsole [earlyser0] disabled [ 0.446822][ T0] BUG: scheduling while atomic: swapper/0/0/0x00000002 [ 0.450491][ T0] 1 lock held by swapper/0/0: [ 0.457897][ T0] #0: ffffffff82ae5f88 (console_mutex){+.+.}-{4:4}, at: console_list_lock+0x20/0x70 [ 0.463141][ T0] Modules linked in: [ 0.465307][ T0] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.10.0-rc1+ #372 [ 0.469394][ T0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014 [ 0.474402][ T0] Call Trace: [ 0.476246][ T0] <TASK> [ 0.481473][ T0] dump_stack_lvl+0x93/0xb0 [ 0.483949][ T0] dump_stack+0x10/0x20 [ 0.486256][ T0] __schedule_bug+0x68/0x90 [ 0.488753][ T0] __schedule+0xb9b/0xd80 [ 0.491179][ T0] ? lock_release+0xb5/0x270 [ 0.493732][ T0] schedule+0x43/0x170 [ 0.495998][ T0] schedule_timeout+0xc5/0x1e0 [ 0.498634][ T0] ? __pfx_process_timeout+0x10/0x10 [ 0.501522][ T0] ? msleep+0x13/0x50 [ 0.503728][ T0] msleep+0x3c/0x50 [ 0.505847][ T0] __pr_flush.constprop.0.isra.0+0x56/0x500 [ 0.509050][ T0] ? _printk+0x58/0x80 [ 0.511332][ T0] ? lock_is_held_type+0x9c/0x110 [ 0.514106][ T0] unregister_console_locked+0xe1/0x450 [ 0.517144][ T0] register_console+0x509/0x620 [ 0.519827][ T0] ? __pfx_univ8250_console_init+0x10/0x10 [ 0.523042][ T0] univ8250_console_init+0x24/0x40 [ 0.525845][ T0] console_init+0x43/0x210 [ 0.528280][ T0] start_kernel+0x493/0x980 [ 0.530773][ T0] x86_64_start_reservations+0x18/0x30 [ 0.533755][ T0] x86_64_start_kernel+0xae/0xc0 [ 0.536473][ T0] common_startup_64+0x12c/0x138 [ 0.539210][ T0] </TASK> And then the kernel goes into an infinite loop complaining about: 1. releasing a pinned lock 2. unpinning an unpinned lock 3. bad: scheduling from the idle thread! 4. goto 1 Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240904120536.115780-3-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-09-04printk: nbcon: Add function for printers to reacquire ownershipJohn Ogness
Since ownership can be lost at any time due to handover or takeover, a printing context _must_ be prepared to back out immediately and carefully. However, there are scenarios where the printing context must reacquire ownership in order to finalize or revert hardware changes. One such example is when interrupts are disabled during printing. No other context will automagically re-enable the interrupts. For this case, the disabling context _must_ reacquire nbcon ownership so that it can re-enable the interrupts. Provide nbcon_reacquire_nobuf() for exactly this purpose. It allows a printing context to reacquire ownership using the same priority as its previous ownership. Note that after a successful reacquire the printing context will have no output buffer because that has been lost. This function cannot be used to resume printing. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240904120536.115780-2-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-09-04printk: nbcon: Use raw_cpu_ptr() instead of open codingJohn Ogness
There is no need to open code a non-migration-checking this_cpu_ptr(). That is exactly what raw_cpu_ptr() is. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/87plpum4jw.fsf@jogness.linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-09-04cpu: Fix W=1 build kernel-doc warningThorsten Blum
Building the kernel with W=1 generates the following warning: kernel/cpu.c:2693: warning: This comment starts with '/**', but isn't a kernel-doc comment. The function topology_is_core_online() is a simple helper function and doesn't need a kernel-doc comment. Use a normal comment instead. Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20240825221152.71951-2-thorsten.blum@toblux.com
2024-09-04Merge branch 'linus' into smp/coreThomas Gleixner
Pull in upstream changes so further patches don't conflict.
2024-09-04timers: Annotate possible non critical data race of next_expiryAnna-Maria Behnsen
Global timers could be expired remotely when the target CPU is idle. After a remote timer expiry, the remote timer_base->next_expiry value is updated while holding the timer_base->lock. When the formerly idle CPU becomes active at the same time and checks whether timers need to expire, this check is done lockless as it is on the local CPU. This could lead to a data race, which was reported by sysbot: https://lore.kernel.org/r/000000000000916e55061f969e14@google.com When the value is read lockless but changed by the remote CPU, only two non critical scenarios could happen: 1) The already update value is read -> everything is perfect 2) The old value is read -> a superfluous timer soft interrupt is raised The same situation could happen when enqueueing a new first pinned timer by a remote CPU also with non critical scenarios: 1) The already update value is read -> everything is perfect 2) The old value is read -> when the CPU is idle, an IPI is executed nevertheless and when the CPU isn't idle, the updated value will be visible on the next tick and the timer might be late one jiffie. As this is very unlikely to happen, the overhead of doing the check under the lock is a way more effort, than a superfluous timer soft interrupt or a possible 1 jiffie delay of the timer. Document and annotate this non critical behavior in the code by using READ/WRITE_ONCE() pair when accessing timer_base->next_expiry. Reported-by: syzbot+bf285fcc0a048e028118@syzkaller.appspotmail.com Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/all/20240829154305.19259-1-anna-maria@linutronix.de Closes: https://lore.kernel.org/lkml/000000000000916e55061f969e14@google.com
2024-09-04printk: Use the BITS_PER_LONG macroJinjie Ruan
sizeof(unsigned long) * 8 is the number of bits in an unsigned long variable, replace it with BITS_PER_LONG macro to make it simpler. Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com> Reviewed-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240903035358.308482-1-ruanjinjie@huawei.com Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-09-03audit: Make use of str_enabled_disabled() helperHongbo Li
Use str_enabled_disabled() helper instead of open coding the same. Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
2024-09-03uprobes: Use kzalloc to allocate xol areaSven Schnelle
To prevent unitialized members, use kzalloc to allocate the xol area. Fixes: b059a453b1cf1 ("x86/vdso: Add mremap hook to vm_special_mapping") Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20240903102313.3402529-1-svens@linux.ibm.com
2024-09-01kexec_file: fix elfcorehdr digest exclusion when CONFIG_CRASH_HOTPLUG=yPetr Tesarik
Fix the condition to exclude the elfcorehdr segment from the SHA digest calculation. The j iterator is an index into the output sha_regions[] array, not into the input image->segment[] array. Once it reaches image->elfcorehdr_index, all subsequent segments are excluded. Besides, if the purgatory segment precedes the elfcorehdr segment, the elfcorehdr may be wrongly included in the calculation. Link: https://lkml.kernel.org/r/20240805150750.170739-1-petr.tesarik@suse.com Fixes: f7cc804a9fd4 ("kexec: exclude elfcorehdr from the segment digest") Signed-off-by: Petr Tesarik <ptesarik@suse.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Hari Bathini <hbathini@linux.ibm.com> Cc: Sourabh Jain <sourabhjain@linux.ibm.com> Cc: Eric DeVolder <eric_devolder@yahoo.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01Merge tag 'x86-urgent-2024-09-01' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Thomas Gleixner: - x2apic_disable() clears x2apic_state and x2apic_mode unconditionally, even when the state is X2APIC_ON_LOCKED, which prevents the kernel to disable it thereby creating inconsistent state. Reorder the logic so it actually works correctly - The XSTATE logic for handling LBR is incorrect as it assumes that XSAVES supports LBR when the CPU supports LBR. In fact both conditions need to be true. Otherwise the enablement of LBR in the IA32_XSS MSR fails and subsequently the machine crashes on the next XRSTORS operation because IA32_XSS is not initialized. Cache the XSTATE support bit during init and make the related functions use this cached information and the LBR CPU feature bit to cure this. - Cure a long standing bug in KASLR KASLR uses the full address space between PAGE_OFFSET and vaddr_end to randomize the starting points of the direct map, vmalloc and vmemmap regions. It thereby limits the size of the direct map by using the installed memory size plus an extra configurable margin for hot-plug memory. This limitation is done to gain more randomization space because otherwise only the holes between the direct map, vmalloc, vmemmap and vaddr_end would be usable for randomizing. The limited direct map size is not exposed to the rest of the kernel, so the memory hot-plug and resource management related code paths still operate under the assumption that the available address space can be determined with MAX_PHYSMEM_BITS. request_free_mem_region() allocates from (1 << MAX_PHYSMEM_BITS) - 1 downwards. That means the first allocation happens past the end of the direct map and if unlucky this address is in the vmalloc space, which causes high_memory to become greater than VMALLOC_START and consequently causes iounmap() to fail for valid ioremap addresses. Cure this by exposing the end of the direct map via PHYSMEM_END and use that for the memory hot-plug and resource management related places instead of relying on MAX_PHYSMEM_BITS. In the KASLR case PHYSMEM_END maps to a variable which is initialized by the KASLR initialization and otherwise it is based on MAX_PHYSMEM_BITS as before. - Prevent a data leak in mmio_read(). The TDVMCALL exposes the value of an initialized variabled on the stack to the VMM. The variable is only required as output value, so it does not have to exposed to the VMM in the first place. - Prevent an array overrun in the resource control code on systems with Sub-NUMA Clustering enabled because the code failed to adjust the index by the number of SNC nodes per L3 cache. * tag 'x86-urgent-2024-09-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/resctrl: Fix arch_mbm_* array overrun on SNC x86/tdx: Fix data leak in mmio_read() x86/kaslr: Expose and use the end of the physical memory address space x86/fpu: Avoid writing LBR bit to IA32_XSS unless supported x86/apic: Make x2apic_disable() work correctly
2024-09-01Merge tag 'locking-urgent-2024-08-25' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking fix from Thomas Gleixner: "A single fix for rt_mutex. The deadlock detection code drops into an infinite scheduling loop while still holding rt_mutex::wait_lock, which rightfully triggers a 'scheduling in atomic' warning. Unlock it before that" * tag 'locking-urgent-2024-08-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: rtmutex: Drop rt_mutex::wait_lock before scheduling
2024-08-30cgroup/cpuset: guard cpuset-v1 code under CONFIG_CPUSETS_V1Chen Ridong
This patch introduces CONFIG_CPUSETS_V1 and guard cpuset-v1 code under CONFIG_CPUSETS_V1. The default value of CONFIG_CPUSETS_V1 is N, so that user who adopted v2 don't have 'pay' for cpuset v1. Signed-off-by: Chen Ridong <chenridong@huawei.com> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-30cgroup/cpuset: rename functions shared between v1 and v2Chen Ridong
Some functions name declared in cpuset-internel.h are generic. To avoid confilicting with other variables for the same name, rename these functions with cpuset_/cpuset1_ prefix to make them unique to cpuset. Signed-off-by: Chen Ridong <chenridong@huawei.com> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-30cgroup/cpuset: move v1 interfaces to cpuset-v1.cChen Ridong
Move legacy cpuset controller interfaces files and corresponding code into cpuset-v1.c. 'update_flag', 'cpuset_write_resmask' and 'cpuset_common_seq_show' are also used for v1, so declare them in cpuset-internal.h. 'cpuset_write_s64', 'cpuset_read_s64' and 'fmeter_getrate' are only used cpuset-v1.c now, make it static. Signed-off-by: Chen Ridong <chenridong@huawei.com> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-30cgroup/cpuset: move validate_change_legacy to cpuset-v1.cChen Ridong
The validate_change_legacy functions is used for v1, move it to cpuset-v1.c. And two micro 'cpuset_for_each_child' and 'cpuset_for_each_descendant_pre' are common for v1 and v2, move them to cpuset-internal.h. Signed-off-by: Chen Ridong <chenridong@huawei.com> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-30cgroup/cpuset: move legacy hotplug update to cpuset-v1.cChen Ridong
There are some differents about hotplug update between cpuset v1 and cpuset v2. Move the legacy code to cpuset-v1.c. 'update_tasks_cpumask' and 'update_tasks_nodemask' are both used in cpuset v1 and cpuset v2, declare them in cpuset-internal.h. The change from original code is that use callback_lock helpers to get callback_lock lock/unlock. Signed-off-by: Chen Ridong <chenridong@huawei.com> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-30cgroup/cpuset: add callback_lock helperChen Ridong
To modify cpuset, both cpuset_mutex and callback_lock are needed. Add helpers for cpuset-v1 to get callback_lock. Signed-off-by: Chen Ridong <chenridong@huawei.com> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-30cgroup/cpuset: move memory_spread to cpuset-v1.cChen Ridong
'memory_spread' is only set in cpuset v1. move corresponding code into cpuset-v1.c. Currently, 'cpuset_update_task_spread_flags' and 'update_tasks_flags' are exposed to cpuset.c. Signed-off-by: Chen Ridong <chenridong@huawei.com> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-30cgroup/cpuset: move relax_domain_level to cpuset-v1.cChen Ridong
Setting domain level is not supported at cpuset v2, so move corresponding code into cpuset-v1.c. The 'cpuset_write_s64' and 'cpuset_read_s64' are only used for setting domain level, move them to cpuset-v1.c. Currently, expose to cpuset.c. After cpuset legacy interface files are move to cpuset-v1.c, they can be static. The 'rebuild_sched_domains_locked' is exposed to cpuset-v1.c. The change from original code is that using 'cpuset_lock' and 'cpuset_unlock' functions to lock or unlock cpuset_mutex. Signed-off-by: Chen Ridong <chenridong@huawei.com> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-30cgroup/cpuset: move memory_pressure to cpuset-v1.cChen Ridong
Collection of memory_pressure can be enabled by writing 1 to the cpuset file 'memory_pressure_enabled', which is only for cpuset-v1. Therefore, move the corresponding code to cpuset-v1.c. Currently, the 'fmeter_init' and 'fmeter_getrate' functions are called at cpuset.c, so expose them to cpuset.c. Signed-off-by: Chen Ridong <chenridong@huawei.com> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-30cgroup/cpuset: move common code to cpuset-internal.hChen Ridong
Move some declarations that will be used for cpuset v1 and v2, including 'cpuset struct', 'cpuset_flagbits_t', cpuset_filetype_t,etc. No logical change. Signed-off-by: Chen Ridong <chenridong@huawei.com> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-30cgroup/cpuset: introduce cpuset-v1.cChen Ridong
This patch introduces the cgroup/cpuset-v1.c source file which will be used for all legacy (cgroup v1) cpuset cgroup code. It also introduces cgroup/cpuset-internal.h to keep declarations shared between cgroup/cpuset.c and cpuset/cpuset-v1.c. As of now, let's compile it if CONFIG_CPUSET is set. Later on it can be switched to use a separate config option, so that the legacy code won't be compiled if not required. Signed-off-by: Chen Ridong <chenridong@huawei.com> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-30cgroup/cpuset: Account for boot time isolated CPUsWaiman Long
With the "isolcpus" boot command line parameter, we are able to create isolated CPUs at boot time. These isolated CPUs aren't fully accounted for in the cpuset code. For instance, the root cgroup's "cpuset.cpus.isolated" control file does not include the boot time isolated CPUs. Fix that by looking for pre-isolated CPUs at init time. The prstate_housekeeping_conflict() function does check the HK_TYPE_DOMAIN housekeeping cpumask to make sure that CPUs outside of it can only be used in isolated partition. Given the fact that we are going to make housekeeping cpumasks dynamic, the current check may not be right anymore. Save the boot time HK_TYPE_DOMAIN cpumask and check against it instead of the upcoming dynamic HK_TYPE_DOMAIN housekeeping cpumask. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-30bpf: Fix a crash when btf_parse_base() returns an error pointerMartin KaFai Lau
The pointer returned by btf_parse_base could be an error pointer. IS_ERR() check is needed before calling btf_free(base_btf). Fixes: 8646db238997 ("libbpf,bpf: Share BTF relocate-related code with kernel") Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Alan Maguire <alan.maguire@oracle.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/20240830012214.1646005-1-martin.lau@linux.dev
2024-08-30drivers/perf: arm_spe: Use perf_allow_kernel() for permissionsJames Clark
Use perf_allow_kernel() for 'pa_enable' (physical addresses), 'pct_enable' (physical timestamps) and context IDs. This means that perf_event_paranoid is now taken into account and LSM hooks can be used, which is more consistent with other perf_event_open calls. For example PERF_SAMPLE_PHYS_ADDR uses perf_allow_kernel() rather than just perfmon_capable(). This also indirectly fixes the following error message which is misleading because perf_event_paranoid is not taken into account by perfmon_capable(): $ perf record -e arm_spe/pa_enable/ Error: Access to performance monitoring and observability operations is limited. Consider adjusting /proc/sys/kernel/perf_event_paranoid setting ... Suggested-by: Al Grant <al.grant@arm.com> Signed-off-by: James Clark <james.clark@linaro.org> Link: https://lore.kernel.org/r/20240827145113.1224604-1-james.clark@linaro.org Link: https://lore.kernel.org/all/20240807120039.GD37996@noisy.programming.kicks-ass.net/ Signed-off-by: Will Deacon <will@kernel.org>
2024-08-30padata: Honor the caller's alignment in case of chunk_size 0Kamlesh Gurudasani
In the case where we are forcing the ps.chunk_size to be at least 1, we are ignoring the caller's alignment. Move the forcing of ps.chunk_size to be at least 1 before rounding it up to caller's alignment, so that caller's alignment is honored. While at it, use max() to force the ps.chunk_size to be at least 1 to improve readability. Fixes: 6d45e1c948a8 ("padata: Fix possible divide-by-0 panic in padata_mt_helper()") Signed-off-by: Kamlesh Gurudasani <kamlesh@ti.com> Acked-by:  Waiman Long <longman@redhat.com> Acked-by: Daniel Jordan <daniel.m.jordan@oracle.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-08-29irqdomain: Use IS_ERR_OR_NULL() in irq_domain_trim_hierarchy()Hongbo Li
Use IS_ERR_OR_NULL() instead of open-coding a NULL and a error pointer check. Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20240828122724.3697447-1-lihongbo22@huawei.com
2024-08-29genirq/msi: Use kmemdup_array() instead of kmemdup()Jinjie Ruan
Let kmemdup_array() take care about sizing instead of doing it open coded. Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20240828072219.1249250-1-ruanjinjie@huawei.com
2024-08-29genirq/proc: Change the return value for set affinity permission errorJeff Xie
Currently, when the affinity of an irq cannot be set due to lack of permission, the write_irq_affinity() returns the error code -EIO. Change the return value to -EPERM as that reflects the cause of error correctly. Signed-off-by: Jeff Xie <jeff.xie@linux.dev> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20240826145805.5938-1-jeff.xie@linux.dev
2024-08-29genirq/proc: Use irq_move_pending() in show_irq_affinity()Jinjie Ruan
irq_move_pending() encapsulates irqd_is_setaffinity_pending() depending on CONFIG_GENERIC_PENDING_IRQ. Replace the open coded #ifdeffery with it. Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20240829111522.230595-1-ruanjinjie@huawei.com
2024-08-29genirq/proc: Correctly set file permissions for affinity control filesJeff Xie
The kernel already knows at the time of interrupt allocation whether affinity of an interrupt can be controlled by userspace or not. It still creates all related procfs control files with read/write permissions. That's inconsistent and non-intuitive for system administrators and tools. Therefore set the file permissions to read-only for such interrupts. [ tglx: Massage change log, fixed UP build ] Signed-off-by: Jeff Xie <jeff.xie@linux.dev> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20240825131911.107119-1-jeff.xie@linux.dev
2024-08-29timers: Remove historical extra jiffie for timeout in msleep()Anna-Maria Behnsen
msleep() and msleep_interruptible() add a jiffie to the requested timeout. This extra jiffie was introduced to ensure that the timeout will not happen earlier than specified. Since the rework of the timer wheel, the enqueue path already takes care of this. So the extra jiffie added by msleep*() is pointless now. Remove this extra jiffie in msleep() and msleep_interruptible(). Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://lore.kernel.org/all/20240829074133.4547-1-anna-maria@linutronix.de
2024-08-28audit: use task_tgid_nr() instead of task_pid_nr()Ricardo Robaina
In a few audit records, PIDs were being recorded with task_pid_nr() instead of task_tgid_nr(). $ grep "task_pid_nr" kernel/audit*.c audit.c: task_pid_nr(current), auditfilter.c: pid = task_pid_nr(current); auditsc.c: audit_log_format(ab, " pid=%u", task_pid_nr(current)); For single-thread applications, the process id (pid) and the thread group id (tgid) are the same. However, on multi-thread applications, task_pid_nr() returns the current thread id (user-space's TID), while task_tgid_nr() returns the main thread id (user-space's PID). Since the users are more interested in the process id (pid), rather than the thread id (tid), this patch converts these callers to the correct method. Link: https://github.com/linux-audit/audit-kernel/issues/126 Reviewed-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Ricardo Robaina <rrobaina@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
2024-08-27genirq: Get rid of global lock in irq_do_set_affinity()Marc Zyngier
Kunkun Jiang reports that for a workload involving the simultaneous startup of a large number of VMs (for a total of about 200 vcpus), a lot of CPU time gets spent on spinning on the tmp_mask_lock that exists as a static raw spinlock in irq_do_set_affinity(). This lock protects a global cpumask (tmp_mask) that is used as a temporary variable to compute the resulting affinity. While this is triggered by KVM issuing a irq_set_affinity() call each time a vcpu is about to execute, it is obvious that having a single global resource is not very scalable. Since a cpumask can be a fairly large structure on systems with a high core count, a stack allocation is not really appropriate. Instead, turn the global cpumask into a per-CPU variable, removing the need for locking altogether as the code is executed with preemption and interrupts disabled. [ tglx: Moved the per CPU variable declaration outside of the function ] Reported-by: Kunkun Jiang <jiangkunkun@huawei.com> Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Kunkun Jiang <jiangkunkun@huawei.com> Link: https://lore.kernel.org/all/20240826080618.3886694-1-maz@kernel.org Link: https://lore.kernel.org/all/a7fc58e4-64c2-77fc-c1dc-f5eb78dbbb01@huawei.com
2024-08-27Merge tag 'vfs-6.11-rc6.fixes' of ↵Linus Torvalds
gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: "VFS: - Ensure that backing files uses file->f_ops->splice_write() for splice netfs: - Revert the removal of PG_private_2 from netfs_release_folio() as cephfs still relies on this - When AS_RELEASE_ALWAYS is set on a mapping the folio needs to always be invalidated during truncation - Fix losing untruncated data in a folio by making letting netfs_release_folio() return false if the folio is dirty - Fix trimming of streaming-write folios in netfs_inval_folio() - Reset iterator before retrying a short read - Fix interaction of streaming writes with zero-point tracker afs: - During truncation afs currently calls truncate_setsize() which sets i_size, expands the pagecache and truncates it. The first two operations aren't needed because they will have already been done. So call truncate_pagecache() instead and skip the redundant parts overlayfs: - Fix checking of the number of allowed lower layers so 500 layers can actually be used instead of just 499 - Add missing '\n' to pr_err() output - Pass string to ovl_parse_layer() and thus allow it to be used for Opt_lowerdir as well pidfd: - Revert blocking the creation of pidfds for kthread as apparently userspace relies on this. Specifically, it breaks systemd during shutdown romfs: - Fix romfs_read_folio() to use the correct offset with folio_zero_tail()" * tag 'vfs-6.11-rc6.fixes' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: netfs: Fix interaction of streaming writes with zero-point tracker netfs: Fix missing iterator reset on retry of short read netfs: Fix trimming of streaming-write folios in netfs_inval_folio() netfs: Fix netfs_release_folio() to say no if folio dirty afs: Fix post-setattr file edit to do truncation correctly mm: Fix missing folio invalidation calls during truncation ovl: ovl_parse_param_lowerdir: Add missed '\n' for pr_err ovl: fix wrong lowerdir number check for parameter Opt_lowerdir ovl: pass string to ovl_parse_layer() backing-file: convert to using fops->splice_write Revert "pidfd: prevent creation of pidfds for kthreads" romfs: fix romfs_read_folio() netfs, ceph: Partially revert "netfs: Replace PG_fscache by setting folio->private and marking dirty"
2024-08-24Merge tag 'cgroup-for-6.11-rc4-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: "Three patches addressing cpuset corner cases" * tag 'cgroup-for-6.11-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup/cpuset: Eliminate unncessary sched domains rebuilds in hotplug cgroup/cpuset: Clear effective_xcpus on cpus_allowed clearing only if cpus.exclusive not set cgroup/cpuset: fix panic caused by partcmd_update
2024-08-24Merge tag 'wq-for-6.11-rc4-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Pull workqueue fixes from Tejun Heo: "Nothing too interesting. One patch to remove spurious warning and others to address static checker warnings" * tag 'wq-for-6.11-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: Correct declaration of cpu_pwq in struct workqueue_struct workqueue: Fix spruious data race in __flush_work() workqueue: Remove incorrect "WARN_ON_ONCE(!list_empty(&worker->entry));" from dying worker workqueue: Fix UBSAN 'subtraction overflow' error in shift_and_mask() workqueue: doc: Fix function name, remove markers
2024-08-23hrtimer: Use and report correct timerslack values for realtime tasksFelix Moessbauer
The timerslack_ns setting is used to specify how much the hardware timers should be delayed, to potentially dispatch multiple timers in a single interrupt. This is a performance optimization. Timers of realtime tasks (having a realtime scheduling policy) should not be delayed. This logic was inconsitently applied to the hrtimers, leading to delays of realtime tasks which used timed waits for events (e.g. condition variables). Due to the downstream override of the slack for rt tasks, the procfs reported incorrect (non-zero) timerslack_ns values. This is changed by setting the timer_slack_ns task attribute to 0 for all tasks with a rt policy. By that, downstream users do not need to specially handle rt tasks (w.r.t. the slack), and the procfs entry shows the correct value of "0". Setting non-zero slack values (either via procfs or PR_SET_TIMERSLACK) on tasks with a rt policy is ignored, as stated in "man 2 PR_SET_TIMERSLACK": Timer slack is not applied to threads that are scheduled under a real-time scheduling policy (see sched_setscheduler(2)). The special handling of timerslack on rt tasks in downstream users is removed as well. Signed-off-by: Felix Moessbauer <felix.moessbauer@siemens.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20240814121032.368444-2-felix.moessbauer@siemens.com
2024-08-21Revert "pidfd: prevent creation of pidfds for kthreads"Christian Brauner
This reverts commit 3b5bbe798b2451820e74243b738268f51901e7d0. Eric reported that systemd-shutdown gets broken by blocking the creating of pidfds for kthreads as older versions seems to rely on being able to create a pidfd for any process in /proc. Reported-by: Eric Biggers <ebiggers@kernel.org> Link: https://lore.kernel.org/r/20240818035818.GA1929@sol.localdomain Signed-off-by: Christian Brauner <brauner@kernel.org>