path: root/kernel
2024-08-21  rcu: Mark emergency sections in rcu stalls  (John Ogness)
Mark emergency sections wherever multiple lines of rcu stall information are generated. In an emergency section, every printk() call will attempt to directly flush to the consoles using the EMERGENCY priority. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Acked-by: Paul E. McKenney <paulmck@kernel.org> Link: https://lore.kernel.org/r/20240820063001.36405-35-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>
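For illustration only, the usage pattern this commit describes might look roughly like the following around the stall report; the helper names nbcon_cpu_emergency_enter()/nbcon_cpu_emergency_exit() follow the series' wording and are an assumption here, not a quote of the patch:

  /* Wrap the multi-line stall report in an emergency section so that
   * every printk() attempts a direct flush at EMERGENCY priority. */
  nbcon_cpu_emergency_enter();

  pr_err("INFO: %s detected stalls on CPUs/tasks:\n", rcu_state.name);
  /* ... print per-CPU stall details and backtraces ... */

  nbcon_cpu_emergency_exit();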
2024-08-21  panic: Mark emergency section in oops  (John Ogness)
Mark an emergency section beginning with oops_enter() until the end of oops_exit(). In this section, every printk() call will attempt to directly flush to the consoles using the EMERGENCY priority. The very end of oops_exit() performs a kmsg_dump(). This is not included in the emergency section because it is another flushing mechanism that should occur after the consoles have flushed the oops messages. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240820063001.36405-34-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-08-21  panic: Mark emergency section in warn  (Thomas Gleixner)
Mark the full contents of __warn() as an emergency section. In this section, every printk() call will attempt to directly flush to the consoles using the EMERGENCY priority. Co-developed-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Thomas Gleixner (Intel) <tglx@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240820063001.36405-33-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-08-21  printk: nbcon: Implement emergency sections  (Thomas Gleixner)
In emergency situations (something has gone wrong but the system continues to operate), usually important information (such as a backtrace) is generated via printk(). This information should be pushed out to the consoles ASAP. Add per-CPU emergency nesting tracking because an emergency can arise while in an emergency situation. Add functions to mark the beginning and end of emergency sections where the urgent messages are generated. Perform direct console flushing at the emergency priority if the current CPU is in an emergency state and it is safe to do so. Note that the emergency state is not system-wide. While one CPU is in an emergency state, another CPU may attempt to print console messages at normal priority. Also note that printk() already attempts to flush consoles in the caller context for normal priority. However, follow-up changes will introduce printing kthreads, in which case the normal priority printk() calls will offload to the kthreads. Co-developed-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Thomas Gleixner (Intel) <tglx@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240820063001.36405-32-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>
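A minimal sketch of the per-CPU nesting described above; the variable and function names are assumptions for illustration and the real patch differs in detail:

  static DEFINE_PER_CPU(unsigned int, nbcon_pcpu_emergency_nesting);

  void nbcon_cpu_emergency_enter(void)
  {
          /* Pin to this CPU so the nesting counter stays consistent. */
          preempt_disable();
          this_cpu_inc(nbcon_pcpu_emergency_nesting);
  }

  void nbcon_cpu_emergency_exit(void)
  {
          this_cpu_dec(nbcon_pcpu_emergency_nesting);
          preempt_enable();
  }

  /* Consulted when deciding the priority of a printing context: */
  enum nbcon_prio nbcon_get_default_prio(void)
  {
          if (this_cpu_read(nbcon_pcpu_emergency_nesting))
                  return NBCON_PRIO_EMERGENCY;
          return NBCON_PRIO_NORMAL;
  }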
2024-08-21  printk: Add helper for flush type logic  (John Ogness)
There are many call sites where console flushing occurs. Depending on the system state and the types of consoles, the flush methods to use differ. A flush call site generally must consider:

  @have_boot_console
  @have_nbcon_console
  @have_legacy_console
  @legacy_allow_panic_sync
  is_printk_preferred()

and take into account the current CPU state:

  NBCON_PRIO_NORMAL
  NBCON_PRIO_EMERGENCY
  NBCON_PRIO_PANIC

in order to decide if it should:

  flush nbcon directly via the write_atomic() callback
  flush legacy directly via console_unlock()
  flush legacy via offload to irq_work

All of these call sites use their own logic to make this decision, which is complicated and error prone. Especially later when two more flush methods will be introduced:

  flush nbcon via offload to kthread
  flush legacy via offload to kthread

Introduce a new internal struct console_flush_type that specifies which console flushing methods should be used in the context of the caller. Introduce a helper function to fill out console_flush_type to be used at flushing call sites. Replace the logic of all flushing call sites to use the new helper. This change standardizes behavior, leading to both fixes and optimizations across various call sites. For instance, in console_cpu_notify() the new logic ensures that nbcon consoles are flushed when they aren't managed by the legacy loop. Similarly, in console_flush_on_panic() the system no longer needs to flush nbcon consoles if none are present. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240820063001.36405-31-john.ogness@linutronix.de [pmladek@suse.com: Updated the commit message.] Signed-off-by: Petr Mladek <pmladek@suse.com>
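One plausible shape of the new struct and a flushing call site, inferred from the description above (field and helper names are assumptions, not the patch itself):

  struct console_flush_type {
          bool    nbcon_atomic;   /* flush nbcon via write_atomic() */
          bool    legacy_direct;  /* flush legacy via console_unlock() */
          bool    legacy_offload; /* flush legacy via irq_work offload */
  };

  /* A flushing call site then reduces to: */
  static void flush_consoles_example(void)
  {
          struct console_flush_type ft;

          printk_get_console_flush_type(&ft);

          if (ft.nbcon_atomic)
                  nbcon_atomic_flush_pending();

          if (ft.legacy_direct && console_trylock())
                  console_unlock();

          if (ft.legacy_offload)
                  defer_console_output();
  }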
2024-08-21  printk: Coordinate direct printing in panic  (John Ogness)
If legacy and nbcon consoles are registered and the nbcon consoles are allowed to flush (i.e. no boot consoles registered), the legacy consoles will no longer perform direct printing on the panic CPU until after the backtrace has been stored. This will give the safe nbcon consoles a chance to print the panic messages before allowing the unsafe legacy consoles to print. If no nbcon consoles are registered or they are not allowed to flush because boot consoles are registered, there is no change in behavior (i.e. legacy consoles will always attempt to print from the printk() caller context). Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240820063001.36405-30-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-08-21  printk: Track nbcon consoles  (John Ogness)
Add a global flag @have_nbcon_console to identify if any nbcon consoles are registered. This will be used in follow-up commits to preserve legacy behavior when no nbcon consoles are registered. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240820063001.36405-29-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-08-21  printk: Avoid console_lock dance if no legacy or boot consoles  (John Ogness)
Currently the console lock is used to attempt legacy-type printing even if there are no legacy or boot consoles registered. If no such consoles are registered, the console lock does not need to be taken. Add tracking of legacy console registration and use it with boot console tracking to avoid unnecessary code paths, i.e. do not use the console lock if there are no boot consoles and no legacy consoles. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240820063001.36405-28-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-08-21  printk: nbcon: Add unsafe flushing on panic  (John Ogness)
Add nbcon_atomic_flush_unsafe() to flush all nbcon consoles using the write_atomic() callback and allowing unsafe hostile takeovers. Call this at the end of panic() as a final attempt to flush any pending messages. Note that legacy consoles use unsafe methods for flushing from the beginning of panic (see bust_spinlocks()). Therefore, systems using both legacy and nbcon consoles may still fail to see panic messages due to unsafe legacy console usage. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240820063001.36405-27-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-08-21  printk: Flush nbcon consoles first on panic  (John Ogness)
In console_flush_on_panic(), flush the nbcon consoles before flushing legacy consoles. The legacy write() callbacks are not fully safe when oops_in_progress is set. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240820063001.36405-26-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-08-21  printk: nbcon: Flush new records on device_release()  (John Ogness)
There may be new records that were added while a driver was holding the nbcon context for non-printing purposes. These new records must be flushed by the nbcon_device_release() context because no other context will do it. If boot consoles are registered, the legacy loop is used (either direct or per irq_work) to handle the flushing. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240820063001.36405-25-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-08-21  printk: Add is_printk_legacy_deferred()  (John Ogness)
If printk has been explicitly deferred or is called from NMI context, legacy console printing must be deferred to an irq_work context. Introduce a helper function is_printk_legacy_deferred() for a CPU to query if it must defer legacy console printing. In follow-up commits this helper will be needed at other call sites as well. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240820063001.36405-24-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>
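A minimal sketch of what such a query might look like, assuming the existing per-CPU printk_context counter that printk_deferred_enter() increments (an assumption, not a quote of the patch):

  bool is_printk_legacy_deferred(void)
  {
          /*
           * Legacy console printing must go through irq_work when it
           * was explicitly deferred or when running in NMI context,
           * where taking the console lock is not possible.
           */
          return (this_cpu_read(printk_context) || in_nmi());
  }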
2024-08-21  printk: nbcon: Use nbcon consoles in console_flush_all()  (John Ogness)
Allow nbcon consoles to print messages in the legacy printk() caller context (printing via unlock) by integrating them into console_flush_all(). The write_atomic() callback is used for printing. Provide nbcon_legacy_emit_next_record(), which acts as the nbcon variant of console_emit_next_record(). Call this variant within console_flush_all() for nbcon consoles. Since nbcon consoles use their own @nbcon_seq variable to track the next record to print, this also must be appropriately handled in console_flush_all(). Note that the legacy printing logic uses @handover to detect handovers for printing all consoles. For nbcon consoles, handovers/takeovers occur on a per-console basis and thus do not cause the console_flush_all() loop to abort. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240820063001.36405-23-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-08-21  printk: Track registered boot consoles  (John Ogness)
Unfortunately it is not known if a boot console and a regular (legacy or nbcon) console use the same hardware. For this reason they must not be allowed to print simultaneously. For legacy consoles this is not an issue because they are already synchronized with the boot consoles using the console lock. However nbcon consoles can be triggered separately. Add a global flag @have_boot_console to identify if any boot consoles are registered. This will be used in follow-up commits to ensure that boot consoles and nbcon consoles cannot print simultaneously. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240820063001.36405-22-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-08-21  printk: nbcon: Provide function to flush using write_atomic()  (Thomas Gleixner)
Provide nbcon_atomic_flush_pending() to perform flushing of all registered nbcon consoles using their write_atomic() callback. Unlike console_flush_all(), nbcon_atomic_flush_pending() will only flush up through the newest record at the time of the call. This prevents a CPU from printing unbounded when other CPUs are adding records. If new records are added while flushing, it is expected that the dedicated printer threads will print those records. If the printer thread is not available (which is always the case at this point in the rework), nbcon_atomic_flush_pending() _will_ flush all records in the ringbuffer. Unlike console_flush_all(), nbcon_atomic_flush_pending() will fully flush one console before flushing the next. This helps to guarantee that a block of pending records (such as a stack trace in an emergency situation) can be printed atomically at once before releasing console ownership. nbcon_atomic_flush_pending() is safe in any context because it uses write_atomic() and acquires with unsafe_takeover disabled. Co-developed-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Thomas Gleixner (Intel) <tglx@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240820063001.36405-21-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-08-21  printk: nbcon: Add helper to assign priority based on CPU state  (John Ogness)
Add a helper function to use the current state of the CPU to determine which priority to assign to the printing context. The EMERGENCY priority handling is added in a follow-up commit. It will use a per-CPU variable. Note: nbcon_device_try_acquire(), which is used by console drivers to acquire the nbcon console for non-printing activities, is hard-coded to always use NORMAL priority. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240820063001.36405-20-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-08-21  printk: Add @flags argument for console_is_usable()  (John Ogness)
The caller of console_is_usable() usually needs @console->flags for its own checks. Rather than having console_is_usable() read its own copy, make the caller pass in the @flags. This also ensures that the caller saw the same @flags value. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240820063001.36405-19-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-08-21  printk: Let console_is_usable() handle nbcon  (John Ogness)
The nbcon consoles use a different printing callback. For nbcon consoles, check for the write_atomic() callback instead of write(). Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240820063001.36405-18-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-08-21  printk: Make console_is_usable() available to nbcon.c  (John Ogness)
Move console_is_usable() as-is into internal.h so that it can be used by nbcon printing functions as well. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240820063001.36405-17-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-08-21  printk: nbcon: Do not rely on proxy headers  (John Ogness)
The headers kernel.h, serial_core.h, and console.h allow for the definitions of many types and functions from other headers. Rather than relying on these as proxy headers, explicitly include all headers providing needed definitions. Also sort the list alphabetically to be able to easily detect duplicates. Suggested-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Acked-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240820063001.36405-16-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-08-21  nbcon: Add API to acquire context for non-printing operations  (John Ogness)
Provide functions nbcon_device_try_acquire() and nbcon_device_release() which will try to acquire the nbcon console ownership with NBCON_PRIO_NORMAL and mark it unsafe for handover/takeover. These functions are to be used together with the device-specific locking when performing non-printing activities on the console device. They will allow synchronization against the write_atomic() callback which will be serialized, for higher priority contexts, only by acquiring the console context ownership. Pitfalls: The API must be called in a context with migration disabled because it uses per-CPU variables internally. The context is set unsafe for a takeover the whole time. It guarantees full serialization against any write_atomic() caller except for the final flush in panic(), which might try an unsafe takeover. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240820063001.36405-14-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>
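A hypothetical driver-side usage sketch; the uart_port details are illustrative assumptions, the point being that the driver lock and the nbcon context are paired for non-printing hardware access:

  /* Non-printing access to hardware shared with write_atomic(): */
  uart_port_lock_irqsave(port, &flags);

  if (nbcon_device_try_acquire(port->cons)) {
          /* Serialized against write_atomic() from lower/equal
           * priority contexts; only the final panic() flush may
           * still take over unsafely. */

          /* ... reprogram registers also used by write_atomic() ... */

          nbcon_device_release(port->cons);
  }

  uart_port_unlock_irqrestore(port, flags);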
2024-08-21  printk: nbcon: Use driver synchronization while (un)registering  (John Ogness)
Console drivers typically have to deal with access to the hardware via user input/output (such as an interactive login shell) and output of kernel messages via printk() calls. They use some classic driver-specific locking mechanism in most situations. But console->write_atomic() callbacks, used by nbcon consoles, are synchronized only by acquiring the console context. The synchronization via the console context ownership is possible only when the console driver is registered. It is when a particular device driver is connected with a particular console driver. The two synchronization mechanisms must be synchronized between each other. It is tricky because the console context ownership is quite special. It might be taken over by a higher priority context. Also CPU migration must be disabled. The most tricky part is to (dis)connect these two mechanisms during the console (un)registration. Use the driver-specific locking callbacks: device_lock(), device_unlock(). They allow taking the device-specific lock while the device is being (un)registered by the related console driver. For example, these callbacks lock/unlock the port lock for serial port drivers. Note that the driver-specific locking is only needed during (un)register if it is an nbcon console with the write_atomic() callback implemented. If write_atomic() is not implemented, the driver should never attempt to access the hardware without first acquiring its driver-specific lock. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240820063001.36405-10-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-08-21  printk: nbcon: Remove return value for write_atomic()  (John Ogness)
The return value of write_atomic() does not provide any useful information. On the contrary, it makes things more complicated for the caller to appropriately deal with the information. Change write_atomic() to not have a return value. If the message did not get printed due to loss of ownership, the caller will notice this on its own. If ownership was not lost, it will be assumed that the driver successfully printed the message and the sequence number for that console will be incremented. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240820063001.36405-7-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>
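The resulting callback prototype, plus a sketch of a hypothetical driver implementation that simply returns when ownership is lost (the driver name and body are illustrative assumptions):

  /* Member of struct console (other members omitted): */
  void (*write_atomic)(struct console *con, struct nbcon_write_context *wctxt);

  static void my_uart_write_atomic(struct console *con,
                                   struct nbcon_write_context *wctxt)
  {
          if (!nbcon_enter_unsafe(wctxt))
                  return;         /* ownership lost; the caller notices */

          /* ... emit wctxt->outbuf to the hardware ... */

          nbcon_exit_unsafe(wctxt);
  }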
2024-08-21  printk: nbcon: Clarify rules of the owner/waiter matching  (John Ogness)
The functions nbcon_owner_matches() and nbcon_waiter_matches() use a minimal set of data to determine if a context matches. The existing kerneldoc and comments were not clear enough and caused the printk folks to re-prove that the functions are indeed reliable in all cases. Update and expand the explanations so that it is clear that the implementations are sufficient for all cases. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240820063001.36405-6-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-08-21  printk: Check printk_deferred_enter()/_exit() usage  (Sebastian Andrzej Siewior)
Add validation that printk_deferred_enter()/_exit() are called in non-migration contexts. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240820063001.36405-5-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>
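A minimal sketch of the added validation, assuming it is built on cant_migrate(); the internal helper names are illustrative:

  #define printk_deferred_enter()                 \
          do {                                    \
                  cant_migrate();                 \
                  __printk_safe_enter();          \
          } while (0)

  #define printk_deferred_exit()                  \
          do {                                    \
                  cant_migrate();                 \
                  __printk_safe_exit();           \
          } while (0)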
2024-08-21  printk: Properly deal with nbcon consoles on seq init  (Petr Mladek)
If a non-boot console is registering and boot consoles exist, the consoles are flushed before being unregistered. This allows the non-boot console to continue where the boot console left off. If for whatever reason flushing fails, the lowest seq found from any of the enabled boot consoles is used. Until now con->seq was checked. However, if it is an nbcon boot console, the function nbcon_seq_read() must be used to read seq because con->seq is not updated for nbcon consoles. Check if it is an nbcon boot console and if so call nbcon_seq_read() to read seq. Also, avoid usage of con->seq as temporary storage of the starting record. Instead, rename console_init_seq() to get_init_console_seq() and just return the value. For nbcon consoles set the sequence via nbcon_seq_force(), for legacy consoles set con->seq. The cleaned design should make sure that the value stays and is set before the console is added to the console list. It also unifies the sequence number initialization for legacy and nbcon consoles. Reviewed-by: John Ogness <john.ogness@linutronix.de> Link: https://lore.kernel.org/r/20240820063001.36405-4-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-08-21  printk: nbcon: Consolidate alloc() and init()  (John Ogness)
Rather than splitting the nbcon allocation and initialization into two pieces, perform all initialization in nbcon_alloc(). Later, the initial sequence is calculated and can be explicitly set using nbcon_seq_force(). This removes the need for the strong rules of nbcon_init() that even included a BUG_ON(). Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240820063001.36405-3-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-08-21  printk: Add notation to console_srcu locking  (John Ogness)
kernel/printk/printk.c:284:5: sparse: sparse: context imbalance in 'console_srcu_read_lock' - wrong count at exit include/linux/srcu.h:301:9: sparse: sparse: context imbalance in 'console_srcu_read_unlock' - unexpected unlock Fixes: 6c4afa79147e ("printk: Prepare for SRCU console list protection") Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Acked-by: Paul E. McKenney <paulmck@kernel.org> Link: https://lore.kernel.org/r/20240820063001.36405-2-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-08-20  workqueue: Don't call va_start / va_end twice  (Matthew Brost)
Calling va_start / va_end multiple times on the same va_list is undefined and causes problems with certain compilers / platforms. Change alloc_ordered_workqueue_lockdep_map to a macro and update __alloc_workqueue to take a va_list argument. Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Tejun Heo <tj@kernel.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Tejun Heo <tj@kernel.org>
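The fix follows the classic vprintf-style split: consume the va_list exactly once in a helper that takes va_list, and keep the single va_start()/va_end() pair in a thin wrapper. A generic, self-contained illustration of that pattern (not the workqueue code itself):

  #include <stdarg.h>
  #include <stdio.h>

  /* The helper consumes the va_list exactly once. */
  static void log_vmsg(const char *fmt, va_list args)
  {
          vprintf(fmt, args);
  }

  /* The wrapper owns the single va_start()/va_end() pair. */
  static void log_msg(const char *fmt, ...)
  {
          va_list args;

          va_start(args, fmt);
          log_vmsg(fmt, args);
          va_end(args);
  }

  int main(void)
  {
          log_msg("workqueue %s/%d\n", "ordered", 1);
          return 0;
  }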
2024-08-20  sched_ext: Allow dequeue_task_scx to fail  (Yipeng Zou)
Since dequeue_task() is now allowed to fail, there is a compile error:

  kernel/sched/ext.c:3630:19: error: initialization of ‘bool (*)(struct rq*, struct task_struct *, int)’ {aka ‘_Bool (*)(struct rq *, struct task_struct *, int)’} from incompatible pointer type ‘void (*)(struct rq*, struct task_struct *, int)’
   3630 | .dequeue_task = dequeue_task_scx,
        |                 ^~~~~~~~~~~~~~~~

Allow dequeue_task_scx to fail too. Fixes: 863ccdbb918a ("sched: Allow sched_class::dequeue_task() to fail") Signed-off-by: Yipeng Zou <zouyipeng@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-20  Merge branch 'tip/sched/core' into for-6.12  (Tejun Heo)
To receive 863ccdbb918a ("sched: Allow sched_class::dequeue_task() to fail") which makes sched_class.dequeue_task() return bool instead of void. This leads to compile breakage and will be fixed by a follow-up patch. Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-20  cgroup/cpuset: remove use_parent_ecpus of cpuset  (Chen Ridong)
use_parent_ecpus is used to track whether the children are using the parent's effective_cpus. When a parent's effective_cpus is changed due to changes in a child partition's effective_xcpus, any child using the parent's effective_cpus must call update_cpumasks_hier(). However, if a child is not a valid partition, it is sufficient to determine whether to call update_cpumasks_hier() based on whether the child's effective_cpus is going to change. To make the code more succinct, it is suggested to remove use_parent_ecpus. Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-20  cgroup/cpuset: remove fetch_xcpus  (Chen Ridong)
Both the fetch_xcpus() and user_xcpus() functions are used to retrieve the value of exclusive_cpus. If exclusive_cpus is not set, cpus_allowed is the implicit value used as exclusive in a local partition. I cannot imagine a scenario where effective_xcpus is not empty when exclusive_cpus is empty. Therefore, I suggest removing the fetch_xcpus function. Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-20  cgroup/cpuset: Correct invalid remote partition prs  (Chen Ridong)
When enabling a remote partition, I found that:

  cd /sys/fs/cgroup/
  mkdir test
  mkdir test/test1
  echo +cpuset > cgroup.subtree_control
  echo +cpuset > test/cgroup.subtree_control
  echo 3 > test/test1/cpuset.cpus
  echo root > test/test1/cpuset.cpus.partition
  cat test/test1/cpuset.cpus.partition
  root invalid (Parent is not a partition root)

For a remote partition the parent does not have to be a partition root, so this reason is misleading. The actual cause is the empty effective_xcpus. It would be better to prompt the message "invalid cpu list in cpuset.cpus.exclusive". Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-20  softirq: Remove unused 'action' parameter from action callback  (Caleb Sander Mateos)
When soft interrupt actions are called, they are passed a pointer to the struct softirq action which contains the action's function pointer. This pointer isn't useful, as the action callback already knows what function it is. And since each callback handles a specific soft interrupt, the callback also knows which soft interrupt number is running. No soft interrupt action callback actually uses this parameter, so remove it from the function pointer signature. This clarifies that soft interrupt actions are global routines and makes it slightly cheaper to call them. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/all/20240815171549.3260003-1-csander@purestorage.com
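The resulting change to the callback type, sketched with one real handler name; treat the exact signatures as an assumption:

  /* Before: the handler received a pointer it never used. */
  static void tasklet_action(struct softirq_action *a);

  /* After: the parameter is dropped from the callback type ... */
  static void tasklet_action(void);

  /* ... and registration stays the same apart from the type: */
  open_softirq(TASKLET_SOFTIRQ, tasklet_action);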
2024-08-20  irqdomain: Always associate interrupts for legacy domains  (Matti Vaittinen)
The unification of irq_domain_create_legacy() missed the fact that interrupts must be associated even when the Linux interrupt number provided in the first_irq argument is 0. This breaks all call sites of irq_domain_create_legacy() which supply 0 as the first_irq argument. Enforce the association for legacy domains in __irq_domain_instantiate() to cure this. [ tglx: Massaged it slightly. ] Fixes: 70114e7f7585 ("irqdomain: Simplify simple and legacy domain creation") Reported-by: Jiaxun Yang <jiaxun.yang@flygoat.com> Signed-off-by: Matti Vaittinen <mazziesaccount@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Jiaxun Yang <jiaxun.yang@flygoat.com> Link: https://lore.kernel.org/all/c3379142-10bc-4f14-b8ac-a46927aeac38@gmail.com
2024-08-20  x86/kaslr: Expose and use the end of the physical memory address space  (Thomas Gleixner)
iounmap() on x86 occasionally fails to unmap because the provided valid ioremap address is not below high_memory. It turned out that this happens due to KASLR. KASLR uses the full address space between PAGE_OFFSET and vaddr_end to randomize the starting points of the direct map, vmalloc and vmemmap regions. It thereby limits the size of the direct map by using the installed memory size plus an extra configurable margin for hot-plug memory. This limitation is done to gain more randomization space because otherwise only the holes between the direct map, vmalloc, vmemmap and vaddr_end would be usable for randomizing. The limited direct map size is not exposed to the rest of the kernel, so the memory hot-plug and resource management related code paths still operate under the assumption that the available address space can be determined with MAX_PHYSMEM_BITS. request_free_mem_region() allocates from (1 << MAX_PHYSMEM_BITS) - 1 downwards. That means the first allocation happens past the end of the direct map and if unlucky this address is in the vmalloc space, which causes high_memory to become greater than VMALLOC_START and consequently causes iounmap() to fail for valid ioremap addresses. MAX_PHYSMEM_BITS cannot be changed for that because the randomization does not align with address bit boundaries and there are other places which actually need to know the maximum number of address bits. All remaining usage sites of MAX_PHYSMEM_BITS have been analyzed and found to be correct. Cure this by exposing the end of the direct map via PHYSMEM_END and use that for the memory hot-plug and resource management related places instead of relying on MAX_PHYSMEM_BITS. In the KASLR case PHYSMEM_END maps to a variable which is initialized by the KASLR initialization and otherwise it is based on MAX_PHYSMEM_BITS as before. To prevent future hiccups, add a check into add_pages() to catch callers trying to add memory above PHYSMEM_END. Fixes: 0483e1fa6e09 ("x86/mm: Implement ASLR for kernel memory regions") Reported-by: Max Ramanouski <max8rr8@gmail.com> Reported-by: Alistair Popple <apopple@nvidia.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Max Ramanouski <max8rr8@gmail.com> Tested-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Kees Cook <kees@kernel.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/87ed6soy3z.ffs@tglx
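A simplified sketch of the idea; the exact x86 plumbing is an assumption based on the description above:

  /*
   * With KASLR the direct map may end well below the architectural
   * maximum, so expose the real end instead of deriving it from
   * MAX_PHYSMEM_BITS.
   */
  #ifdef CONFIG_RANDOMIZE_MEMORY
  extern unsigned long physmem_end;       /* set up during KASLR init */
  # define PHYSMEM_END    physmem_end
  #else
  # define PHYSMEM_END    (((phys_addr_t)1 << MAX_PHYSMEM_BITS) - 1)
  #endif

  /* Hot-plug and resource management code then bounds its searches
   * with PHYSMEM_END instead of (1 << MAX_PHYSMEM_BITS) - 1. */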
2024-08-19  bpf: Allow bpf_current_task_under_cgroup() with BPF_CGROUP_*  (Matteo Croce)
The helper bpf_current_task_under_cgroup() is currently only allowed for tracing programs; allow its usage also in the BPF_CGROUP_* program types. Move the code from kernel/trace/bpf_trace.c to kernel/bpf/helpers.c, so it compiles also without CONFIG_BPF_EVENTS. This will be used in systemd-networkd to monitor sysctl writes and filter its own writes from others: https://github.com/systemd/systemd/pull/32212 Signed-off-by: Matteo Croce <teknoraver@meta.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20240819162805.78235-3-technoboy85@gmail.com
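A hypothetical sketch of the systemd-networkd style use case this enables: a cgroup/sysctl program that recognizes writes coming from its own cgroup (the map name and the allow/inspect policy are assumptions):

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  struct {
          __uint(type, BPF_MAP_TYPE_CGROUP_ARRAY);
          __uint(max_entries, 1);
          __uint(key_size, sizeof(__u32));
          __uint(value_size, sizeof(__u32));
  } own_cgroup SEC(".maps");

  SEC("cgroup/sysctl")
  int filter_sysctl_writes(struct bpf_sysctl *ctx)
  {
          /* Returns 1 if current runs in the cgroup stored at index 0. */
          if (bpf_current_task_under_cgroup(&own_cgroup, 0))
                  return 1;       /* our own write: allow */

          /* ... inspect or reject writes from other cgroups ... */
          return 1;
  }

  char LICENSE[] SEC("license") = "GPL";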
2024-08-19  bpf: Enable generic kfuncs for BPF_CGROUP_* programs  (Matteo Croce)
These kfuncs are enabled even in BPF_PROG_TYPE_TRACING, so they should be safe also in BPF_CGROUP_* programs. Since all BPF_CGROUP_* programs share the same hook, call register_btf_kfunc_id_set() only once. In enum btf_kfunc_hook, rename BTF_KFUNC_HOOK_CGROUP_SKB to a more generic BTF_KFUNC_HOOK_CGROUP, since it's used for all the cgroup related program types. Signed-off-by: Matteo Croce <teknoraver@meta.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20240819162805.78235-2-technoboy85@gmail.com
2024-08-19  cgroup: update some statement about delegation  (Chen Ridong)
The comment in cgroup_file_write is missing some interfaces, such as 'cgroup.threads'. All delegatable files are listed in '/sys/kernel/cgroup/delegate', so update the comment in cgroup_file_write. Besides, add a statement that files outside the namespace shouldn't be visible from inside the delegated namespace. tj: Reflowed text for consistency. Signed-off-by: Chen Ridong <chenridong@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-19  Merge tag 'printk-for-6.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux  (Linus Torvalds)
Pull printk fix from Petr Mladek:
 - Do not block printk on non-panic CPUs when they are dumping backtraces
* tag 'printk-for-6.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux:
  printk/panic: Allow cpu backtraces to be written into ringbuffer during panic
2024-08-17  Merge tag 'mm-hotfixes-stable-2024-08-17-19-34' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm  (Linus Torvalds)
Pull misc fixes from Andrew Morton:
 "16 hotfixes. All except one are for MM. 10 of these are cc:stable and the others pertain to post-6.10 issues. As usual with these merges, singletons and doubletons all over the place, no identifiable-by-me theme. Please see the lovingly curated changelogs to get the skinny"
* tag 'mm-hotfixes-stable-2024-08-17-19-34' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  mm/migrate: fix deadlock in migrate_pages_batch() on large folios
  alloc_tag: mark pages reserved during CMA activation as not tagged
  alloc_tag: introduce clear_page_tag_ref() helper function
  crash: fix riscv64 crash memory reserve dead loop
  selftests: memfd_secret: don't build memfd_secret test on unsupported arches
  mm: fix endless reclaim on machines with unaccepted memory
  selftests/mm: compaction_test: fix off by one in check_compaction()
  mm/numa: no task_numa_fault() call if PMD is changed
  mm/numa: no task_numa_fault() call if PTE is changed
  mm/vmalloc: fix page mapping if vm_area_alloc_pages() with high order fallback to order 0
  mm/memory-failure: use raw_spinlock_t in struct memory_failure_cpu
  mm: don't account memmap per-node
  mm: add system wide stats items category
  mm: don't account memmap on failure
  mm/hugetlb: fix hugetlb vs. core-mm PT locking
  mseal: fix is_madv_discard()
2024-08-17  Merge tag 'powerpc-6.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux  (Linus Torvalds)
Pull powerpc fixes from Michael Ellerman:
 - Fix crashes on 85xx with some configs since the recent hugepd rework.
 - Fix boot warning with hugepages and CONFIG_DEBUG_VIRTUAL on some platforms.
 - Don't enable offline cores when changing SMT modes, to match existing userspace behaviour.
Thanks to Christophe Leroy, Dr. David Alan Gilbert, Guenter Roeck, Nysal Jan K.A, Shrikanth Hegde, Thomas Gleixner, and Tyrel Datwyler.
* tag 'powerpc-6.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/topology: Check if a core is online
  cpu/SMT: Enable SMT only if a core is online
  powerpc/mm: Fix boot warning with hugepages and CONFIG_DEBUG_VIRTUAL
  powerpc/mm: Fix size of allocated PGDIR
  soc: fsl: qbman: remove unused struct 'cgr_comp'
2024-08-17  sched/eevdf: Propagate min_slice up the cgroup hierarchy  (Peter Zijlstra)
In the absence of an explicit cgroup slice configuration, make mixed slice lengths work with cgroups by propagating the min_slice up the hierarchy. This ensures the cgroup entity gets timely service to service its entities that have this timing constraint set on them. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105030.948188417@infradead.org
2024-08-17  sched/eevdf: Use sched_attr::sched_runtime to set request/slice suggestion  (Peter Zijlstra)
Allow applications to directly set a suggested request/slice length using sched_attr::sched_runtime. The implementation clamps the value to: 0.1[ms] <= slice <= 100[ms] which is 1/10 the size of HZ=1000 and 10 times the size of HZ=100. Applications should strive to use their periodic runtime at a high confidence interval (95%+) as the target slice. Using a smaller slice will introduce undue preemptions, while using a larger value will increase latency. For all the following examples assume a scheduling quantum of 8, and for consistency all examples have W=4: {A,B,C,D}(w=1,r=8): ABCD... +---+---+---+--- t=0, V=1.5 t=1, V=3.5 A |------< A |------< B |------< B |------< C |------< C |------< D |------< D |------< ---+*------+-------+--- ---+--*----+-------+--- t=2, V=5.5 t=3, V=7.5 A |------< A |------< B |------< B |------< C |------< C |------< D |------< D |------< ---+----*--+-------+--- ---+------*+-------+--- Note: 4 identical tasks in FIFO order ~~~ {A,B}(w=1,r=16) C(w=2,r=16) AACCBBCC... +---+---+---+--- t=0, V=1.25 t=2, V=5.25 A |--------------< A |--------------< B |--------------< B |--------------< C |------< C |------< ---+*------+-------+--- ---+----*--+-------+--- t=4, V=8.25 t=6, V=12.25 A |--------------< A |--------------< B |--------------< B |--------------< C |------< C |------< ---+-------*-------+--- ---+-------+---*---+--- Note: 1 heavy task -- because q=8, double r such that the deadline of the w=2 task doesn't go below q. Note: observe the full schedule becomes: W*max(r_i/w_i) = 4*2q = 8q in length. Note: the period of the heavy task is half the full period at: W*(r_i/w_i) = 4*(2q/2) = 4q ~~~ {A,C,D}(w=1,r=16) B(w=1,r=8): BAACCBDD... +---+---+---+--- t=0, V=1.5 t=1, V=3.5 A |--------------< A |---------------< B |------< B |------< C |--------------< C |--------------< D |--------------< D |--------------< ---+*------+-------+--- ---+--*----+-------+--- t=3, V=7.5 t=5, V=11.5 A |---------------< A |---------------< B |------< B |------< C |--------------< C |--------------< D |--------------< D |--------------< ---+------*+-------+--- ---+-------+--*----+--- t=6, V=13.5 A |---------------< B |------< C |--------------< D |--------------< ---+-------+----*--+--- Note: 1 short task -- again double r so that the deadline of the short task won't be below q. Made B short because its not the leftmost task, but is eligible with the 0,1,2,3 spread. Note: like with the heavy task, the period of the short task observes: W*(r_i/w_i) = 4*(1q/1) = 4q ~~~ A(w=1,r=16) B(w=1,r=8) C(w=2,r=16) BCCAABCC... +---+---+---+--- t=0, V=1.25 t=1, V=3.25 A |--------------< A |--------------< B |------< B |------< C |------< C |------< ---+*------+-------+--- ---+--*----+-------+--- t=3, V=7.25 t=5, V=11.25 A |--------------< A |--------------< B |------< B |------< C |------< C |------< ---+------*+-------+--- ---+-------+--*----+--- t=6, V=13.25 A |--------------< B |------< C |------< ---+-------+----*--+--- Note: 1 heavy and 1 short task -- combine them all. Note: both the short and heavy task end up with a period of 4q ~~~ A(w=1,r=16) B(w=2,r=16) C(w=1,r=8) BBCAABBC... 
+---+---+---+--- t=0, V=1 t=2, V=5 A |--------------< A |--------------< B |------< B |------< C |------< C |------< ---+*------+-------+--- ---+----*--+-------+--- t=3, V=7 t=5, V=11 A |--------------< A |--------------< B |------< B |------< C |------< C |------< ---+------*+-------+--- ---+-------+--*----+--- t=7, V=15 A |--------------< B |------< C |------< ---+-------+------*+--- Note: as before but permuted ~~~ From all this it can be deduced that, for the steady state: - the total period (P) of a schedule is: W*max(r_i/w_i) - the average period of a task is: W*(r_i/w_i) - each task obtains the fair share: w_i/W of each full period P Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105030.842834421@infradead.org
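A hypothetical userspace sketch of using the new hint: request roughly a 3 ms slice for a normal task via sched_attr::sched_runtime (the kernel clamps the value to the 0.1 ms to 100 ms range described above):

  #include <stdint.h>
  #include <string.h>
  #include <unistd.h>
  #include <sys/syscall.h>

  /* Minimal local definition; recent kernel/libc headers also provide it. */
  struct sched_attr {
          uint32_t size;
          uint32_t sched_policy;
          uint64_t sched_flags;
          int32_t  sched_nice;
          uint32_t sched_priority;
          uint64_t sched_runtime;         /* slice hint, in nanoseconds */
          uint64_t sched_deadline;
          uint64_t sched_period;
  };

  int main(void)
  {
          struct sched_attr attr;

          memset(&attr, 0, sizeof(attr));
          attr.size = sizeof(attr);
          attr.sched_policy = 0;                  /* SCHED_OTHER */
          attr.sched_runtime = 3 * 1000 * 1000;   /* ~3 ms slice */

          /* No glibc wrapper on older systems; use the raw syscall. */
          return syscall(SYS_sched_setattr, 0, &attr, 0);
  }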
2024-08-17  sched/eevdf: Allow shorter slices to wakeup-preempt  (Peter Zijlstra)
Part of the reason to have shorter slices is to improve responsiveness. Allow shorter slices to preempt longer slices on wakeup.

  Task                   | Runtime ms    | Switches | Avg delay ms  | Max delay ms   | Sum delay ms       |

  100ms massive_intr 500us cyclictest NO_PREEMPT_SHORT

  1 massive_intr:(5)     | 846018.956 ms | 779188   | avg: 0.273 ms | max: 58.337 ms | sum:212545.245 ms  |
  2 massive_intr:(5)     | 853450.693 ms | 792269   | avg: 0.275 ms | max: 71.193 ms | sum:218263.588 ms  |
  3 massive_intr:(5)     | 843888.920 ms | 771456   | avg: 0.277 ms | max: 92.405 ms | sum:213353.221 ms  |
  1 chromium-browse:(8)  | 53015.889 ms  | 131766   | avg: 0.463 ms | max: 36.341 ms | sum:60959.230 ms   |
  2 chromium-browse:(8)  | 53864.088 ms  | 136962   | avg: 0.480 ms | max: 27.091 ms | sum:65687.681 ms   |
  3 chromium-browse:(9)  | 53637.904 ms  | 132637   | avg: 0.481 ms | max: 24.756 ms | sum:63781.673 ms   |
  1 cyclictest:(5)       | 12615.604 ms  | 639689   | avg: 0.471 ms | max: 32.272 ms | sum:301351.094 ms  |
  2 cyclictest:(5)       | 12511.583 ms  | 642578   | avg: 0.448 ms | max: 44.243 ms | sum:287632.830 ms  |
  3 cyclictest:(5)       | 12545.867 ms  | 635953   | avg: 0.475 ms | max: 25.530 ms | sum:302374.658 ms  |

  100ms massive_intr 500us cyclictest PREEMPT_SHORT

  1 massive_intr:(5)     | 839843.919 ms | 837384   | avg: 0.264 ms | max: 74.366 ms | sum:221476.885 ms  |
  2 massive_intr:(5)     | 852449.913 ms | 845086   | avg: 0.252 ms | max: 68.162 ms | sum:212595.968 ms  |
  3 massive_intr:(5)     | 839180.725 ms | 836883   | avg: 0.266 ms | max: 69.742 ms | sum:222812.038 ms  |
  1 chromium-browse:(11) | 54591.481 ms  | 138388   | avg: 0.458 ms | max: 35.427 ms | sum:63401.508 ms   |
  2 chromium-browse:(8)  | 52034.541 ms  | 132276   | avg: 0.436 ms | max: 31.826 ms | sum:57732.958 ms   |
  3 chromium-browse:(8)  | 55231.771 ms  | 141892   | avg: 0.469 ms | max: 27.607 ms | sum:66538.697 ms   |
  1 cyclictest:(5)       | 13156.391 ms  | 667412   | avg: 0.373 ms | max: 38.247 ms | sum:249174.502 ms  |
  2 cyclictest:(5)       | 12688.939 ms  | 665144   | avg: 0.374 ms | max: 33.548 ms | sum:248509.392 ms  |
  3 cyclictest:(5)       | 13475.623 ms  | 669110   | avg: 0.370 ms | max: 37.819 ms | sum:247673.390 ms  |

As per the numbers, this makes the cyclictest (short slice) max-delay more consistent, and that consistency drops the sum-delay. The trade-off is that massive_intr (long slice) gets more context switches and a slight increase in sum-delay. Chunxin contributed did_preempt_short(), where a task that lost slice protection from PREEMPT_SHORT gets rescheduled once it becomes ineligible. [mike: numbers] Co-developed-by: Chunxin Zang <zangchunxin@lixiang.com> Signed-off-by: Chunxin Zang <zangchunxin@lixiang.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Mike Galbraith <umgwanakikbuti@gmail.com> Link: https://lkml.kernel.org/r/20240727105030.735459544@infradead.org
2024-08-17  sched/fair: Avoid re-setting virtual deadline on 'migrations'  (Peter Zijlstra)
During OSPM24 Youssef noted that migrations are re-setting the virtual deadline. Notably everything that does a dequeue-enqueue, like setting nice, changing preferred numa-node, and a myriad of other random crap, will cause this to happen. This shouldn't be. Preserve the relative virtual deadline across such dequeue/enqueue cycles. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105030.625119246@infradead.org
2024-08-17  sched/eevdf: Fixup PELT vs DELAYED_DEQUEUE  (Peter Zijlstra)
Note that tasks that are kept on the runqueue to burn off negative lag are not in fact runnable anymore; they'll get dequeued the moment they get picked. As such, don't count this time towards runnable. Thanks to Valentin for spotting I had this backwards initially. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105030.514088302@infradead.org
2024-08-17  sched/fair: Implement DELAY_ZERO  (Peter Zijlstra)
'Extend' DELAY_DEQUEUE by noting that since we wanted to dequeue them at the 0-lag point, truncate lag (e.g. don't let them earn positive lag). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105030.403750550@infradead.org
2024-08-17  sched/fair: Implement delayed dequeue  (Peter Zijlstra)
Extend / fix 86bfbb7ce4f6 ("sched/fair: Add lag based placement") by noting that lag is fundamentally a temporal measure. It should not be carried around indefinitely. OTOH it should also not be instantly discarded, doing so will allow a task to game the system by purposefully (micro) sleeping at the end of its time quantum. Since lag is intimately tied to the virtual time base, a wall-time based decay is also insufficient, notably competition is required for any of this to make sense. Instead, delay the dequeue and keep the 'tasks' on the runqueue, competing until they are eligible. Strictly speaking, we only care about keeping them until the 0-lag point, but that is a difficult proposition, instead carry them around until they get picked again, and dequeue them at that point. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105030.226163742@infradead.org