summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2023-09-20futex/pi: Fix recursive rt_mutex waiter statePeter Zijlstra
Some new assertions pointed out that the existing code has nested rt_mutex wait state in the futex code. Specifically, the futex_lock_pi() cancel case uses spin_lock() while there still is a rt_waiter enqueued for this task, resulting in a state where there are two waiters for the same task (and task_struct::pi_blocked_on gets scrambled). The reason to take hb->lock at this point is to avoid the wake_futex_pi() EAGAIN case. This happens when futex_top_waiter() and rt_mutex_top_waiter() state becomes inconsistent. The current rules are such that this inconsistency will not be observed. Notably the case that needs to be avoided is where futex_lock_pi() and futex_unlock_pi() interleave such that unlock will fail to observe a new waiter. *However* the case at hand is where a waiter is leaving, in this case the race means a waiter that is going away is not observed -- which is harmless, provided this race is explicitly handled. This is a somewhat dangerous proposition because the converse race is not observing a new waiter, which must absolutely not happen. But since the race is valid this cannot be asserted. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lkml.kernel.org/r/20230915151943.GD6743@noisy.programming.kicks-ass.net
2023-09-20locking/rtmutex: Add a lockdep assert to catch potential nested blockingThomas Gleixner
There used to be a BUG_ON(current->pi_blocked_on) in the lock acquisition functions, but that vanished in one of the rtmutex overhauls. Bring it back in form of a lockdep assert to catch code paths which take rtmutex based locks with current::pi_blocked_on != NULL. Reported-by: Crystal Wood <swood@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: "Peter Zijlstra (Intel)" <peterz@infradead.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20230908162254.999499-7-bigeasy@linutronix.de
2023-09-20locking/rtmutex: Use rt_mutex specific scheduler helpersSebastian Andrzej Siewior
Have rt_mutex use the rt_mutex specific scheduler helpers to avoid recursion vs rtlock on the PI state. [[ peterz: adapted to new names ]] Reported-by: Crystal Wood <swood@redhat.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20230908162254.999499-6-bigeasy@linutronix.de
2023-09-20sched: Provide rt_mutex specific scheduler helpersPeter Zijlstra
With PREEMPT_RT there is a rt_mutex recursion problem where sched_submit_work() can use an rtlock (aka spinlock_t). More specifically what happens is: mutex_lock() /* really rt_mutex */ ... __rt_mutex_slowlock_locked() task_blocks_on_rt_mutex() // enqueue current task as waiter // do PI chain walk rt_mutex_slowlock_block() schedule() sched_submit_work() ... spin_lock() /* really rtlock */ ... __rt_mutex_slowlock_locked() task_blocks_on_rt_mutex() // enqueue current task as waiter *AGAIN* // *CONFUSION* Fix this by making rt_mutex do the sched_submit_work() early, before it enqueues itself as a waiter -- before it even knows *if* it will wait. [[ basically Thomas' patch but with different naming and a few asserts added ]] Originally-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20230908162254.999499-5-bigeasy@linutronix.de
2023-09-20sched: Extract __schedule_loop()Thomas Gleixner
There are currently two implementations of this basic __schedule() loop, and there is soon to be a third. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20230908162254.999499-4-bigeasy@linutronix.de
2023-09-20locking/rtmutex: Avoid unconditional slowpath for DEBUG_RT_MUTEXESSebastian Andrzej Siewior
With DEBUG_RT_MUTEXES enabled the fast-path rt_mutex_cmpxchg_acquire() always fails and all lock operations take the slow path. Provide a new helper inline rt_mutex_try_acquire() which maps to rt_mutex_cmpxchg_acquire() in the non-debug case. For the debug case it invokes rt_mutex_slowtrylock() which can acquire a non-contended rtmutex under full debug coverage. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20230908162254.999499-3-bigeasy@linutronix.de
2023-09-20sched: Constrain locks in sched_submit_work()Peter Zijlstra
Even though sched_submit_work() is ran from preemptible context, it is discouraged to have it use blocking locks due to the recursion potential. Enforce this. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20230908162254.999499-2-bigeasy@linutronix.de
2023-09-19pidfd: prevent a kernel-doc warningRandy Dunlap
Change the comment to match the function name that the SYSCALL_DEFINE() macros generate to prevent a kernel-doc warning. kernel/pid.c:628: warning: expecting prototype for pidfd_open(). Prototype was for sys_pidfd_open() instead Link: https://lkml.kernel.org/r/20230912060822.2500-1-rdunlap@infradead.org Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Christian Brauner <brauner@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-09-19task_work: add kerneldoc annotation for 'data' argumentJens Axboe
A previous commit changed the arguments to task_work_cancel_match(), but didn't document all of them. Link: https://lkml.kernel.org/r/93938bff-baa3-4091-85f5-784aae297a07@kernel.dk Fixes: c7aab1a7c52b ("task_work: add helper for more targeted task_work canceling") Signed-off-by: Jens Axboe <axboe@kernel.dk> Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202309120307.zis3yQGe-lkp@intel.com/ Acked-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-09-19signal: Don't disable preemption in ptrace_stop() on PREEMPT_RTSebastian Andrzej Siewior
On PREEMPT_RT keeping preemption disabled during the invocation of cgroup_enter_frozen() is a problem because the function acquires css_set_lock which is a sleeping lock on PREEMPT_RT and must not be acquired with disabled preemption. The preempt-disabled section is only for performance optimisation reasons and can be avoided. Extend the comment and don't disable preemption before scheduling on PREEMPT_RT. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20230803100932.325870-3-bigeasy@linutronix.de
2023-09-19signal: Add a proper comment about preempt_disable() in ptrace_stop()Sebastian Andrzej Siewior
Commit 53da1d9456fe7 ("fix ptrace slowness") added a preempt-disable section between read_unlock() and the following schedule() invocation without explaining why it is needed. Replace the existing contentless comment with a proper explanation to clarify that it is not needed for correctness but for performance reasons. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20230803100932.325870-2-bigeasy@linutronix.de
2023-09-19bpf: Remove unused variables.Alexei Starovoitov
Remove unused prev_offset, min_size, krec_size variables. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202309190634.fL17FWoT-lkp@intel.com/ Fixes: aaa619ebccb2 ("bpf: Refactor check_btf_func and split into two phases") Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-19bpf: Fix bpf_throw warning on 32-bit archKumar Kartikeya Dwivedi
On 32-bit architectures, the pointer width is 32-bit, while we try to cast from a u64 down to it, the compiler complains on mismatch in integer size. Fix this by first casting to long which should match the pointer width on targets supported by Linux. Fixes: ec5290a178b7 ("bpf: Prevent KASAN false positive with bpf_throw") Reported-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Tested-by: Matthieu Baerts <matthieu.baerts@tessares.net> Link: https://lore.kernel.org/r/20230918155233.297024-3-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-19kernel/sched: Modify initial boot task idle setupLiam R. Howlett
Initial booting is setting the task flag to idle (PF_IDLE) by the call path sched_init() -> init_idle(). Having the task idle and calling call_rcu() in kernel/rcu/tiny.c means that TIF_NEED_RESCHED will be set. Subsequent calls to any cond_resched() will enable IRQs, potentially earlier than the IRQ setup has completed. Recent changes have caused just this scenario and IRQs have been enabled early. This causes a warning later in start_kernel() as interrupts are enabled before they are fully set up. Fix this issue by setting the PF_IDLE flag later in the boot sequence. Although the boot task was marked as idle since (at least) d80e4fda576d, I am not sure that it is wrong to do so. The forced context-switch on idle task was introduced in the tiny_rcu update, so I'm going to claim this fixes 5f6130fa52ee. Fixes: 5f6130fa52ee ("tiny_rcu: Directly force QS when call_rcu_[bh|sched]() on idle_task") Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/linux-mm/CAMuHMdWpvpWoDa=Ox-do92czYRvkok6_x6pYUH+ZouMcJbXy+Q@mail.gmail.com/
2023-09-19sched/fair: Rename check_preempt_curr() to wakeup_preempt()Ingo Molnar
The name is a bit opaque - make it clear that this is about wakeup preemption. Also rename the ->check_preempt_curr() methods similarly. Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2023-09-19sched/fair: Rename check_preempt_wakeup() to check_preempt_wakeup_fair()Ingo Molnar
Other scheduling classes already postfix their similar methods with the class name. Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2023-09-18cgroup/cpuset: Check partition conflict with housekeeping setupWaiman Long
A user can pre-configure certain CPUs in an isolated state at boot time with the "isolcpus" kernel boot command line option. Those CPUs will not be in the housekeeping_cpumask(HK_TYPE_DOMAIN) and so will not be in any sched domains. This may conflict with the partition setup at runtime. Those boot time isolated CPUs should only be used in an isolated partition. This patch adds the necessary check and disallows partition setup if the check fails. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-09-18cgroup/cpuset: Introduce remote partitionWaiman Long
One can use "cpuset.cpus.partition" to create multiple scheduling domains or to produce a set of isolated CPUs where load balancing is disabled. The former use case is less common but the latter one can be frequently used especially for the Telco use cases like DPDK. The existing "isolated" partition can be used to produce isolated CPUs if the applications have full control of a system. However, in a containerized environment where all the apps are run in a container, it is hard to distribute out isolated CPUs from the root down given the unified hierarchy nature of cgroup v2. The container running on isolated CPUs can be several layers down from the root. The current partition feature requires that all the ancestors of a leaf partition root must be parititon roots themselves. This can be hard to configure. This patch introduces a new type of partition called remote partition. A remote partition is a partition whose parent is not a partition root itself and its CPUs are acquired directly from available CPUs in the top cpuset through a hierachical distribution of exclusive CPUs down from it. By contrast, the existing type of partitions where their parents have to be valid partition roots are referred to as local partitions as they have to be clustered around a parent partition root. Child local partitons can be created under a remote partition, but a remote partition cannot be created under a local partition. We may relax this limitation in the future if there are use cases for such configuration. Manually writing to the "cpuset.cpus.exclusive" file is not necessary when creating local partitions. However, writing proper values to "cpuset.cpus.exclusive" down the cgroup hierarchy before the target remote partition root is mandatory for the creation of a remote partition. The value in "cpuset.cpus.exclusive.effective" may change if its "cpuset.cpus" or its parent's "cpuset.cpus.exclusive.effective" changes. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-09-18cgroup/cpuset: Add cpuset.cpus.exclusive for v2Waiman Long
This patch introduces a new writable "cpuset.cpus.exclusive" control file for v2 which will be added to non-root cpuset enabled cgroups. This new file enables user to set a smaller list of exclusive CPUs to be used in the creation of a cpuset partition. The value written to "cpuset.cpus.exclusive" may not be the effective value being used for the creation of cpuset partition, the effective value will show up in "cpuset.cpus.exclusive.effective" and it is subject to the constraint that it must also be a subset of cpus_allowed and parent's "cpuset.cpus.exclusive.effective". By writing to "cpuset.cpus.exclusive", "cpuset.cpus.exclusive.effective" may be set to a non-empty value even for cgroups that are not valid partition roots yet. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-09-18cgroup/cpuset: Add cpuset.cpus.exclusive.effective for v2Waiman Long
The creation of a cpuset partition means dedicating a set of exclusive CPUs to be used by a particular partition only. These exclusive CPUs will not be used by any cpusets outside of that partition. To enable more flexibility in creating partitions, we need a way to distribute exclusive CPUs that can be used in new partitions. Currently, we have a subparts_cpus cpumask in struct cpuset that tracks only the exclusive CPUs used by all the sub-partitions underneath a given cpuset. This patch reworks the way we do exclusive CPUs tracking. The subparts_cpus is now renamed to effective_xcpus which tracks the exclusive CPUs allocated to a partition root including those that are further distributed down to sub-partitions underneath it. IOW, it also includes the exclusive CPUs used by the current partition root. Note that effective_xcpus can contain offline CPUs and it will always be a subset of cpus_allowed. The renamed effective_xcpus is now exposed via a new read-only "cpuset.cpus.exclusive.effective" control file. The new effective_xcpus cpumask should be set to cpus_allowed when a cpuset becomes a partition root and be cleared if it is not a valid partition root. In the next patch, we will enable write to another new control file to enable further control of what can get into effective_xcpus. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-09-18cgroup/cpuset: Fix load balance state in update_partition_sd_lb()Waiman Long
Commit a86ce68078b2 ("cgroup/cpuset: Extract out CS_CPU_EXCLUSIVE & CS_SCHED_LOAD_BALANCE handling") adds a new helper function update_partition_sd_lb() to update the load balance state of the cpuset. However the new load balance is determined by just looking at whether the cpuset is a valid isolated partition root or not. That is not enough if the cpuset is not a valid partition root but its parent is in the isolated state (load balance off). Update the function to set the new state to be the same as its parent in this case like what has been done in commit c8c926200c55 ("cgroup/cpuset: Inherit parent's load balance state in v2"). Fixes: a86ce68078b2 ("cgroup/cpuset: Extract out CS_CPU_EXCLUSIVE & CS_SCHED_LOAD_BALANCE handling") Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-09-18Merge tag 'v6.6-rc2' into locking/core, to pick up fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2023-09-18cgroup: Avoid extra dereference in css_populate_dir()Kamalesh Babulal
Use css directly instead of dereferencing it from &cgroup->self, while adding the cgroup v2 cft base and psi files in css_populate_dir(). Both points to the same css, when css->ss is NULL, this avoids extra deferences and makes code consistent in usage across the function. Signed-off-by: Kamalesh Babulal <kamalesh.babulal@oracle.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-09-18cgroup: Check for ret during cgroup1_base_files cft additionKamalesh Babulal
There is no check for possible failure while populating cgroup1_base_files cft in css_populate_dir(), like its cgroup v2 counter parts cgroup_{base,psi}_files. In case of failure, the cgroup might not be set up right. Add ret value check to return on failure. Signed-off-by: Kamalesh Babulal <kamalesh.babulal@oracle.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-09-18workqueue: Fix missed pwq_release_worker creation in ↵Zqiang
wq_cpu_intensive_thresh_init() Currently, if the wq_cpu_intensive_thresh_us is set to specific value, will cause the wq_cpu_intensive_thresh_init() early exit and missed creation of pwq_release_worker. this commit therefore create the pwq_release_worker in advance before checking the wq_cpu_intensive_thresh_us. Signed-off-by: Zqiang <qiang.zhang1211@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: 967b494e2fd1 ("workqueue: Use a kthread_worker to release pool_workqueues")
2023-09-18workqueue: Removed double allocation of wq_update_pod_attrs_bufSteven Rostedt (Google)
First commit 2930155b2e272 ("workqueue: Initialize unbound CPU pods later in the boot") added the initialization of wq_update_pod_attrs_buf to workqueue_init_early(), and then latter on, commit 84193c07105c6 ("workqueue: Generalize unbound CPU pods") added it as well. This appeared in a kmemleak run where the second allocation made the first allocation leak. Fixes: 84193c07105c6 ("workqueue: Generalize unbound CPU pods") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-09-18printk: nbcon: Allow drivers to mark unsafe regions and check stateThomas Gleixner
For the write_atomic callback, the console driver may have unsafe regions that need to be appropriately marked. Provide functions that accept the nbcon_write_context struct to allow for the driver to enter and exit unsafe regions. Also provide a function for drivers to check if they are still the owner of the console. Co-developed-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Thomas Gleixner (Intel) <tglx@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20230916192007.608398-9-john.ogness@linutronix.de
2023-09-18printk: nbcon: Add emit function and callback function for atomic printingThomas Gleixner
Implement an emit function for nbcon consoles to output printk messages. It utilizes the lockless printk_get_next_message() and console_prepend_dropped() functions to retrieve/build the output message. The emit function includes the required safety points to check for handover/takeover and calls a new write_atomic callback of the console driver to output the message. It also includes proper handling for updating the nbcon console sequence number. A new nbcon_write_context struct is introduced. This is provided to the write_atomic callback and includes only the information necessary for performing atomic writes. Co-developed-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Thomas Gleixner (Intel) <tglx@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20230916192007.608398-8-john.ogness@linutronix.de
2023-09-18printk: nbcon: Add sequence handlingThomas Gleixner
Add an atomic_long_t field @nbcon_seq to the console struct to store the sequence number for nbcon consoles. For nbcon consoles this will be used instead of the non-atomic @seq field. The new field allows for safe atomic sequence number updates without requiring any locking. On 64bit systems the new field stores the full sequence number. On 32bit systems the new field stores the lower 32 bits of the sequence number, which are expanded to 64bit as needed by folding the values based on the sequence numbers available in the ringbuffer. For 32bit systems, having a 32bit representation in the console is sufficient. If a console ever gets more than 2^31 records behind the ringbuffer then this is the least of the problems. Co-developed-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Thomas Gleixner (Intel) <tglx@linutronix.de> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20230916192007.608398-7-john.ogness@linutronix.de
2023-09-18printk: nbcon: Add ownership state functionsThomas Gleixner
Provide functions that are related to the safe handover mechanism and allow console drivers to dynamically specify unsafe regions: - nbcon_context_can_proceed() Invoked by a console owner to check whether a handover request is pending or whether the console has been taken over by another context. If a handover request is pending, this function will also perform the handover, thus cancelling its own ownership. - nbcon_context_enter_unsafe()/nbcon_context_exit_unsafe() Invoked by a console owner to denote that the driver is about to enter or leave a critical region where a take over is unsafe. The exit variant is the point where the current owner releases the lock for a higher priority context which asked for the friendly handover. The unsafe state is stored in the console state and allows a new context to make informed decisions whether to attempt a takeover of such a console. The unsafe state is also available to the driver so that it can make informed decisions about the required actions and possibly take a special emergency path. Co-developed-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Thomas Gleixner (Intel) <tglx@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20230916192007.608398-6-john.ogness@linutronix.de
2023-09-18printk: nbcon: Add buffer managementThomas Gleixner
In case of hostile takeovers it must be ensured that the previous owner cannot scribble over the output buffer of the emergency/panic context. This is achieved by: - Adding a global output buffer instance for the panic context. This is the only situation where hostile takeovers can occur and there is always at most 1 panic context. - Allocating an output buffer per non-boot console upon console registration. This buffer is used by the console owner when not in panic context. (For boot consoles, the existing shared global legacy output buffer is used instead. Boot console printing will be synchronized with legacy console printing.) - Choosing the appropriate buffer is handled in the acquire/release functions. Co-developed-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Thomas Gleixner (Intel) <tglx@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20230916192007.608398-5-john.ogness@linutronix.de
2023-09-18printk: Make static printk buffers available to nbconJohn Ogness
The nbcon boot consoles also need printk buffers that are available very early. Since the nbcon boot consoles will also be serialized by the console_lock, they can use the same static printk buffers that the legacy consoles are using. Make the legacy static printk buffers available outside of printk.c so they can be used by nbcon.c. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20230916192007.608398-4-john.ogness@linutronix.de
2023-09-18printk: nbcon: Add acquire/release logicThomas Gleixner
Add per console acquire/release functionality. The state of the console is maintained in the "nbcon_state" atomic variable. The console is locked when: - The 'prio' field contains the priority of the context that owns the console. Only higher priority contexts are allowed to take over the lock. A value of 0 (NBCON_PRIO_NONE) means the console is not locked. - The 'cpu' field denotes on which CPU the console is locked. It is used to prevent busy waiting on the same CPU. Also it informs the lock owner that it has lost the lock in a more complex scenario when the lock was taken over by a higher priority context, released, and taken on another CPU with the same priority as the interrupted owner. The acquire mechanism uses a few more fields: - The 'req_prio' field is used by the handover approach to make the current owner aware that there is a context with a higher priority waiting for the friendly handover. - The 'unsafe' field allows to take over the console in a safe way in the middle of emitting a message. The field is set only when accessing some shared resources or when the console device is manipulated. It can be cleared, for example, after emitting one character when the console device is in a consistent state. - The 'unsafe_takeover' field is set when a hostile takeover took the console in an unsafe state. The console will stay in the unsafe state until re-initialized. The acquire mechanism uses three approaches: 1) Direct acquire when the console is not owned or is owned by a lower priority context and is in a safe state. 2) Friendly handover mechanism uses a request/grant handshake. It is used when the current owner has lower priority and the console is in an unsafe state. The requesting context: a) Sets its priority into the 'req_prio' field. b) Waits (with a timeout) for the owning context to unlock the console. c) Takes the lock and clears the 'req_prio' field. The owning context: a) Observes the 'req_prio' field set on exit from the unsafe console state. b) Gives up console ownership by clearing the 'prio' field. 3) Unsafe hostile takeover allows to take over the lock even when the console is an unsafe state. It is used only in panic() by the final attempt to flush consoles in a try and hope mode. Note that separate record buffers are used in panic(). As a result, the messages can be read and formatted without any risk even after using the hostile takeover in unsafe state. The release function simply clears the 'prio' field. All operations on @console::nbcon_state are atomic cmpxchg based to handle concurrency. The acquire/release functions implement only minimal policies: - Preference for higher priority contexts. - Protection of the panic CPU. All other policy decisions must be made at the call sites: - What is marked as an unsafe section. - Whether to spin-wait if there is already an owner and the console is in an unsafe state. - Whether to attempt an unsafe hostile takeover. The design allows to implement the well known: acquire() output_one_printk_record() release() The output of one printk record might be interrupted with a higher priority context. The new owner is supposed to reprint the entire interrupted record from scratch. Co-developed-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Thomas Gleixner (Intel) <tglx@linutronix.de> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20230916192007.608398-3-john.ogness@linutronix.de
2023-09-18printk: Add non-BKL (nbcon) console basic infrastructureThomas Gleixner
The current console/printk subsystem is protected by a Big Kernel Lock, (aka console_lock) which has ill defined semantics and is more or less stateless. This puts severe limitations on the console subsystem and makes forced takeover and output in emergency and panic situations a fragile endeavour that is based on try and pray. The goal of non-BKL (nbcon) consoles is to break out of the console lock jail and to provide a new infrastructure that avoids the pitfalls and also allows console drivers to be gradually converted over. The proposed infrastructure aims for the following properties: - Per console locking instead of global locking - Per console state that allows to make informed decisions - Stateful handover and takeover As a first step, state is added to struct console. The per console state is an atomic_t using a 32bit bit field. Reserve state bits, which will be populated later in the series. Wire it up into the console register/unregister functionality. It was decided to use a bitfield because using a plain u32 with mask/shift operations resulted in uncomprehensible code. Co-developed-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Thomas Gleixner (Intel) <tglx@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20230916192007.608398-2-john.ogness@linutronix.de
2023-09-18sched/headers: Remove duplicated includes in kernel/sched/sched.hGUO Zihua
Remove duplicated includes of linux/cgroup.h and linux/psi.h. Both of these includes are included regardless of the config and they are all protected by ifndef, so no point including them again. Signed-off-by: GUO Zihua <guozihua@huawei.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230818015633.18370-1-guozihua@huawei.com
2023-09-18sched/fair: Ratelimit update to tg->load_avgAaron Lu
When using sysbench to benchmark Postgres in a single docker instance with sysbench's nr_threads set to nr_cpu, it is observed there are times update_cfs_group() and update_load_avg() shows noticeable overhead on a 2sockets/112core/224cpu Intel Sapphire Rapids(SPR): 13.75% 13.74% [kernel.vmlinux] [k] update_cfs_group 10.63% 10.04% [kernel.vmlinux] [k] update_load_avg Annotate shows the cycles are mostly spent on accessing tg->load_avg with update_load_avg() being the write side and update_cfs_group() being the read side. tg->load_avg is per task group and when different tasks of the same taskgroup running on different CPUs frequently access tg->load_avg, it can be heavily contended. E.g. when running postgres_sysbench on a 2sockets/112cores/224cpus Intel Sappire Rapids, during a 5s window, the wakeup number is 14millions and migration number is 11millions and with each migration, the task's load will transfer from src cfs_rq to target cfs_rq and each change involves an update to tg->load_avg. Since the workload can trigger as many wakeups and migrations, the access(both read and write) to tg->load_avg can be unbound. As a result, the two mentioned functions showed noticeable overhead. With netperf/nr_client=nr_cpu/UDP_RR, the problem is worse: during a 5s window, wakeup number is 21millions and migration number is 14millions; update_cfs_group() costs ~25% and update_load_avg() costs ~16%. Reduce the overhead by limiting updates to tg->load_avg to at most once per ms. The update frequency is a tradeoff between tracking accuracy and overhead. 1ms is chosen because PELT window is roughly 1ms and it delivered good results for the tests that I've done. After this change, the cost of accessing tg->load_avg is greatly reduced and performance improved. Detailed test results below. ============================== postgres_sysbench on SPR: 25% base: 42382±19.8% patch: 50174±9.5% (noise) 50% base: 67626±1.3% patch: 67365±3.1% (noise) 75% base: 100216±1.2% patch: 112470±0.1% +12.2% 100% base: 93671±0.4% patch: 113563±0.2% +21.2% ============================== hackbench on ICL: group=1 base: 114912±5.2% patch: 117857±2.5% (noise) group=4 base: 359902±1.6% patch: 361685±2.7% (noise) group=8 base: 461070±0.8% patch: 491713±0.3% +6.6% group=16 base: 309032±5.0% patch: 378337±1.3% +22.4% ============================= hackbench on SPR: group=1 base: 100768±2.9% patch: 103134±2.9% (noise) group=4 base: 413830±12.5% patch: 378660±16.6% (noise) group=8 base: 436124±0.6% patch: 490787±3.2% +12.5% group=16 base: 457730±3.2% patch: 680452±1.3% +48.8% ============================ netperf/udp_rr on ICL 25% base: 114413±0.1% patch: 115111±0.0% +0.6% 50% base: 86803±0.5% patch: 86611±0.0% (noise) 75% base: 35959±5.3% patch: 49801±0.6% +38.5% 100% base: 61951±6.4% patch: 70224±0.8% +13.4% =========================== netperf/udp_rr on SPR 25% base: 104954±1.3% patch: 107312±2.8% (noise) 50% base: 55394±4.6% patch: 54940±7.4% (noise) 75% base: 13779±3.1% patch: 36105±1.1% +162% 100% base: 9703±3.7% patch: 28011±0.2% +189% ============================================== netperf/tcp_stream on ICL (all in noise range) 25% base: 43092±0.1% patch: 42891±0.5% 50% base: 19278±14.9% patch: 22369±7.2% 75% base: 16822±3.0% patch: 17086±2.3% 100% base: 18216±0.6% patch: 18078±2.9% =============================================== netperf/tcp_stream on SPR (all in noise range) 25% base: 34491±0.3% patch: 34886±0.5% 50% base: 19278±14.9% patch: 22369±7.2% 75% base: 16822±3.0% patch: 17086±2.3% 100% base: 18216±0.6% patch: 18078±2.9% Reported-by: Nitin Tekchandani <nitin.tekchandani@intel.com> Suggested-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Aaron Lu <aaron.lu@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Reviewed-by: David Vernet <void@manifault.com> Tested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Tested-by: Swapnil Sapkal <Swapnil.Sapkal@amd.com> Link: https://lkml.kernel.org/r/20230912065808.2530-2-aaron.lu@intel.com
2023-09-18freezer,sched: Use saved_state to reduce some spurious wakeupsElliot Berman
After commit f5d39b020809 ("freezer,sched: Rewrite core freezer logic"), tasks that transition directly from TASK_FREEZABLE to TASK_FROZEN are always woken up on the thaw path. Prior to that commit, tasks could ask freezer to consider them "frozen enough" via freezer_do_not_count(). The commit replaced freezer_do_not_count() with a TASK_FREEZABLE state which allows freezer to immediately mark the task as TASK_FROZEN without waking up the task. This is efficient for the suspend path, but on the thaw path, the task is always woken up even if the task didn't need to wake up and goes back to its TASK_(UN)INTERRUPTIBLE state. Although these tasks are capable of handling of the wakeup, we can observe a power/perf impact from the extra wakeup. We observed on Android many tasks wait in the TASK_FREEZABLE state (particularly due to many of them being binder clients). We observed nearly 4x the number of tasks and a corresponding linear increase in latency and power consumption when thawing the system. The latency increased from ~15ms to ~50ms. Avoid the spurious wakeups by saving the state of TASK_FREEZABLE tasks. If the task was running before entering TASK_FROZEN state (__refrigerator()) or if the task received a wake up for the saved state, then the task is woken on thaw. saved_state from PREEMPT_RT locks can be re-used because freezer would not stomp on the rtlock wait flow: TASK_RTLOCK_WAIT isn't considered freezable. Reported-by: Prakash Viswalingam <quic_prakashv@quicinc.com> Signed-off-by: Elliot Berman <quic_eberman@quicinc.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2023-09-18sched/core: Remove ifdeffery for saved_stateElliot Berman
In preparation for freezer to also use saved_state, remove the CONFIG_PREEMPT_RT compilation guard around saved_state. On the arm64 platform I tested which did not have CONFIG_PREEMPT_RT, there was no statistically significant deviation by applying this patch. Test methodology: perf bench sched message -g 40 -l 40 Signed-off-by: Elliot Berman <quic_eberman@quicinc.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2023-09-17Merge tag 'sched-urgent-2023-09-17' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: "Fix a performance regression on large SMT systems, an Intel SMT4 balancing bug, and a topology setup bug on (Intel) hybrid processors" * tag 'sched-urgent-2023-09-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/sched: Restore the SD_ASYM_PACKING flag in the DIE domain sched/fair: Fix SMT4 group_smt_balance handling sched/fair: Optimize should_we_balance() for large SMT systems
2023-09-17Merge tag 'core-urgent-2023-09-17' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull WARN fix from Ingo Molnar: "Fix a missing preempt-enable in the WARN() slowpath" * tag 'core-urgent-2023-09-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: panic: Reenable preemption in WARN slowpath
2023-09-17Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller
Alexei Starovoitov says: ==================== The following pull-request contains BPF updates for your *net-next* tree. We've added 73 non-merge commits during the last 9 day(s) which contain a total of 79 files changed, 5275 insertions(+), 600 deletions(-). The main changes are: 1) Basic BTF validation in libbpf, from Andrii Nakryiko. 2) bpf_assert(), bpf_throw(), exceptions in bpf progs, from Kumar Kartikeya Dwivedi. 3) next_thread cleanups, from Oleg Nesterov. 4) Add mcpu=v4 support to arm32, from Puranjay Mohan. 5) Add support for __percpu pointers in bpf progs, from Yonghong Song. 6) Fix bpf tailcall interaction with bpf trampoline, from Leon Hwang. 7) Raise irq_work in bpf_mem_alloc while irqs are disabled to improve refill probabablity, from Hou Tao. Please consider pulling these changes from: git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git Thanks a lot! Also thanks to reporters, reviewers and testers of commits in this pull-request: Alan Maguire, Andrey Konovalov, Dave Marchevsky, "Eric W. Biederman", Jiri Olsa, Maciej Fijalkowski, Quentin Monnet, Russell King (Oracle), Song Liu, Stanislav Fomichev, Yonghong Song ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2023-09-16bpf: Fix kfunc callback register type handlingKumar Kartikeya Dwivedi
The kfunc code to handle KF_ARG_PTR_TO_CALLBACK does not check the reg type before using reg->subprogno. This can accidently permit invalid pointers from being passed into callback helpers (e.g. silently from different paths). Likewise, reg->subprogno from the per-register type union may not be meaningful either. We need to reject any other type except PTR_TO_FUNC. Acked-by: Dave Marchevsky <davemarchevsky@fb.com> Fixes: 5d92ddc3de1b ("bpf: Add callback validation to kfunc verifier logic") Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20230912233214.1518551-14-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-16bpf: Disallow fentry/fexit/freplace for exception callbacksKumar Kartikeya Dwivedi
During testing, it was discovered that extensions to exception callbacks had no checks, upon running a testcase, the kernel ended up running off the end of a program having final call as bpf_throw, and hitting int3 instructions. The reason is that while the default exception callback would have reset the stack frame to return back to the main program's caller, the replacing extension program will simply return back to bpf_throw, which will instead return back to the program and the program will continue execution, now in an undefined state where anything could happen. The way to support extensions to an exception callback would be to mark the BPF_PROG_TYPE_EXT main subprog as an exception_cb, and prevent it from calling bpf_throw. This would make the JIT produce a prologue that restores saved registers and reset the stack frame. But let's not do that until there is a concrete use case for this, and simply disallow this for now. Similar issues will exist for fentry and fexit cases, where trampoline saves data on the stack when invoking exception callback, which however will then end up resetting the stack frame, and on return, the fexit program will never will invoked as the return address points to the main program's caller in the kernel. Instead of additional complexity and back and forth between the two stacks to enable such a use case, simply forbid it. One key point here to note is that currently X86_TAIL_CALL_OFFSET didn't require any modifications, even though we emit instructions before the corresponding endbr64 instruction. This is because we ensure that a main subprog never serves as an exception callback, and therefore the exception callback (which will be a global subprog) can never serve as the tail call target, eliminating any discrepancies. However, once we support a BPF_PROG_TYPE_EXT to also act as an exception callback, it will end up requiring change to the tail call offset to account for the extra instructions. For simplicitly, tail calls could be disabled for such targets. Noting the above, it appears better to wait for a concrete use case before choosing to permit extension programs to replace exception callbacks. As a precaution, we disable fentry and fexit for exception callbacks as well. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20230912233214.1518551-13-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-16bpf: Detect IP == ksym.end as part of BPF programKumar Kartikeya Dwivedi
Now that bpf_throw kfunc is the first such call instruction that has noreturn semantics within the verifier, this also kicks in dead code elimination in unprecedented ways. For one, any instruction following a bpf_throw call will never be marked as seen. Moreover, if a callchain ends up throwing, any instructions after the call instruction to the eventually throwing subprog in callers will also never be marked as seen. The tempting way to fix this would be to emit extra 'int3' instructions which bump the jited_len of a program, and ensure that during runtime when a program throws, we can discover its boundaries even if the call instruction to bpf_throw (or to subprogs that always throw) is emitted as the final instruction in the program. An example of such a program would be this: do_something(): ... r0 = 0 exit foo(): r1 = 0 call bpf_throw r0 = 0 exit bar(cond): if r1 != 0 goto pc+2 call do_something exit call foo r0 = 0 // Never seen by verifier exit // main(ctx): r1 = ... call bar r0 = 0 exit Here, if we do end up throwing, the stacktrace would be the following: bpf_throw foo bar main In bar, the final instruction emitted will be the call to foo, as such, the return address will be the subsequent instruction (which the JIT emits as int3 on x86). This will end up lying outside the jited_len of the program, thus, when unwinding, we will fail to discover the return address as belonging to any program and end up in a panic due to the unreliable stack unwinding of BPF programs that we never expect. To remedy this case, make bpf_prog_ksym_find treat IP == ksym.end as part of the BPF program, so that is_bpf_text_address returns true when such a case occurs, and we are able to unwind reliably when the final instruction ends up being a call instruction. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20230912233214.1518551-12-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-16bpf: Prevent KASAN false positive with bpf_throwKumar Kartikeya Dwivedi
The KASAN stack instrumentation when CONFIG_KASAN_STACK is true poisons the stack of a function when it is entered and unpoisons it when leaving. However, in the case of bpf_throw, we will never return as we switch our stack frame to the BPF exception callback. Later, this discrepancy will lead to confusing KASAN splats when kernel resumes execution on return from the BPF program. Fix this by unpoisoning everything below the stack pointer of the BPF program, which should cover the range that would not be unpoisoned. An example splat is below: BUG: KASAN: stack-out-of-bounds in stack_trace_consume_entry+0x14e/0x170 Write of size 8 at addr ffffc900013af958 by task test_progs/227 CPU: 0 PID: 227 Comm: test_progs Not tainted 6.5.0-rc2-g43f1c6c9052a-dirty #26 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-2.fc39 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x4a/0x80 print_report+0xcf/0x670 ? arch_stack_walk+0x79/0x100 kasan_report+0xda/0x110 ? stack_trace_consume_entry+0x14e/0x170 ? stack_trace_consume_entry+0x14e/0x170 ? __pfx_stack_trace_consume_entry+0x10/0x10 stack_trace_consume_entry+0x14e/0x170 ? __sys_bpf+0xf2e/0x41b0 arch_stack_walk+0x8b/0x100 ? __sys_bpf+0xf2e/0x41b0 ? bpf_prog_test_run_skb+0x341/0x1c70 ? bpf_prog_test_run_skb+0x341/0x1c70 stack_trace_save+0x9b/0xd0 ? __pfx_stack_trace_save+0x10/0x10 ? __kasan_slab_free+0x109/0x180 ? bpf_prog_test_run_skb+0x341/0x1c70 ? __sys_bpf+0xf2e/0x41b0 ? __x64_sys_bpf+0x78/0xc0 ? do_syscall_64+0x3c/0x90 ? entry_SYSCALL_64_after_hwframe+0x6e/0xd8 kasan_save_stack+0x33/0x60 ? kasan_save_stack+0x33/0x60 ? kasan_set_track+0x25/0x30 ? kasan_save_free_info+0x2b/0x50 ? __kasan_slab_free+0x109/0x180 ? kmem_cache_free+0x191/0x460 ? bpf_prog_test_run_skb+0x341/0x1c70 kasan_set_track+0x25/0x30 kasan_save_free_info+0x2b/0x50 __kasan_slab_free+0x109/0x180 kmem_cache_free+0x191/0x460 bpf_prog_test_run_skb+0x341/0x1c70 ? __pfx_bpf_prog_test_run_skb+0x10/0x10 ? __fget_light+0x51/0x220 __sys_bpf+0xf2e/0x41b0 ? __might_fault+0xa2/0x170 ? __pfx___sys_bpf+0x10/0x10 ? lock_release+0x1de/0x620 ? __might_fault+0xcd/0x170 ? __pfx_lock_release+0x10/0x10 ? __pfx_blkcg_maybe_throttle_current+0x10/0x10 __x64_sys_bpf+0x78/0xc0 ? syscall_enter_from_user_mode+0x20/0x50 do_syscall_64+0x3c/0x90 entry_SYSCALL_64_after_hwframe+0x6e/0xd8 RIP: 0033:0x7f0fbb38880d Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d f3 45 12 00 f7 d8 64 89 01 48 RSP: 002b:00007ffe13907de8 EFLAGS: 00000206 ORIG_RAX: 0000000000000141 RAX: ffffffffffffffda RBX: 00007ffe13908708 RCX: 00007f0fbb38880d RDX: 0000000000000050 RSI: 00007ffe13907e20 RDI: 000000000000000a RBP: 00007ffe13907e00 R08: 0000000000000000 R09: 00007ffe13907e20 R10: 0000000000000064 R11: 0000000000000206 R12: 0000000000000003 R13: 0000000000000000 R14: 00007f0fbb532000 R15: 0000000000cfbd90 </TASK> The buggy address belongs to stack of task test_progs/227 KASAN internal error: frame info validation failed; invalid marker: 0 The buggy address belongs to the virtual mapping at [ffffc900013a8000, ffffc900013b1000) created by: kernel_clone+0xcd/0x600 The buggy address belongs to the physical page: page:00000000b70f4332 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x11418f flags: 0x2fffe0000000000(node=0|zone=2|lastcpupid=0x7fff) page_type: 0xffffffff() raw: 02fffe0000000000 0000000000000000 dead000000000122 0000000000000000 raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffffc900013af800: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ffffc900013af880: 00 00 00 f1 f1 f1 f1 00 00 00 f3 f3 f3 f3 f3 00 >ffffc900013af900: 00 00 00 00 00 00 00 00 00 00 00 f1 00 00 00 00 ^ ffffc900013af980: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ffffc900013afa00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ================================================================== Disabling lock debugging due to kernel taint Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Andrey Konovalov <andreyknvl@gmail.com> Link: https://lore.kernel.org/r/20230912233214.1518551-11-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-16bpf: Treat first argument as return value for bpf_throwKumar Kartikeya Dwivedi
In case of the default exception callback, change the behavior of bpf_throw, where the passed cookie value is no longer ignored, but is instead the return value of the default exception callback. As such, we need to place restrictions on the value being passed into bpf_throw in such a case, only allowing those permitted by the check_return_code function. Thus, bpf_throw can now control the return value of the program from each call site without having the user install a custom exception callback just to override the return value when an exception is thrown. We also modify the hidden subprog instructions to now move BPF_REG_1 to BPF_REG_0, so as to set the return value before exit in the default callback. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20230912233214.1518551-9-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-16bpf: Perform CFG walk for exception callbackKumar Kartikeya Dwivedi
Since exception callbacks are not referenced using bpf_pseudo_func and bpf_pseudo_call instructions, check_cfg traversal will never explore instructions of the exception callback. Even after adding the subprog, the program will then fail with a 'unreachable insn' error. We thus need to begin walking from the start of the exception callback again in check_cfg after a complete CFG traversal finishes, so as to explore the CFG rooted at the exception callback. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20230912233214.1518551-8-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-16bpf: Add support for custom exception callbacksKumar Kartikeya Dwivedi
By default, the subprog generated by the verifier to handle a thrown exception hardcodes a return value of 0. To allow user-defined logic and modification of the return value when an exception is thrown, introduce the 'exception_callback:' declaration tag, which marks a callback as the default exception handler for the program. The format of the declaration tag is 'exception_callback:<value>', where <value> is the name of the exception callback. Each main program can be tagged using this BTF declaratiion tag to associate it with an exception callback. In case the tag is absent, the default callback is used. As such, the exception callback cannot be modified at runtime, only set during verification. Allowing modification of the callback for the current program execution at runtime leads to issues when the programs begin to nest, as any per-CPU state maintaing this information will have to be saved and restored. We don't want it to stay in bpf_prog_aux as this takes a global effect for all programs. An alternative solution is spilling the callback pointer at a known location on the program stack on entry, and then passing this location to bpf_throw as a parameter. However, since exceptions are geared more towards a use case where they are ideally never invoked, optimizing for this use case and adding to the complexity has diminishing returns. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20230912233214.1518551-7-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-16bpf: Refactor check_btf_func and split into two phasesKumar Kartikeya Dwivedi
This patch splits the check_btf_info's check_btf_func check into two separate phases. The first phase sets up the BTF and prepares func_info, but does not perform any validation of required invariants for subprogs just yet. This is left to the second phase, which happens where check_btf_info executes currently, and performs the line_info and CO-RE relocation. The reason to perform this split is to obtain the userspace supplied func_info information before we perform the add_subprog call, where we would now require finding and adding subprogs that may not have a bpf_pseudo_call or bpf_pseudo_func instruction in the program. We require this as we want to enable userspace to supply exception callbacks that can override the default hidden subprogram generated by the verifier (which performs a hardcoded action). In such a case, the exception callback may never be referenced in an instruction, but will still be suitably annotated (by way of BTF declaration tags). For finding this exception callback, we would require the program's BTF information, and the supplied func_info information which maps BTF type IDs to subprograms. Since the exception callback won't actually be referenced through instructions, later checks in check_cfg and do_check_subprogs will not verify the subprog. This means that add_subprog needs to add them in the add_subprog_and_kfunc phase before we move forward, which is why the BTF and func_info are required at that point. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20230912233214.1518551-6-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-16bpf: Implement BPF exceptionsKumar Kartikeya Dwivedi
This patch implements BPF exceptions, and introduces a bpf_throw kfunc to allow programs to throw exceptions during their execution at runtime. A bpf_throw invocation is treated as an immediate termination of the program, returning back to its caller within the kernel, unwinding all stack frames. This allows the program to simplify its implementation, by testing for runtime conditions which the verifier has no visibility into, and assert that they are true. In case they are not, the program can simply throw an exception from the other branch. BPF exceptions are explicitly *NOT* an unlikely slowpath error handling primitive, and this objective has guided design choices of the implementation of the them within the kernel (with the bulk of the cost for unwinding the stack offloaded to the bpf_throw kfunc). The implementation of this mechanism requires use of add_hidden_subprog mechanism introduced in the previous patch, which generates a couple of instructions to move R1 to R0 and exit. The JIT then rewrites the prologue of this subprog to take the stack pointer and frame pointer as inputs and reset the stack frame, popping all callee-saved registers saved by the main subprog. The bpf_throw function then walks the stack at runtime, and invokes this exception subprog with the stack and frame pointers as parameters. Reviewers must take note that currently the main program is made to save all callee-saved registers on x86_64 during entry into the program. This is because we must do an equivalent of a lightweight context switch when unwinding the stack, therefore we need the callee-saved registers of the caller of the BPF program to be able to return with a sane state. Note that we have to additionally handle r12, even though it is not used by the program, because when throwing the exception the program makes an entry into the kernel which could clobber r12 after saving it on the stack. To be able to preserve the value we received on program entry, we push r12 and restore it from the generated subprogram when unwinding the stack. For now, bpf_throw invocation fails when lingering resources or locks exist in that path of the program. In a future followup, bpf_throw will be extended to perform frame-by-frame unwinding to release lingering resources for each stack frame, removing this limitation. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20230912233214.1518551-5-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>