path: root/kernel
2023-01-03  rcu-tasks: Improve comments explaining tasks_rcu_exit_srcu purpose  [Frederic Weisbecker]
Make sure we don't need to look again into the depths of git blame in order not to miss a subtle part about how rcu-tasks is dealing with exiting tasks. Suggested-by: Boqun Feng <boqun.feng@gmail.com> Suggested-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Suggested-by: Paul E. McKenney <paulmck@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-03  rcu-tasks: Use accurate runstart time for RCU Tasks boot-time testing  [Zqiang]
Currently, test_rcu_tasks_callback() reads from the jiffies counter only once when this function is invoked. This introduces inaccuracies because of the latencies induced by the synchronize_rcu_tasks*() invocations. This commit therefore re-fetches the jiffies counter at the start of each RCU Tasks test, using that value as the test's runstart time and thus avoiding penalizing later tests for the latencies induced by earlier tests. Signed-off-by: Zqiang <qiang1.zhang@intel.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-03  srcu: Update comment after the index flip  [Paul E. McKenney]
Because there is not guaranteed to be a full memory barrier between the ->srcu_unlock_count increment of an srcu_read_unlock() and the ->srcu_lock_count increment of the next srcu_read_lock(), this next srcu_read_lock() is not guaranteed to see the effect of the index flip just prior to this comment. However, this next srcu_read_lock() will execute a full memory barrier, so the srcu_read_lock() after that is guaranteed to see that index flip.

This guarantee is illustrated by the following diagram of events and the litmus test following that.

------------------------------------------------------------------------
READER                               UPDATER
-------------                        ----------
                                     // idx is initially 0.

                                     srcu_flip() {
                                        smp_mb();
// RSCS
srcu_read_unlock() {
  smp_mb();
                                        idx++;    // P
                                        smp_mb(); // QQ
                                     }

                                     srcu_readers_unlock_idx(0) {
        ,--counted------------          count all unlock[0]; // Q
        |
  unlock[0]++;  // X
}
                                        smp_mb();
srcu_read_lock() {
  READ(idx) = 0;
        ,----                           count all lock[0]; // contributes imbalance of 1.
  lock[0]++;  ----counted
              |
  smp_mb(); // PP
}             |
              |                      }
              |
              |  // RSCS not going to effect above scan
              |
srcu_read_unlock() {
  smp_mb();   |
  unlock[0]++;|
}             |
              /
             /
srcu_read_lock() {
  READ(idx); // Y -----cannot be counted because of P (has to sample idx as 1)
  lock[1]++;
  ...
}
------------------------------------------------------------------------

This makes it similar to the store buffer pattern. Using X, Y, P and Q annotated above, we get:

------------------------------------------------------------------------
READER                               UPDATER
X (write)                            P (write)

smp_mb(); //PP                       smp_mb(); //QQ

Y (read)                             Q (read)
------------------------------------------------------------------------

ASCII art courtesy of Joel Fernandes. Reported-by: Joel Fernandes <joel@joelfernandes.org> Reported-by: Boqun Feng <boqun.feng@gmail.com> Reported-by: Frederic Weisbecker <frederic@kernel.org> Reported-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
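For reference, the store-buffer pattern that the diagram reduces to has the same shape as the SB+fencembonceonces litmus test shipped in the kernel's tools/memory-model/litmus-tests/ directory; a minimal version of that test (not part of this commit) is reproduced below, and its 'exists' clause can never be satisfied because each thread executes smp_mb() between its write and its read:

    C SB+fencembonceonces

    {}

    P0(int *x, int *y)
    {
        int r0;

        WRITE_ONCE(*x, 1);
        smp_mb();
        r0 = READ_ONCE(*y);
    }

    P1(int *x, int *y)
    {
        int r0;

        WRITE_ONCE(*y, 1);
        smp_mb();
        r0 = READ_ONCE(*x);
    }

    exists (0:r0=0 /\ 1:r0=0)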
2023-01-03  srcu: Yet more detail for srcu_readers_active_idx_check() comments  [Paul E. McKenney]
The comment in srcu_readers_active_idx_check() following the smp_mb() is out of date, hailing from a simpler time when preemption was disabled across the bulk of __srcu_read_lock(). The fact that preemption was disabled meant that the number of tasks that had fetched the old index but not yet incremented counters was limited by the number of CPUs. In our more complex modern times, the number of CPUs is no longer a limit. This commit therefore updates this comment, additionally giving more memory-ordering detail. [ paulmck: Apply Nt->Nc feedback from Joel Fernandes. ] Reported-by: Boqun Feng <boqun.feng@gmail.com> Reported-by: Frederic Weisbecker <frederic@kernel.org> Reported-by: "Joel Fernandes (Google)" <joel@joelfernandes.org> Reported-by: Neeraj Upadhyay <neeraj.iitr10@gmail.com> Reported-by: Uladzislau Rezki <urezki@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-03  srcu: Remove needless rcu_seq_done() check while holding read lock  [Pingfan Liu]
The srcu_gp_start_if_needed() function now read-holds the srcu_struct whose grace period is being started, which means that the corresponding SRCU grace period cannot end. This in turn means that the SRCU grace-period sequence number returned by rcu_seq_snap() cannot expire during this time. And that means that the calls to rcu_seq_done() in srcu_funnel_exp_start() and srcu_funnel_gp_start() can never return true. This commit therefore removes these rcu_seq_done() checks, but adds checks in kernels built with CONFIG_PROVE_RCU=y that splat if rcu_seq_done() does somehow return true. [ paulmck: Rearrange checks to handle kernels built with lockdep. ] Signed-off-by: Pingfan Liu <kernelfans@gmail.com> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> To: rcu@vger.kernel.org Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-03  rcu: Add test code for semaphore-like SRCU readers  [Paul E. McKenney]
This commit adds trivial test code for srcu_down_read() and srcu_up_read(). Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-03  srcu: Fix the comparison in srcu_invl_snp_seq()  [Pingfan Liu]
A grace-period sequence number contains two fields: counter and state. SRCU_SNP_INIT_SEQ provides a guaranteed invalid value for grace-period sequence numbers in newly allocated srcu_node structures' ->srcu_have_cbs[] and ->srcu_gp_seq_needed_exp fields. The point of the comparison in srcu_invl_snp_seq() is not to detect invalid grace-period sequence numbers in general, but rather to detect a newly allocated srcu_node structure whose ->srcu_have_cbs[] and ->srcu_gp_seq_needed_exp fields need to be brought into line with the srcu_struct structure's ->srcu_gp_seq field. This commit therefore causes srcu_invl_snp_seq() to compare both fields of the specified grace-period sequence number. Signed-off-by: Pingfan Liu <kernelfans@gmail.com> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: <rcu@vger.kernel.org> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
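A minimal sketch of the resulting comparison, assuming the helper keeps its current shape in kernel/rcu/srcutree.c (this is an illustration of the idea, not the literal diff):

    /*
     * Is the given grace-period sequence number still the known-invalid
     * initializer, considering both its counter and its state bits?
     */
    static inline bool srcu_invl_snp_seq(unsigned long s)
    {
        return s == SRCU_SNP_INIT_SEQ;
    }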
2023-01-03  srcu: Delegate work to the boot cpu if using SRCU_SIZE_SMALL  [Pingfan Liu]
Commit 994f706872e6 ("srcu: Make Tree SRCU able to operate without snp_node array") assumes that cpu 0 is always online. However, there really are situations when some other CPU is the boot CPU, for example, when booting a kdump kernel with the maxcpus=1 boot parameter. On PowerPC, the kdump kernel can hang as follows: ... [ 1.740036] systemd[1]: Hostname set to <xyz.com> [ 243.686240] INFO: task systemd:1 blocked for more than 122 seconds. [ 243.686264] Not tainted 6.1.0-rc1 #1 [ 243.686272] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 243.686281] task:systemd state:D stack:0 pid:1 ppid:0 flags:0x00042000 [ 243.686296] Call Trace: [ 243.686301] [c000000016657640] [c000000016657670] 0xc000000016657670 (unreliable) [ 243.686317] [c000000016657830] [c00000001001dec0] __switch_to+0x130/0x220 [ 243.686333] [c000000016657890] [c000000010f607b8] __schedule+0x1f8/0x580 [ 243.686347] [c000000016657940] [c000000010f60bb4] schedule+0x74/0x140 [ 243.686361] [c0000000166579b0] [c000000010f699b8] schedule_timeout+0x168/0x1c0 [ 243.686374] [c000000016657a80] [c000000010f61de8] __wait_for_common+0x148/0x360 [ 243.686387] [c000000016657b20] [c000000010176bb0] __flush_work.isra.0+0x1c0/0x3d0 [ 243.686401] [c000000016657bb0] [c0000000105f2768] fsnotify_wait_marks_destroyed+0x28/0x40 [ 243.686415] [c000000016657bd0] [c0000000105f21b8] fsnotify_destroy_group+0x68/0x160 [ 243.686428] [c000000016657c40] [c0000000105f6500] inotify_release+0x30/0xa0 [ 243.686440] [c000000016657cb0] [c0000000105751a8] __fput+0xc8/0x350 [ 243.686452] [c000000016657d00] [c00000001017d524] task_work_run+0xe4/0x170 [ 243.686464] [c000000016657d50] [c000000010020e94] do_notify_resume+0x134/0x140 [ 243.686478] [c000000016657d80] [c00000001002eb18] interrupt_exit_user_prepare_main+0x198/0x270 [ 243.686493] [c000000016657de0] [c00000001002ec60] syscall_exit_prepare+0x70/0x180 [ 243.686505] [c000000016657e10] [c00000001000bf7c] system_call_vectored_common+0xfc/0x280 [ 243.686520] --- interrupt: 3000 at 0x7fffa47d5ba4 [ 243.686528] NIP: 00007fffa47d5ba4 LR: 0000000000000000 CTR: 0000000000000000 [ 243.686538] REGS: c000000016657e80 TRAP: 3000 Not tainted (6.1.0-rc1) [ 243.686548] MSR: 800000000000d033 <SF,EE,PR,ME,IR,DR,RI,LE> CR: 42044440 XER: 00000000 [ 243.686572] IRQMASK: 0 [ 243.686572] GPR00: 0000000000000006 00007ffffa606710 00007fffa48e7200 0000000000000000 [ 243.686572] GPR04: 0000000000000002 000000000000000a 0000000000000000 0000000000000001 [ 243.686572] GPR08: 000001000c172dd0 0000000000000000 0000000000000000 0000000000000000 [ 243.686572] GPR12: 0000000000000000 00007fffa4ff4bc0 0000000000000000 0000000000000000 [ 243.686572] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 [ 243.686572] GPR20: 0000000132dfdc50 000000000000000e 0000000000189375 0000000000000000 [ 243.686572] GPR24: 00007ffffa606ae0 0000000000000005 000001000c185490 000001000c172570 [ 243.686572] GPR28: 000001000c172990 000001000c184850 000001000c172e00 00007fffa4fedd98 [ 243.686683] NIP [00007fffa47d5ba4] 0x7fffa47d5ba4 [ 243.686691] LR [0000000000000000] 0x0 [ 243.686698] --- interrupt: 3000 [ 243.686708] INFO: task kworker/u16:1:24 blocked for more than 122 seconds. [ 243.686717] Not tainted 6.1.0-rc1 #1 [ 243.686724] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
[ 243.686733] task:kworker/u16:1 state:D stack:0 pid:24 ppid:2 flags:0x00000800 [ 243.686747] Workqueue: events_unbound fsnotify_mark_destroy_workfn [ 243.686758] Call Trace: [ 243.686762] [c0000000166736e0] [c00000004fd91000] 0xc00000004fd91000 (unreliable) [ 243.686775] [c0000000166738d0] [c00000001001dec0] __switch_to+0x130/0x220 [ 243.686788] [c000000016673930] [c000000010f607b8] __schedule+0x1f8/0x580 [ 243.686801] [c0000000166739e0] [c000000010f60bb4] schedule+0x74/0x140 [ 243.686814] [c000000016673a50] [c000000010f699b8] schedule_timeout+0x168/0x1c0 [ 243.686827] [c000000016673b20] [c000000010f61de8] __wait_for_common+0x148/0x360 [ 243.686840] [c000000016673bc0] [c000000010210840] __synchronize_srcu.part.0+0xa0/0xe0 [ 243.686855] [c000000016673c30] [c0000000105f2c64] fsnotify_mark_destroy_workfn+0xc4/0x1a0 [ 243.686868] [c000000016673ca0] [c000000010174ea8] process_one_work+0x2a8/0x570 [ 243.686882] [c000000016673d40] [c000000010175208] worker_thread+0x98/0x5e0 [ 243.686895] [c000000016673dc0] [c0000000101828d4] kthread+0x124/0x130 [ 243.686908] [c000000016673e10] [c00000001000cd40] ret_from_kernel_thread+0x5c/0x64 [ 366.566274] INFO: task systemd:1 blocked for more than 245 seconds. [ 366.566298] Not tainted 6.1.0-rc1 #1 [ 366.566305] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 366.566314] task:systemd state:D stack:0 pid:1 ppid:0 flags:0x00042000 [ 366.566329] Call Trace: ... The above splat occurs because PowerPC really does use maxcpus=1 instead of nr_cpus=1 in the kernel command line. Consequently, the (quite possibly non-zero) kdump CPU is the only online CPU in the kdump kernel. SRCU unconditionally queues a sdp->work on cpu 0, for which no worker thread has been created, so sdp->work will be never executed and __synchronize_srcu() will never be completed. This commit therefore replaces CPU ID 0 with get_boot_cpu_id() in key places in Tree SRCU. Since the CPU indicated by get_boot_cpu_id() is guaranteed to be online, this avoids the above splat. Signed-off-by: Pingfan Liu <kernelfans@gmail.com> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> To: rcu@vger.kernel.org Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-03  srcu: Release early_srcu resources when no longer in use  [Zqiang]
Kernels built with the CONFIG_TREE_SRCU Kconfig option set and then booted with rcupdate.rcu_self_test=1 and srcutree.convert_to_big=1 will test Tree SRCU during early boot. The early_srcu structure's srcu_node array will be allocated when init_srcu_struct_fields() is invoked, but after the test completes this early_srcu structure will not be used. This commit therefore invokes cleanup_srcu_struct() to free that srcu_node structure. Signed-off-by: Zqiang <qiang1.zhang@intel.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-03  rcu/kvfree: Split ready for reclaim objects from a batch  [Uladzislau Rezki (Sony)]
This patch splits the lists of objects so as to avoid sending any through RCU that have already been queued for more than one grace period. These long-term-resident objects are immediately freed. The remaining short-term-resident objects are queued for later freeing using queue_rcu_work(). This change avoids delaying workqueue handlers with synchronize_rcu() invocations. Yes, workqueue handlers are designed to handle blocking, but avoiding blocking when unnecessary improves performance during low-memory situations. Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-03  rcu/kvfree: Carefully reset number of objects in krcp  [Uladzislau Rezki (Sony)]
The schedule_delayed_monitor_work() function relies on the count of objects queued into any given kfree_rcu_cpu structure. This count is used to determine how quickly to schedule passing these objects to RCU. There are three pipes where pointers can be placed. When any pipe is offloaded, the kfree_rcu_cpu structure's ->count counter is set to zero, which is wrong because the other pipes might still be non-empty. This commit therefore maintains per-pipe counters, and introduces a krc_count() helper to access the aggregate value of those counters. Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
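A sketch of what such an aggregation helper can look like, assuming one atomic counter for the singly linked ->head channel plus one per bulk pipe (the field names here are illustrative, not necessarily the ones the commit uses):

    /* Sum the per-pipe counters instead of trusting one shared count. */
    static unsigned long krc_count(struct kfree_rcu_cpu *krcp)
    {
        return atomic_read(&krcp->head_count) +
               atomic_read(&krcp->bulk_count[0]) +
               atomic_read(&krcp->bulk_count[1]);
    }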
2023-01-03  rcu/kvfree: Use READ_ONCE() when accessing krcp->head  [Uladzislau Rezki (Sony)]
The need_offload_krc() function is now lock-free, which gives the compiler the freedom to return stale values from plain C-language loads of the kfree_rcu_cpu structure's ->head pointer. This commit therefore applies READ_ONCE() to these loads. Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-03  rcu/kvfree: Use a polled API to speed up a reclaim process  [Uladzislau Rezki (Sony)]
Currently all objects placed into a batch wait for a full grace period to elapse after that batch is ready to send to RCU. However, this can unnecessarily delay freeing of the first objects that were added to the batch. After all, several RCU grace periods might have elapsed since those objects were added, and if so, there is no point in further deferring their freeing.

This commit therefore adds per-page grace-period snapshots which are obtained from get_state_synchronize_rcu(). When the batch is ready to be passed to call_rcu(), each page's snapshot is checked by passing it to poll_state_synchronize_rcu(). If a given page's RCU grace period has already elapsed, its objects are freed immediately by kvfree_rcu_bulk(). Otherwise, these objects are freed after a call to synchronize_rcu(). This approach requires that the pages be traversed in reverse order, that is, the oldest ones first.

Test example:

  kvm.sh --memory 10G --torture rcuscale --allcpus --duration 1 \
    --kconfig CONFIG_NR_CPUS=64 \
    --kconfig CONFIG_RCU_NOCB_CPU=y \
    --kconfig CONFIG_RCU_NOCB_CPU_DEFAULT_ALL=y \
    --kconfig CONFIG_RCU_LAZY=n \
    --bootargs "rcuscale.kfree_rcu_test=1 rcuscale.kfree_nthreads=16 \
      rcuscale.holdoff=20 rcuscale.kfree_loops=10000 \
      torture.disable_onoff_at_boot" --trust-make

Before this commit:

  Total time taken by all kfree'ers: 8535693700 ns, loops: 10000, batches: 1188, memory footprint: 2248MB
  Total time taken by all kfree'ers: 8466933582 ns, loops: 10000, batches: 1157, memory footprint: 2820MB
  Total time taken by all kfree'ers: 5375602446 ns, loops: 10000, batches: 1130, memory footprint: 6502MB
  Total time taken by all kfree'ers: 7523283832 ns, loops: 10000, batches: 1006, memory footprint: 3343MB
  Total time taken by all kfree'ers: 6459171956 ns, loops: 10000, batches: 1150, memory footprint: 6549MB

After this commit:

  Total time taken by all kfree'ers: 8560060176 ns, loops: 10000, batches: 1787, memory footprint: 61MB
  Total time taken by all kfree'ers: 8573885501 ns, loops: 10000, batches: 1777, memory footprint: 93MB
  Total time taken by all kfree'ers: 8320000202 ns, loops: 10000, batches: 1727, memory footprint: 66MB
  Total time taken by all kfree'ers: 8552718794 ns, loops: 10000, batches: 1790, memory footprint: 75MB
  Total time taken by all kfree'ers: 8601368792 ns, loops: 10000, batches: 1724, memory footprint: 62MB

The reduction in memory footprint is well in excess of an order of magnitude. Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
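The polled grace-period API referred to above is the existing get_state_synchronize_rcu()/poll_state_synchronize_rcu() pair; a hedged sketch of the idea follows, with bnode->gp_snap standing in for the per-page snapshot this commit describes and kvfree_rcu_bulk() being the helper mentioned above:

    /* When a page of pointers joins the batch, snapshot the GP state. */
    bnode->gp_snap = get_state_synchronize_rcu();

    /* Later, when draining the batch, oldest pages first: */
    if (!poll_state_synchronize_rcu(bnode->gp_snap))
        synchronize_rcu();              /* Grace period not yet over. */
    kvfree_rcu_bulk(krcp, bnode, idx);  /* Now safe to free these objects. */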
2023-01-03  rcu/kvfree: Move need_offload_krc() out of krcp->lock  [Uladzislau Rezki (Sony)]
The need_offload_krc() function currently holds the krcp->lock in order to safely check krcp->head. This commit removes the need for this lock in that function by updating the krcp->head pointer using WRITE_ONCE() macro so that readers can carry out lockless loads of that pointer. Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-03  rcu/kvfree: Move bulk/list reclaim to separate functions  [Uladzislau Rezki (Sony)]
The kvfree_rcu() code maintains lists of pages of pointers, but also a singly linked list, with the latter being used when memory allocation fails. Traversal of these two types of lists is currently open coded. This commit simplifies the code by providing kvfree_rcu_bulk() and kvfree_rcu_list() functions, respectively, to traverse these two types of lists. This patch does not introduce any functional change. Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-03  rcu/kvfree: Switch to a generic linked list API  [Uladzislau Rezki (Sony)]
This commit improves the readability and maintainability of the kvfree_rcu() code by switching from an open-coded linked list to the standard Linux-kernel circular doubly linked list. This patch does not introduce any functional change. Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-03  rcu: Refactor kvfree_call_rcu() and high-level helpers  [Uladzislau Rezki (Sony)]
Currently a kvfree_call_rcu() takes an offset within a structure as a second parameter, so a helper such as a kvfree_rcu_arg_2() has to convert rcu_head and a freed ptr to an offset in order to pass it. That leads to an extra conversion on macro entry. Instead of converting, refactor the code in such a way that a pointer that has to be freed is passed directly to kvfree_call_rcu(). This patch does not make any functional change and is transparent to all kvfree_rcu() users. Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-03  rcu: Allow expedited RCU CPU stall warnings to dump task stacks  [Paul E. McKenney]
This commit introduces the rcupdate.rcu_exp_stall_task_details kernel boot parameter, which causes expedited RCU CPU stall warnings to dump the stacks of any tasks blocking the current expedited grace period. Reported-by: David Howells <dhowells@redhat.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-03  rcu: Test synchronous RCU grace periods at the end of rcu_init()  [Paul E. McKenney]
This commit tests synchronize_rcu() and synchronize_rcu_expedited() at the end of rcu_init(), in addition to the test already at the beginning of that function. These tests are run only in kernels built with CONFIG_PROVE_RCU=y. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
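A minimal sketch of what such an end-of-rcu_init() self-test can look like; the guard and placement are inferred from the description above, not quoted from the patch:

    /* Tail end of rcu_init(), CONFIG_PROVE_RCU=y kernels only. */
    if (IS_ENABLED(CONFIG_PROVE_RCU)) {
        synchronize_rcu();
        synchronize_rcu_expedited();
    }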
2023-01-03  rcu: Make rcu_blocking_is_gp() stop early-boot might_sleep()  [Zqiang]
Currently, rcu_blocking_is_gp() invokes might_sleep() even during early boot when interrupts are disabled and before the scheduler is scheduling. This is at best an accident waiting to happen. Therefore, this commit moves that might_sleep() under an rcu_scheduler_active check in order to ensure that might_sleep() is not invoked unless sleeping might actually happen. Signed-off-by: Zqiang <qiang1.zhang@intel.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
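A hedged sketch of the resulting shape of rcu_blocking_is_gp(), simplified from the description above (the real function may carry additional checks):

    static bool rcu_blocking_is_gp(void)
    {
        if (rcu_scheduler_active != RCU_SCHEDULER_INACTIVE)
            might_sleep();  /* Only complain once sleeping is actually possible. */
        return rcu_scheduler_active == RCU_SCHEDULER_INACTIVE;
    }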
2023-01-03  rcu: Suppress smp_processor_id() complaint in synchronize_rcu_expedited_wait()  [Paul E. McKenney]
The normal grace period's RCU CPU stall warnings are invoked from the scheduling-clock interrupt handler, and can thus invoke smp_processor_id() with impunity, which allows them to directly invoke dump_cpu_task(). In contrast, the expedited grace period's RCU CPU stall warnings are invoked from process context, which causes the dump_cpu_task() function's calls to smp_processor_id() to complain bitterly in debug kernels. This commit therefore causes synchronize_rcu_expedited_wait() to disable preemption around its call to dump_cpu_task(). Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
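The fix amounts to bracketing the dump with a preemption-disabled region; a sketch (the surrounding variable name is assumed):

    preempt_disable();      /* smp_processor_id() in dump_cpu_task() is now safe. */
    dump_cpu_task(cpu);
    preempt_enable();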
2023-01-03  rcu: Upgrade header comment for poll_state_synchronize_rcu()  [Paul E. McKenney]
This commit emphasizes the possibility of concurrent calls to synchronize_rcu() and synchronize_rcu_expedited() causing one or the other of the two grace periods to be lost from the viewpoint of poll_state_synchronize_rcu(). If you cannot afford to lose grace periods this way, you should instead use the _full() variants of the polled RCU API, for example, poll_state_synchronize_rcu_full(). Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
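For callers that cannot tolerate such a loss, typical use of the _full() API mentioned above looks like this (a usage sketch, not text from the commit):

    struct rcu_gp_oldstate rgos;

    get_state_synchronize_rcu_full(&rgos);
    /* ... time passes, possibly including call_rcu() activity ... */
    if (poll_state_synchronize_rcu_full(&rgos))
        ;   /* A full grace period has definitely elapsed since the snapshot. */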
2023-01-03  rcu: Throttle callback invocation based on number of ready callbacks  [Paul E. McKenney]
Currently, rcu_do_batch() sizes its batches based on the total number of callbacks in the callback list. This can result in some strange choices, for example, if there were 12,800 callbacks in the list, but only 200 were ready to invoke, RCU would invoke 100 at a time (12,800 shifted down by seven bits). A more measured approach would use the number that were actually ready to invoke, an approach that has become feasible only recently given the per-segment ->seglen counts in ->cblist. This commit therefore bases the batch limit on the number of callbacks ready to invoke instead of on the total number of callbacks. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-03  rcu: Consolidate initialization and CPU-hotplug code  [Paul E. McKenney]
This commit consolidates the initialization and CPU-hotplug code at the end of kernel/rcu/tree.c. This is strictly a code-motion commit. No functionality has changed. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-03  time: Fix various kernel-doc problems  [Randy Dunlap]
Clean up kernel-doc complaints about function names and non-kernel-doc comments in kernel/time/. Fixes these warnings:

  kernel/time/time.c:479: warning: expecting prototype for set_normalized_timespec(). Prototype was for set_normalized_timespec64() instead
  kernel/time/time.c:553: warning: expecting prototype for msecs_to_jiffies(). Prototype was for __msecs_to_jiffies() instead
  kernel/time/timekeeping.c:1595: warning: contents before sections
  kernel/time/timekeeping.c:1705: warning: This comment starts with '/**', but isn't a kernel-doc comment. * We have three kinds of time sources to use for sleep time
  kernel/time/timekeeping.c:1726: warning: This comment starts with '/**', but isn't a kernel-doc comment. * 1) can be determined whether to use or not only when doing
  kernel/time/tick-oneshot.c:21: warning: missing initial short description on line: * tick_program_event
  kernel/time/tick-oneshot.c:107: warning: expecting prototype for tick_check_oneshot_mode(). Prototype was for tick_oneshot_mode_active() instead

Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230103032849.12723-1-rdunlap@infradead.org
2023-01-02  kcsan: test: don't put the expect array on the stack  [Max Filippov]
Size of the 'expect' array in the __report_matches is 1536 bytes, which is exactly the default frame size warning limit of the xtensa architecture. As a result allmodconfig xtensa kernel builds with the gcc that does not support the compiler plugins (which otherwise would push the said warning limit to 2K) fail with the following message:

  kernel/kcsan/kcsan_test.c:257:1: error: the frame size of 1680 bytes is larger than 1536 bytes

Fix it by dynamically allocating the 'expect' array. Signed-off-by: Max Filippov <jcmvbkbc@gmail.com> Reviewed-by: Marco Elver <elver@google.com> Tested-by: Marco Elver <elver@google.com>
2023-01-02  sched/rseq: Fix concurrency ID handling of usermodehelper kthreads  [Mathieu Desnoyers]
sched_mm_cid_after_execve() does not expect NULL t->mm, but it may happen if a usermodehelper kthread fails when attempting to execute a binary. sched_mm_cid_fork() can be issued from a usermodehelper kthread, which has t->flags PF_KTHREAD set. Fixes: af7f588d8f73 ("sched: Introduce per-memory-map concurrency ID") Reported-by: kernel test robot <yujie.liu@intel.com> Reported-by: Borislav Petkov <bp@alien8.de> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/oe-lkp/202212301353.5c959d72-yujie.liu@intel.com
2023-01-01  Merge tag 'perf_urgent_for_v6.2_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  [Linus Torvalds]
Pull perf fixes from Borislav Petkov:

 - Pass only an initialized perf event attribute to the LSM hook
 - Fix a use-after-free on the perf syscall's error path
 - A potential integer overflow fix in amd_core_pmu_init()
 - Fix the cgroup events tracking after the context handling rewrite
 - Return the proper value from the inherit_event() function on error

* tag 'perf_urgent_for_v6.2_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/core: Call LSM hook after copying perf_event_attr
  perf: Fix use-after-free in error path
  perf/x86/amd: fix potential integer overflow on shift of a int
  perf/core: Fix cgroup events tracking
  perf core: Return error pointer if inherit_event() fails to find pmu_ctx
2023-01-01  Merge tag 'locking_urgent_for_v6.2_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  [Linus Torvalds]
Pull locking fixes from Borislav Petkov:

 - Prevent the leaking of a debug timer in futex_waitv()
 - A preempt-RT mutex locking fix, adding the proper acquire semantics

* tag 'locking_urgent_for_v6.2_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  futex: Fix futex_waitv() hrtimer debug object leak on kcalloc error
  rtmutex: Add acquire semantics for rtmutex lock acquisition slow path
2022-12-28  bpf: rename list_head -> graph_root in field info types  [Dave Marchevsky]
Many of the structs recently added to track field info for linked-list head are useful as-is for rbtree root. So let's do a mechanical renaming of list_head-related types and fields:

  include/linux/bpf.h:
    struct btf_field_list_head -> struct btf_field_graph_root
    list_head -> graph_root in struct btf_field union

  kernel/bpf/btf.c:
    list_head -> graph_root in struct btf_field_info

This is a nonfunctional change, functionality to actually use these fields for rbtree will be added in further patches. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20221217082506.1570898-5-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-28  bpf: Always use maximal size for copy_array()  [Kees Cook]
Instead of counting on prior allocations to have sized allocations to the next kmalloc bucket size, always perform a krealloc that is at least ksize(dst) in size (which is a no-op), so the size can be correctly tracked by all the various allocation size trackers (KASAN, __alloc_size, etc). Reported-by: Hyunwoo Kim <v4bel@theori.io> Link: https://lore.kernel.org/bpf/20221223094551.GA1439509@ubuntu Fixes: ceb35b666d42 ("bpf/verifier: Use kmalloc_size_roundup() to match ksize() usage") Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: Song Liu <song@kernel.org> Cc: Yonghong Song <yhs@fb.com> Cc: KP Singh <kpsingh@kernel.org> Cc: Stanislav Fomichev <sdf@google.com> Cc: Hao Luo <haoluo@google.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: bpf@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20221223182836.never.866-kees@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
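A hedged sketch of the resulting helper, reconstructed from the description (the real kernel/bpf/verifier.c code differs in details such as zero-length handling):

    /* Reallocate dst so its tracked size is never smaller than what the
     * allocator actually handed back, then copy the array contents. */
    static void *copy_array(void *dst, const void *src, size_t n,
                            size_t size, gfp_t flags)
    {
        size_t alloc_bytes = max(ksize(dst), kmalloc_size_roundup(n * size));

        dst = krealloc(dst, alloc_bytes, flags);
        if (!dst)
            return NULL;

        memcpy(dst, src, n * size);
        return dst;
    }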
2022-12-28  bpf: keep a reference to the mm, in case the task is dead.  [Kui-Feng Lee]
Fix the system crash that happens when a task iterator travels through the vmas of tasks. In task iterators, we used to access mm by following the pointer on the task_struct; however, the death of a task will clear the pointer, even though we still hold the task_struct. That can cause an unexpected NULL-pointer crash when an iterator is visiting a task that dies during the visit. Keeping a reference to the mm on the iterator ensures that we always have a valid pointer to it. Co-developed-by: Song Liu <song@kernel.org> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Kui-Feng Lee <kuifeng@meta.com> Reported-by: Nathan Slingerland <slinger@meta.com> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20221216221855.4122288-2-kuifeng@meta.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-28  bpf: Fix panic due to wrong pageattr of im->image  [Chuang Wang]
In the scenario where livepatch and kretfunc coexist, the pageattr of im->image is rox after arch_prepare_bpf_trampoline in bpf_trampoline_update, and then modify_fentry or register_fentry returns -EAGAIN from bpf_tramp_ftrace_ops_func, the BPF_TRAMP_F_ORIG_STACK flag will be configured, and arch_prepare_bpf_trampoline will be re-executed. At this time, because the pageattr of im->image is rox, arch_prepare_bpf_trampoline will read and write im->image, which causes a fault. as follows: insmod livepatch-sample.ko # samples/livepatch/livepatch-sample.c bpftrace -e 'kretfunc:cmdline_proc_show {}' BUG: unable to handle page fault for address: ffffffffa0206000 PGD 322d067 P4D 322d067 PUD 322e063 PMD 1297e067 PTE d428061 Oops: 0003 [#1] PREEMPT SMP PTI CPU: 2 PID: 270 Comm: bpftrace Tainted: G E K 6.1.0 #5 RIP: 0010:arch_prepare_bpf_trampoline+0xed/0x8c0 RSP: 0018:ffffc90001083ad8 EFLAGS: 00010202 RAX: ffffffffa0206000 RBX: 0000000000000020 RCX: 0000000000000000 RDX: ffffffffa0206001 RSI: ffffffffa0206000 RDI: 0000000000000030 RBP: ffffc90001083b70 R08: 0000000000000066 R09: ffff88800f51b400 R10: 000000002e72c6e5 R11: 00000000d0a15080 R12: ffff8880110a68c8 R13: 0000000000000000 R14: ffff88800f51b400 R15: ffffffff814fec10 FS: 00007f87bc0dc780(0000) GS:ffff88803e600000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffffffa0206000 CR3: 0000000010b70000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> bpf_trampoline_update+0x25a/0x6b0 __bpf_trampoline_link_prog+0x101/0x240 bpf_trampoline_link_prog+0x2d/0x50 bpf_tracing_prog_attach+0x24c/0x530 bpf_raw_tp_link_attach+0x73/0x1d0 __sys_bpf+0x100e/0x2570 __x64_sys_bpf+0x1c/0x30 do_syscall_64+0x5b/0x80 entry_SYSCALL_64_after_hwframe+0x63/0xcd With this patch, when modify_fentry or register_fentry returns -EAGAIN from bpf_tramp_ftrace_ops_func, the pageattr of im->image will be reset to nx+rw. Cc: stable@vger.kernel.org Fixes: 00963a2e75a8 ("bpf: Support bpf_trampoline on functions with IPMODIFY (e.g. livepatch)") Signed-off-by: Chuang Wang <nashuiliang@gmail.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20221224133146.780578-1-nashuiliang@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
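The fix described above boils down to flipping the page attributes back before the retry; a sketch of that error path, assuming the usual set_memory_*() helpers and the existing im->image layout:

    if (err == -EAGAIN) {
        /* im->image was already made read-only and executable, so make it
         * writable and non-executable again before regenerating it. */
        set_memory_nx((long)im->image, 1);
        set_memory_rw((long)im->image, 1);
    }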
2022-12-27  bpf: fix regs_exact() logic in regsafe() to remap IDs correctly  [Andrii Nakryiko]
Comparing IDs exactly between two separate states is not just suboptimal, but also incorrect in some cases. So update regs_exact() check to do byte-by-byte memcmp() only up to id/ref_obj_id. For id and ref_obj_id perform proper check_ids() checks, taking into account idmap. This change makes more states equivalent improving insns and states stats across a bunch of selftest BPF programs: File Program Insns (A) Insns (B) Insns (DIFF) States (A) States (B) States (DIFF) ------------------------------------------- -------------------------------- --------- --------- -------------- ---------- ---------- ------------- cgrp_kfunc_success.bpf.linked1.o test_cgrp_get_release 141 137 -4 (-2.84%) 13 13 +0 (+0.00%) cgrp_kfunc_success.bpf.linked1.o test_cgrp_xchg_release 142 139 -3 (-2.11%) 14 13 -1 (-7.14%) connect6_prog.bpf.linked1.o connect_v6_prog 139 102 -37 (-26.62%) 9 6 -3 (-33.33%) ima.bpf.linked1.o bprm_creds_for_exec 68 61 -7 (-10.29%) 6 5 -1 (-16.67%) linked_list.bpf.linked1.o global_list_in_list 569 499 -70 (-12.30%) 60 52 -8 (-13.33%) linked_list.bpf.linked1.o global_list_push_pop 167 150 -17 (-10.18%) 18 16 -2 (-11.11%) linked_list.bpf.linked1.o global_list_push_pop_multiple 881 815 -66 (-7.49%) 74 63 -11 (-14.86%) linked_list.bpf.linked1.o inner_map_list_in_list 579 534 -45 (-7.77%) 61 55 -6 (-9.84%) linked_list.bpf.linked1.o inner_map_list_push_pop 190 181 -9 (-4.74%) 19 18 -1 (-5.26%) linked_list.bpf.linked1.o inner_map_list_push_pop_multiple 916 850 -66 (-7.21%) 75 64 -11 (-14.67%) linked_list.bpf.linked1.o map_list_in_list 588 525 -63 (-10.71%) 62 55 -7 (-11.29%) linked_list.bpf.linked1.o map_list_push_pop 183 174 -9 (-4.92%) 18 17 -1 (-5.56%) linked_list.bpf.linked1.o map_list_push_pop_multiple 909 843 -66 (-7.26%) 75 64 -11 (-14.67%) map_kptr.bpf.linked1.o test_map_kptr 264 256 -8 (-3.03%) 26 26 +0 (+0.00%) map_kptr.bpf.linked1.o test_map_kptr_ref 95 91 -4 (-4.21%) 9 8 -1 (-11.11%) task_kfunc_success.bpf.linked1.o test_task_xchg_release 139 136 -3 (-2.16%) 14 13 -1 (-7.14%) test_bpf_nf.bpf.linked1.o nf_skb_ct_test 815 509 -306 (-37.55%) 57 30 -27 (-47.37%) test_bpf_nf.bpf.linked1.o nf_xdp_ct_test 815 509 -306 (-37.55%) 57 30 -27 (-47.37%) test_cls_redirect.bpf.linked1.o cls_redirect 78925 78390 -535 (-0.68%) 4782 4704 -78 (-1.63%) test_cls_redirect_subprogs.bpf.linked1.o cls_redirect 64901 63897 -1004 (-1.55%) 4612 4470 -142 (-3.08%) test_sk_lookup.bpf.linked1.o access_ctx_sk 181 95 -86 (-47.51%) 19 10 -9 (-47.37%) test_sk_lookup.bpf.linked1.o ctx_narrow_access 447 437 -10 (-2.24%) 38 37 -1 (-2.63%) test_sk_lookup_kern.bpf.linked1.o sk_lookup_success 148 133 -15 (-10.14%) 14 12 -2 (-14.29%) test_tcp_check_syncookie_kern.bpf.linked1.o check_syncookie_clsact 304 300 -4 (-1.32%) 23 22 -1 (-4.35%) test_tcp_check_syncookie_kern.bpf.linked1.o check_syncookie_xdp 304 300 -4 (-1.32%) 23 22 -1 (-4.35%) test_verify_pkcs7_sig.bpf.linked1.o bpf 87 76 -11 (-12.64%) 7 6 -1 (-14.29%) ------------------------------------------- -------------------------------- --------- --------- -------------- ---------- ---------- ------------- Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20221223054921.958283-7-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
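A hedged sketch of the regs_exact() shape this and the neighboring patches describe, assuming the field reordering from the 'bpf: reorganize struct bpf_reg_state fields' commit below and the verifier's existing check_ids() helper:

    static bool regs_exact(const struct bpf_reg_state *rold,
                           const struct bpf_reg_state *rcur,
                           struct bpf_id_pair *idmap)
    {
        /* Byte-compare everything up to id/ref_obj_id, then match the
         * IDs through the idmap rather than by absolute value. */
        return memcmp(rold, rcur, offsetof(struct bpf_reg_state, id)) == 0 &&
               check_ids(rold->id, rcur->id, idmap) &&
               check_ids(rold->ref_obj_id, rcur->ref_obj_id, idmap);
    }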
2022-12-27  bpf: perform byte-by-byte comparison only when necessary in regsafe()  [Andrii Nakryiko]
Extract byte-by-byte comparison of bpf_reg_state in regsafe() into a helper function, which makes it more convenient to use it "on demand" only for registers that benefit from such checks, instead of doing it all the time, even if result of such comparison is ignored. Also, remove WARN_ON_ONCE(1)+return false dead code. There is no risk of missing some case as compiler will warn about non-void function not returning value in some branches (and that under assumption that default case is removed in the future). Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20221223054921.958283-6-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-27  bpf: reject non-exact register type matches in regsafe()  [Andrii Nakryiko]
Generalize the (somewhat implicit) rule of regsafe(), which states that if register types in old and current states do not match *exactly*, they can't be safely considered equivalent. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20221223054921.958283-5-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-27  bpf: generalize MAYBE_NULL vs non-MAYBE_NULL rule  [Andrii Nakryiko]
Make generic check to prevent XXX_OR_NULL and XXX register types to be intermixed. While technically in some situations it could be safe, it's impossible to enforce due to the loss of an ID when converting XXX_OR_NULL to its non-NULL variant. So prevent this in general, not just for PTR_TO_MAP_KEY and PTR_TO_MAP_VALUE. PTR_TO_MAP_KEY_OR_NULL and PTR_TO_MAP_VALUE_OR_NULL checks, which were previously special-cased, are simplified to generic check that takes into account range_within() and tnum_in(). This is correct as BPF verifier doesn't allow arithmetic on XXX_OR_NULL register types, so var_off and ranges should stay zero. But even if in the future this restriction is lifted, it's even more important to enforce that var_off and ranges are compatible, otherwise it's possible to construct case where this can be exploited to bypass verifier's memory range safety checks. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20221223054921.958283-4-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-27  bpf: reorganize struct bpf_reg_state fields  [Andrii Nakryiko]
Move the id and ref_obj_id fields after the scalar data section (var_off and ranges). This is necessary to simplify the next patch, which will change regsafe()'s logic to be safer, as it makes the contents that have to be an exact match (type-specific parts, off, type, and var_off+ranges) a single sequential block of memory, while id and ref_obj_id should always be remapped and thus can't be memcpy()'ed. There are a few places that assume that var_off is after id/ref_obj_id in order to clear out id/ref_obj_id with a single memset(0). These are changed to explicitly zero out the id/ref_obj_id fields. Other places are adjusted to preserve exact byte-by-byte comparison behavior. No functional changes. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20221223054921.958283-3-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-27  bpf: teach refsafe() to take into account ID remapping  [Andrii Nakryiko]
The states_equal() check performs ID mapping between old and new states to establish a 1-to-1 correspondence between IDs, even if their absolute numeric values differ across two equivalent states. This is important both for correctness and to avoid unnecessary work when two states are equivalent. With recent changes we partially fixed this logic by maintaining the ID map across all function frames. This patch also makes the refsafe() check take into account (and maintain) the ID map, making states_equal() behavior more optimal and correct. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20221223054921.958283-2-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-12-27  cputime: remove cputime_to_nsecs fallback  [Nicholas Piggin]
The archs that use cputime_to_nsecs() internally provide their own definition and don't need the fallback. cputime_to_usecs() is unused except in this fallback, and is not defined anywhere. This removes the final remnant of the cputime_t code from the kernel. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Alexander Gordeev <agordeev@linux.ibm.com> Link: https://lore.kernel.org/r/20221220070705.2958959-1-npiggin@gmail.com
2022-12-27  sched/core: Adjust the order of scanning CPUs  [Hao Jia]
When select_idle_capacity() starts scanning for an idle CPU, it starts with the target CPU that has already been checked in select_idle_sibling(). So we start checking from the next CPU and try the target CPU at the end. Similarly for task_numa_assign(), we have just checked numa_migrate_on of dst_cpu, so start from the next CPU. This also works for steal_cookie_task(): the first scan must fail, so start directly from the next one. Signed-off-by: Hao Jia <jiahao.os@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Acked-by: Mel Gorman <mgorman@techsingularity.net> Link: https://lore.kernel.org/r/20221216062406.7812-3-jiahao.os@bytedance.com
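The scan-from-the-next-CPU idiom relies on the kernel's existing wrapping iterator; a hedged illustration (not the literal diff, and the idle test shown is simplified):

    /* The target CPU was already tested by the caller, so resume the scan
     * just past it and let the iteration wrap around to it last. */
    for_each_cpu_wrap(cpu, cpus, target + 1) {
        if (available_idle_cpu(cpu))
            return cpu;
    }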
2022-12-27  sched/numa: Stop an exhaustive search if an idle core is found  [Hao Jia]
In update_numa_stats() we try to find an idle CPU on the NUMA node, preferably an idle core. We can stop looking for the next idle core or idle CPU after finding an idle core. But we can't stop the whole loop of scanning the CPUs, because we need to calculate approximate NUMA stats at a point in time. For example, the src and dst nr_running is needed by task_numa_find_cpu(). Signed-off-by: Hao Jia <jiahao.os@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mel Gorman <mgorman@techsingularity.net> Link: https://lore.kernel.org/r/20221216062406.7812-2-jiahao.os@bytedance.com
2022-12-27  sched: Make const-safe  [Matthew Wilcox (Oracle)]
With a modified container_of() that preserves constness, the compiler finds some pointers which should have been marked as const. task_of() also needs to become const-preserving for the !FAIR_GROUP_SCHED case so that cfs_rq_of() can take a const argument. No change to generated code. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20221212144946.2657785-1-willy@infradead.org
2022-12-27  rseq: Extend struct rseq with per-memory-map concurrency ID  [Mathieu Desnoyers]
If a memory map has fewer threads than there are cores on the system, or is limited to run on few cores concurrently through sched affinity or cgroup cpusets, the concurrency IDs will be values close to 0, thus allowing efficient use of user-space memory for per-cpu data structures. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20221122203932.231377-9-mathieu.desnoyers@efficios.com
2022-12-27  sched: Introduce per-memory-map concurrency ID  [Mathieu Desnoyers]
This feature allows the scheduler to expose a per-memory map concurrency ID to user-space. This concurrency ID is within the possible cpus range, and is temporarily (and uniquely) assigned while threads are actively running within a memory map. If a memory map has fewer threads than cores, or is limited to run on few cores concurrently through sched affinity or cgroup cpusets, the concurrency IDs will be values close to 0, thus allowing efficient use of user-space memory for per-cpu data structures. This feature is meant to be exposed by a new rseq thread area field. The primary purpose of this feature is to do the heavy-lifting needed by memory allocators to allow them to use per-cpu data structures efficiently in the following situations: - Single-threaded applications, - Multi-threaded applications on large systems (many cores) with limited cpu affinity mask, - Multi-threaded applications on large systems (many cores) with restricted cgroup cpuset per container. One of the key concern from scheduler maintainers is the overhead associated with additional spin locks or atomic operations in the scheduler fast-path. This is why the following optimization is implemented. On context switch between threads belonging to the same memory map, transfer the mm_cid from prev to next without any atomic ops. This takes care of use-cases involving frequent context switch between threads belonging to the same memory map. Additional optimizations can be done if the spin locks added when context switching between threads belonging to different memory maps end up being a performance bottleneck. Those are left out of this patch though. A performance impact would have to be clearly demonstrated to justify the added complexity. The credit goes to Paul Turner (Google) for the original virtual cpu id idea. This feature is implemented based on the discussions with Paul Turner and Peter Oskolkov (Google), but I took the liberty to implement scheduler fast-path optimizations and my own NUMA-awareness scheme. The rumor has it that Google have been running a rseq vcpu_id extension internally in production for a year. The tcmalloc source code indeed has comments hinting at a vcpu_id prototype extension to the rseq system call [1]. The following benchmarks do not show any significant overhead added to the scheduler context switch by this feature: * perf bench sched messaging (process) Baseline: 86.5±0.3 ms With mm_cid: 86.7±2.6 ms * perf bench sched messaging (threaded) Baseline: 84.3±3.0 ms With mm_cid: 84.7±2.6 ms * hackbench (process) Baseline: 82.9±2.7 ms With mm_cid: 82.9±2.9 ms * hackbench (threaded) Baseline: 85.2±2.6 ms With mm_cid: 84.4±2.9 ms [1] https://github.com/google/tcmalloc/blob/master/tcmalloc/internal/linux_syscall_support.h#L26 Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20221122203932.231377-8-mathieu.desnoyers@efficios.com
2022-12-27  rseq: Extend struct rseq with numa node id  [Mathieu Desnoyers]
Adding the NUMA node id to struct rseq is a straightforward thing to do, and a good way to figure out if anything in the user-space ecosystem prevents extending struct rseq. This NUMA node id field allows memory allocators such as tcmalloc to take advantage of fast access to the current NUMA node id to perform NUMA-aware memory allocation. It can also be useful for implementing fast-paths for NUMA-aware user-space mutexes. It also allows implementing getcpu(2) purely in user-space. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20221122203932.231377-5-mathieu.desnoyers@efficios.com
2022-12-27  rseq: Introduce extensible rseq ABI  [Mathieu Desnoyers]
Introduce the extensible rseq ABI, where the feature size supported by the kernel and the required alignment are communicated to user-space through ELF auxiliary vectors. This allows user-space to call rseq registration with a rseq_len of either 32 bytes for the original struct rseq size (which includes padding), or larger. If rseq_len is larger than 32 bytes, then it must be large enough to contain the feature size communicated to user-space through ELF auxiliary vectors. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20221122203932.231377-4-mathieu.desnoyers@efficios.com
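On the user-space side, the auxiliary-vector entries introduced by this series are read with getauxval(); a hedged sketch (the AT_RSEQ_* names are the ones this series adds, and a return value of 0 indicates a kernel without the extensible ABI):

    #include <sys/auxv.h>

    unsigned long rseq_feature_size = getauxval(AT_RSEQ_FEATURE_SIZE);
    unsigned long rseq_align        = getauxval(AT_RSEQ_ALIGN);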
2022-12-27  sched: Async unthrottling for cfs bandwidth  [Josh Don]
CFS bandwidth currently distributes new runtime and unthrottles cfs_rq's inline in an hrtimer callback. Runtime distribution is a per-cpu operation, and unthrottling is a per-cgroup operation, since a tg walk is required. On machines with a large number of cpus and large cgroup hierarchies, this cpus*cgroups work can be too much to do in a single hrtimer callback: since IRQs are disabled, hard lockups may easily occur. Specifically, we've found this scalability issue on configurations with 256 cpus, O(1000) cgroups in the hierarchy being throttled, and high memory bandwidth usage. To fix this, we can instead unthrottle cfs_rq's asynchronously via a CSD. Each cpu is responsible for unthrottling itself, thus sharding the total work more fairly across the system, and avoiding hard lockups. Signed-off-by: Josh Don <joshdon@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20221117005418.3499691-1-joshdon@google.com
2022-12-27  sched/topology: Add __init for init_defrootdomain  [Bing Huang]
init_defrootdomain is only used in initialization Signed-off-by: Bing Huang <huangbing@kylinos.cn> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20221118034208.267330-1-huangbing775@126.com
2022-12-27  futex: Fix futex_waitv() hrtimer debug object leak on kcalloc error  [Mathieu Desnoyers]
In a scenario where kcalloc() fails to allocate memory, the futex_waitv system call immediately returns -ENOMEM without invoking destroy_hrtimer_on_stack(). When CONFIG_DEBUG_OBJECTS_TIMERS=y, this results in leaking a timer debug object. Fixes: bf69bad38cf6 ("futex: Implement sys_futex_waitv()") Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Cc: stable@vger.kernel.org Cc: stable@vger.kernel.org # v5.16+ Link: https://lore.kernel.org/r/20221214222008.200393-1-mathieu.desnoyers@efficios.com
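A hedged sketch of the corrected error path; the label and variable names are assumed for illustration rather than quoted from the patch:

    futexv = kcalloc(nr_futexes, sizeof(*futexv), GFP_KERNEL);
    if (!futexv) {
        ret = -ENOMEM;
        goto destroy_timer;     /* was: return -ENOMEM, leaking the debug object */
    }

    /* ... */

destroy_timer:
    if (timeout)
        destroy_hrtimer_on_stack(&to.timer);
    return ret;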