path: root/kernel
2024-03-12  Merge tag 'audit-pr-20240312' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit  (Linus Torvalds)

    Pull audit updates from Paul Moore:
     "Two small audit patches:

      - Use the KMEM_CACHE() macro instead of kmem_cache_create(). The
        guidance appears to be to use the KMEM_CACHE() macro when possible,
        and there is no reason why we can't use the macro, so let's use it.

      - Remove an unnecessary assignment in audit_dupe_lsm_field(). A
        return value variable was assigned a value in its declaration, but
        the declaration value is overwritten before the return value
        variable is ever referenced; drop the assignment at declaration
        time."

    * tag 'audit-pr-20240312' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
      audit: use KMEM_CACHE() instead of kmem_cache_create()
      audit: remove unnecessary assignment in audit_dupe_lsm_field()
2024-03-12  Merge tag 'hardening-v6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux  (Linus Torvalds)

    Pull hardening updates from Kees Cook:
     "As is pretty normal for this tree, there are changes all over the
      place, especially for small fixes, selftest improvements, and
      improved macro usability.

      Some header changes ended up landing via this tree as they depended
      on the string header cleanups. Also, a notable set of changes is the
      work for the reintroduction of the UBSAN signed integer overflow
      sanitizer, so that we can continue to make improvements on the
      compiler side to make this sanitizer a more viable future security
      hardening option.

      Summary:

      - string.h and related header cleanups (Tanzir Hasan, Andy Shevchenko)
      - VMCI memcpy() usage and struct_size() cleanups (Vasiliy Kovalev, Harshit Mogalapalli)
      - selftests/powerpc: Fix load_unaligned_zeropad build failure (Michael Ellerman)
      - hardened Kconfig fragment updates (Marco Elver, Lukas Bulwahn)
      - Handle tail call optimization better in LKDTM (Douglas Anderson)
      - Use long form types in overflow.h (Andy Shevchenko)
      - Add flags param to string_get_size() (Andy Shevchenko)
      - Add Coccinelle script for potential struct_size() use (Jacob Keller)
      - Fix objtool corner case under KCFI (Josh Poimboeuf)
      - Drop 13 year old backward compat CAP_SYS_ADMIN check (Jingzi Meng)
      - Add str_plural() helper (Michal Wajdeczko, Kees Cook)
      - Ignore relocations in .notes section
      - Add comments to explain how __is_constexpr() works
      - Fix m68k stack alignment expectations in stackinit KUnit test
      - Convert string selftests to KUnit
      - Add KUnit tests for fortified string functions
      - Improve reporting during fortified string warnings
      - Allow non-type arg to type_max() and type_min()
      - Allow strscpy() to be called with only 2 arguments
      - Add binary mode to leaking_addresses scanner
      - Various small cleanups to leaking_addresses scanner
      - Add wrapping_*() arithmetic helpers
      - Annotate initial signed integer wrap-around in refcount_t
      - Add explicit UBSAN section to MAINTAINERS
      - Fix UBSAN self-test warnings
      - Simplify UBSAN build via removal of CONFIG_UBSAN_SANITIZE_ALL
      - Reintroduce UBSAN's signed overflow sanitizer"

    * tag 'hardening-v6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (51 commits)
      selftests/powerpc: Fix load_unaligned_zeropad build failure
      string: Convert helpers selftest to KUnit
      string: Convert selftest to KUnit
      sh: Fix build with CONFIG_UBSAN=y
      compiler.h: Explain how __is_constexpr() works
      overflow: Allow non-type arg to type_max() and type_min()
      VMCI: Fix possible memcpy() run-time warning in vmci_datagram_invoke_guest_handler()
      lib/string_helpers: Add flags param to string_get_size()
      x86, relocs: Ignore relocations in .notes section
      objtool: Fix UNWIND_HINT_{SAVE,RESTORE} across basic blocks
      overflow: Use POD in check_shl_overflow()
      lib: stackinit: Adjust target string to 8 bytes for m68k
      sparc: vdso: Disable UBSAN instrumentation
      kernel.h: Move lib/cmdline.c prototypes to string.h
      leaking_addresses: Provide mechanism to scan binary files
      leaking_addresses: Ignore input device status lines
      leaking_addresses: Use File::Temp for /tmp files
      MAINTAINERS: Update LEAKING_ADDRESSES details
      fortify: Improve buffer overflow reporting
      fortify: Add KUnit tests for runtime overflows
      ...
2024-03-12  watchdog/core: remove sysctl handlers from public header  (Thomas Weißschuh)

    The functions are only used in the file where they are defined. Remove
    them from the header and make them static. Also guard proc_soft_watchdog
    with a #define guard, as it is not used otherwise.

    Link: https://lkml.kernel.org/r/20240306-const-sysctl-prep-watchdog-v1-1-bd45da3a41cf@weissschuh.net
    Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-12  tracing/ring-buffer: Fix wait_on_pipe() race  (Steven Rostedt (Google))

    When the trace_pipe_raw file is closed, there should be no new readers
    on the file descriptor. This is mostly handled with the waking and
    wait_index fields of the iterator. But there's still a slight race:

         CPU 0                           CPU 1
         -----                           -----
                                         wait_index++;
         index = wait_index;
                                         ring_buffer_wake_waiters();
         wait_on_pipe()
            ring_buffer_wait();

    The ring_buffer_wait() will miss the wakeup from CPU 1. The problem is
    that ring_buffer_wait() needs the logic of:

         prepare_to_wait();
         if (!condition)
                 schedule();

    where the missing condition check is the iter->wait_index update.

    Have ring_buffer_wait() take a conditional callback function and a data
    parameter that can be used within the wait_event_interruptible() of the
    ring_buffer_wait() function. In wait_on_pipe(), pass a condition
    function that checks if the wait_index has been updated; if it has, it
    returns true to break out of the wait_event_interruptible() loop.

    Create a new field "closed" in the trace_iterator and set it in the
    .flush() callback before calling ring_buffer_wake_waiters(). This will
    keep any new readers from waiting on a closed file descriptor. Have the
    wait_on_pipe() condition callback also check the closed field.

    Change the wait_index field of the trace_iterator to atomic_t. There's
    no reason it needs to be 'long'; making it atomic and using
    atomic_read_acquire() and atomic_fetch_inc_release() will provide the
    necessary memory barriers.

    Add a "woken" flag to tracing_buffers_splice_read() to exit the loop
    after one more try to fetch data. That is, if it waited for data and
    something woke it up, it should try to collect any new data and then
    exit back to user space.

    Link: https://lore.kernel.org/linux-trace-kernel/CAHk-=wgsNgewHFxZAJiAQznwPMqEtQmi1waeS2O1v6L4c_Um5A@mail.gmail.com/
    Link: https://lore.kernel.org/linux-trace-kernel/20240312121703.557950713@goodmis.org
    Cc: stable@vger.kernel.org
    Cc: Masami Hiramatsu <mhiramat@kernel.org>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: linke li <lilinke99@qq.com>
    Cc: Rabin Vincent <rabin@rab.in>
    Fixes: f3ddb74ad0790 ("tracing: Wake up ring buffer waiters on closing of the file")
    Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
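    [Editorial illustration] A minimal sketch of the condition-callback
    shape described above, with simplified names and types; this is
    illustrative, not the verbatim kernel code:

        /* Sketch only; assumes the cond/data callback pair described above. */
        typedef bool (*ring_buffer_cond_fn)(void *data);

        struct wait_pipe_cond {
                struct trace_iterator *iter;
                int wait_index;         /* snapshot taken before waiting */
        };

        static bool wait_pipe_cond(void *data)
        {
                struct wait_pipe_cond *c = data;

                /* Stop waiting if the file was closed ... */
                if (c->iter->closed)
                        return true;
                /* ... or if a waker bumped wait_index since the snapshot. */
                return c->wait_index != atomic_read_acquire(&c->iter->wait_index);
        }

    Because the callback is evaluated inside the wait_event_interruptible()
    condition, a wait_index update between the snapshot and the sleep can
    no longer be missed.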
2024-03-12  ring-buffer: Use wait_event_interruptible() in ring_buffer_wait()  (Steven Rostedt (Google))

    Convert ring_buffer_wait() over to wait_event_interruptible(). The
    default condition is to execute the wait loop inside __wait_event()
    just once.

    This does not change the ring_buffer_wait() prototype yet, but
    restructures the code so that it can take a "cond" and "data" parameter
    and will call wait_event_interruptible() with a helper function as the
    condition.

    The helper function (rb_wait_cond) takes the cond function and data
    parameters. It first checks if the buffer hit the watermark defined by
    the "full" parameter and then calls the passed-in condition parameter.
    If either is true, it returns true.

    If rb_wait_cond() does not return true, it sets the appropriate
    "waiters_pending" flag and returns false.

    Link: https://lore.kernel.org/linux-trace-kernel/CAHk-=wgsNgewHFxZAJiAQznwPMqEtQmi1waeS2O1v6L4c_Um5A@mail.gmail.com/
    Link: https://lore.kernel.org/linux-trace-kernel/20240312121703.399598519@goodmis.org
    Cc: stable@vger.kernel.org
    Cc: Masami Hiramatsu <mhiramat@kernel.org>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: linke li <lilinke99@qq.com>
    Cc: Rabin Vincent <rabin@rab.in>
    Fixes: f3ddb74ad0790 ("tracing: Wake up ring buffer waiters on closing of the file")
    Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
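    [Editorial illustration] A sketch of the helper's shape as described
    above, reusing the ring_buffer_cond_fn type from the previous sketch;
    parameters are simplified and this is not the verbatim function:

        /* Sketch only: used as the condition of wait_event_interruptible(). */
        static bool rb_wait_cond(struct rb_irq_work *rbwork,
                                 struct trace_buffer *buffer,
                                 int cpu, int full,
                                 ring_buffer_cond_fn cond, void *data)
        {
                if (rb_watermark_hit(buffer, cpu, full))
                        return true;

                if (cond && cond(data))
                        return true;

                /* Not ready: tell the writer side a waiter is pending. */
                if (full)
                        rbwork->full_waiters_pending = true;
                else
                        rbwork->waiters_pending = true;

                return false;
        }

    The wait itself then becomes, roughly,
    wait_event_interruptible(rbwork->waiters,
                             rb_wait_cond(rbwork, buffer, cpu, full, cond, data)).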
2024-03-12  ring-buffer: Reuse rb_watermark_hit() for the poll logic  (Steven Rostedt (Google))

    The check for knowing if the poll should wait or not is basically the
    exact same logic as rb_watermark_hit(). The only difference is that
    rb_watermark_hit() also handles the !full case. But for the full case,
    the logic is the same. Just call that instead of duplicating the code
    in ring_buffer_poll_wait().

    Link: https://lore.kernel.org/linux-trace-kernel/20240312131952.802267543@goodmis.org
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
    Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-03-12  ring-buffer: Fix full_waiters_pending in poll  (Steven Rostedt (Google))

    If a reader of the ring buffer is doing a poll, and waiting for the
    ring buffer to hit a specific watermark, there could be a case where it
    gets into an infinite ping-pong loop.

    The poll code has:

        rbwork->full_waiters_pending = true;
        if (!cpu_buffer->shortest_full ||
            cpu_buffer->shortest_full > full)
                cpu_buffer->shortest_full = full;

    The writer will see full_waiters_pending and check if the ring buffer
    is filled over the percentage of the shortest_full value. If it is, it
    calls an irq_work to wake up all the waiters.

    But the code could get into a circular loop:

        CPU 0                                   CPU 1
        -----                                   -----
        [ Poll ]
          [ shortest_full = 0 ]
        rbwork->full_waiters_pending = true;
                                                if (rbwork->full_waiters_pending &&
                                                    [ buffer percent ] > shortest_full) {
                                                       rbwork->wakeup_full = true;
                                                       [ queue_irqwork ]

        cpu_buffer->shortest_full = full;

                                                [ IRQ work ]
                                                if (rbwork->wakeup_full) {
                                                      cpu_buffer->shortest_full = 0;
                                                      wakeup poll waiters;
        [woken]
        if ([ buffer percent ] > full)
           break;
        rbwork->full_waiters_pending = true;

                                                if (rbwork->full_waiters_pending &&
                                                    [ buffer percent ] > shortest_full) {
                                                       rbwork->wakeup_full = true;
                                                       [ queue_irqwork ]

        cpu_buffer->shortest_full = full;

                                                [ IRQ work ]
                                                if (rbwork->wakeup_full) {
                                                      cpu_buffer->shortest_full = 0;
                                                      wakeup poll waiters;
        [woken]

        [ Wash, rinse, repeat! ]

    In the poll, shortest_full needs to be set before full_waiters_pending:
    once that flag is set, the writer will compare the current
    shortest_full (which is incorrect) to decide to call the irq_work,
    which will reset shortest_full (expecting the readers to update it).

    Also move the setting of full_waiters_pending after the check if the
    ring buffer has the required percentage filled. There's no reason to
    tell the writer to wake up waiters if there are no waiters.

    Link: https://lore.kernel.org/linux-trace-kernel/20240312131952.630922155@goodmis.org
    Cc: stable@vger.kernel.org
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Fixes: 42fb0a1e84ff5 ("tracing/ring-buffer: Have polling block on watermark")
    Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
    Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
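    [Editorial illustration] A sketch of the corrected poll-side ordering,
    assuming the rb_watermark_hit() helper reused in the previous patch;
    simplified, not the verbatim kernel code:

        /* Sketch only: ring_buffer_poll_wait(), 'full' case. */
        if (full) {
                poll_wait(filp, &rbwork->full_waiters, poll_table);

                /*
                 * rb_watermark_hit() records shortest_full under the
                 * reader lock when the buffer has not yet hit the target.
                 */
                if (rb_watermark_hit(buffer, cpu, full))
                        return EPOLLIN | EPOLLRDNORM;

                /* Advertise a waiter only once we know we must wait. */
                rbwork->full_waiters_pending = true;
                return 0;
        }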
2024-03-12  ring-buffer: Do not set shortest_full when full target is hit  (Steven Rostedt (Google))

    The rb_watermark_hit() function checks if the amount of data in the
    ring buffer is above the percentage level passed in by the "full"
    variable. If it is, it returns true.

    But it also sets the "shortest_full" field of the cpu_buffer, which
    informs writers that they need to call the irq_work when the amount of
    data in the ring buffer rises above the requested amount.

    rb_watermark_hit() always sets shortest_full, even when the ring buffer
    already holds the requested amount. As the caller is not going to wait
    in that case, because it already has what it wants, there's no reason
    to set shortest_full.

    Link: https://lore.kernel.org/linux-trace-kernel/20240312115641.6aa8ba08@gandalf.local.home
    Cc: stable@vger.kernel.org
    Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
    Fixes: 42fb0a1e84ff5 ("tracing/ring-buffer: Have polling block on watermark")
    Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
    Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
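    [Editorial illustration] In code, the fix amounts to making the
    shortest_full update conditional on the watermark not being hit. A
    sketch of the locked section, with simplified surroundings (not the
    verbatim function):

        raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
        pagebusy = cpu_buffer->reader_page == cpu_buffer->commit_page;
        ret = !pagebusy && full_hit(buffer, cpu, full);

        /* Only record shortest_full if the caller will actually wait. */
        if (!ret && (!cpu_buffer->shortest_full ||
                     cpu_buffer->shortest_full > full))
                cpu_buffer->shortest_full = full;
        raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);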
2024-03-12  entry: Respect changes to system call number by trace_sys_enter()  (André Rösti)

    When a probe is registered at the trace_sys_enter() tracepoint, and
    that probe changes the system call number, the old system call still
    gets executed. This worked correctly until commit b6ec41346103
    ("core/entry: Report syscall correctly for trace and audit"), which
    removed the re-evaluation of the syscall number after the trace point.

    Restore the original semantics by re-evaluating the system call number
    after trace_sys_enter().

    The performance impact of this re-evaluation is minimal because it only
    takes place when a trace point is active, and compared to the actual
    trace point overhead the read from a cache-hot variable is negligible.

    Fixes: b6ec41346103 ("core/entry: Report syscall correctly for trace and audit")
    Signed-off-by: André Rösti <an.roesti@gmail.com>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Link: https://lore.kernel.org/r/20240311211704.7262-1-an.roesti@gmail.com
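    [Editorial illustration] A minimal sketch of what the restored
    re-evaluation looks like in the generic entry path; simplified, not
    the verbatim patch:

        if (work & SYSCALL_WORK_SYSCALL_TRACEPOINT) {
                trace_sys_enter(regs, syscall);
                /*
                 * A registered probe may have changed the system call
                 * number; re-read it so the new one gets executed.
                 */
                syscall = syscall_get_nr(current, regs);
        }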
2024-03-12  sched/balancing: Fix a couple of outdated function names in comments  (Ingo Molnar)

    The 'idle_balance()' function hasn't existed for years, and there's no
    load_balance_newidle() either - both are sched_balance_newidle() today.

    Reported-by: Honglei Wang <jameshongleiwang@126.com>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Link: https://lore.kernel.org/r/ZfAwNufbiyt/5biu@gmail.com
2024-03-12  sched/balancing: Rename find_idlest_cpu() => sched_balance_find_dst_cpu()  (Ingo Molnar)

    Standardize scheduler load-balancing function names on the
    sched_balance_() prefix.

    Also use 'dst' instead of 'idlest', because it's not really true that
    we return the 'idlest' group or CPU: we sort by idle-exit latency and
    only return the idlest CPUs from the lowest-latency set of CPUs. The
    true 'idlest' CPUs often remain idle for a long time and are never
    returned as long as the system is under-loaded.

    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
    Link: https://lore.kernel.org/r/20240308111819.1101550-14-mingo@kernel.org
2024-03-12  sched/balancing: Rename find_idlest_group() => sched_balance_find_dst_group()  (Ingo Molnar)

    Standardize scheduler load-balancing function names on the
    sched_balance_() prefix.

    Also use 'dst' instead of 'idlest', because it's not really true that
    we return the 'idlest' group or CPU: we sort by idle-exit latency and
    only return the idlest CPUs from the lowest-latency set of CPUs. The
    true 'idlest' CPUs often remain idle for a long time and are never
    returned as long as the system is under-loaded.

    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
    Link: https://lore.kernel.org/r/20240308111819.1101550-13-mingo@kernel.org
2024-03-12  sched/balancing: Rename find_idlest_group_cpu() => sched_balance_find_dst_group_cpu()  (Ingo Molnar)

    Standardize scheduler load-balancing function names on the
    sched_balance_() prefix.

    Also use 'dst' instead of 'idlest': while historically correct, today
    it's not really true anymore that we return the 'idlest' group or CPU;
    we sort by idle-exit latency and only return the idlest CPUs from the
    lowest-latency set of CPUs. The true 'idlest' CPUs often remain idle
    for a long time and are never returned as long as the system is
    under-loaded.

    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
    Link: https://lore.kernel.org/r/20240308111819.1101550-12-mingo@kernel.org
2024-03-12  sched/balancing: Rename newidle_balance() => sched_balance_newidle()  (Ingo Molnar)

    Standardize scheduler load-balancing function names on the
    sched_balance_() prefix.

    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
    Link: https://lore.kernel.org/r/20240308111819.1101550-11-mingo@kernel.org
2024-03-12  sched/balancing: Rename update_blocked_averages() => sched_balance_update_blocked_averages()  (Ingo Molnar)

    Standardize scheduler load-balancing function names on the
    sched_balance_() prefix.

    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
    Link: https://lore.kernel.org/r/20240308111819.1101550-10-mingo@kernel.org
2024-03-12  sched/balancing: Rename find_busiest_group() => sched_balance_find_src_group()  (Ingo Molnar)

    Make two naming changes:

    1) Standardize scheduler load-balancing function names on the
       sched_balance_() prefix.

    2) Similar to find_busiest_queue(), the find_busiest_group() naming
       has become a bit of a misnomer: the 'busiest' qualifier to this
       function was historically correct, but in the current code, in
       quite a few cases, we will not pick the 'busiest' group - but the
       best (possible) group we can balance from based on a complex set of
       constraints.

    So name it a bit more neutrally, similar to the 'src/dst' nomenclature
    we are already using when moving tasks between runqueues, and also use
    the sched_balance_ prefix: sched_balance_find_src_group().

    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
    Link: https://lore.kernel.org/r/20240308111819.1101550-9-mingo@kernel.org
2024-03-12  sched/balancing: Rename find_busiest_queue() => sched_balance_find_src_rq()  (Ingo Molnar)

    The find_busiest_queue() naming has two small quirks:

    - Scheduler functions that deal with runqueues usually have a rq_
      prefix or _rq postfix, but this function has neither.

    - Plus the 'busiest' qualifier to this function was historically
      correct, but has become somewhat of a misnomer: in quite a few cases
      we will not pick the busiest runqueue - but the best (possible)
      runqueue we can balance tasks from.

    So name it a bit more neutrally, similar to the 'src/dst' nomenclature
    we are already using when moving tasks between runqueues.

    To fix both quirks, and to standardize scheduler load-balancing
    function names on the sched_balance_() prefix, rename the function to
    sched_balance_find_src_rq().

    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
    Link: https://lore.kernel.org/r/20240308111819.1101550-7-mingo@kernel.org
2024-03-12  sched/balancing: Rename load_balance() => sched_balance_rq()  (Ingo Molnar)

    Standardize scheduler load-balancing function names on the
    sched_balance_() prefix.

    Also, load_balance() has become somewhat of a misnomer: historically it
    was the first and primary load-balancing function that was called, but
    with the introduction of sched domains, it's become a lower-layer
    function that balances runqueues.

    Rename it to sched_balance_rq() accordingly.

    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
    Link: https://lore.kernel.org/r/20240308111819.1101550-6-mingo@kernel.org
2024-03-12  sched/balancing: Rename rebalance_domains() => sched_balance_domains()  (Ingo Molnar)

    Standardize scheduler load-balancing function names on the
    sched_balance_() prefix.

    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
    Link: https://lore.kernel.org/r/20240308111819.1101550-5-mingo@kernel.org
2024-03-12  sched/balancing: Rename trigger_load_balance() => sched_balance_trigger()  (Ingo Molnar)

    Standardize scheduler load-balancing function names on the
    sched_balance_() prefix.

    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
    Link: https://lore.kernel.org/r/20240308111819.1101550-4-mingo@kernel.org
2024-03-12  sched/balancing: Rename scheduler_tick() => sched_tick()  (Ingo Molnar)

    - Standardize on prefixing scheduler-internal functions defined in
      <linux/sched.h> with the sched_*() prefix. scheduler_tick() was the
      only function using the scheduler_ prefix. Harmonize it.

    - The other reason to rename it is that the NOHZ scheduler tick
      handling functions are already named sched_tick_*(). Make
      'git grep sched_tick' more meaningful.

    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Acked-by: Valentin Schneider <vschneid@redhat.com>
    Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
    Link: https://lore.kernel.org/r/20240308111819.1101550-3-mingo@kernel.org
2024-03-12  sched/balancing: Rename run_rebalance_domains() => sched_balance_softirq()  (Ingo Molnar)

    run_rebalance_domains() is a misnomer, as it doesn't only run
    rebalance_domains(): since the introduction of the NOHZ code it also
    runs nohz_idle_balance().

    Rename it to sched_balance_softirq(), reflecting its more generic
    purpose and that it's a softirq handler.

    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Valentin Schneider <vschneid@redhat.com>
    Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
    Link: https://lore.kernel.org/r/20240308111819.1101550-2-mingo@kernel.org
2024-03-12  sched/balancing: Update comments in 'struct sg_lb_stats' and 'struct sd_lb_stats'  (Ingo Molnar)

    - Align for readability
    - Capitalize consistently

    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Valentin Schneider <vschneid@redhat.com>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Link: https://lore.kernel.org/r/20240308105901.1096078-11-mingo@kernel.org
2024-03-12  sched/balancing: Vertically align the comments of 'struct sg_lb_stats' and 'struct sd_lb_stats'  (Ingo Molnar)

    Make them easier to read.

    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Valentin Schneider <vschneid@redhat.com>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Link: https://lore.kernel.org/r/20240308105901.1096078-10-mingo@kernel.org
2024-03-12  sched/balancing: Update run_rebalance_domains() comments  (Ingo Molnar)

    The first sentence of the comment explaining run_rebalance_domains() is
    historic and not true anymore:

        * run_rebalance_domains is triggered when needed from the scheduler tick.

    ... contradicted/modified by the second sentence:

        * Also triggered for NOHZ idle balancing (with NOHZ_BALANCE_KICK set).

    Avoid that kind of confusion straight away and explain from what places
    sched_balance_softirq() is triggered.

    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Acked-by: Valentin Schneider <vschneid@redhat.com>
    Link: https://lore.kernel.org/r/20240308105901.1096078-9-mingo@kernel.org
2024-03-12  sched/balancing: Fix comments (trying to) refer to NOHZ_BALANCE_KICK  (Ingo Molnar)

    Fix two typos:

    - There's no such thing as 'nohz_balancing_kick'; the flag is named
      'BALANCE' and is capitalized: NOHZ_BALANCE_KICK.

    - Likewise, there's no such thing as a 'pending nohz_balance_kick'
      either; the NOHZ_BALANCE_KICK flag is all upper-case.

    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Valentin Schneider <vschneid@redhat.com>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Link: https://lore.kernel.org/r/20240308105901.1096078-8-mingo@kernel.org
2024-03-12  sched/balancing: Change comment formatting to not overlap Git conflict marker lines  (Ingo Molnar)

    So the scheduler has two such comment blocks, with '=' used as a double
    underline:

        /*
         * VRUNTIME
         * ========
         *

    '========' also happens to be a Git conflict marker, throwing off a
    simple search in an editor for this pattern.

    Change them to a '-------' type of underline instead - it looks just
    as good.

    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Valentin Schneider <vschneid@redhat.com>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Link: https://lore.kernel.org/r/20240308105901.1096078-7-mingo@kernel.org
2024-03-12  sched/debug: Increase SCHEDSTAT_VERSION to 16  (Ingo Molnar)

    We changed the order of definitions within 'enum cpu_idle_type', which
    changed the order of [CPU_MAX_IDLE_TYPES] columns in show_schedstat().

    Suggested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Cc: "Gautham R. Shenoy" <gautham.shenoy@amd.com>
    Link: https://lore.kernel.org/r/20240308105901.1096078-5-mingo@kernel.org
2024-03-12  sched/balancing: Change 'enum cpu_idle_type' to have more natural definitions  (Ingo Molnar)

    The cpu_idle_type enum has the confusingly inverted property that 'not
    idle' is 1, and 'idle' is 0. This resulted in a number of unnecessary
    complications in the code.

    Reverse the order, remove the CPU_NOT_IDLE type, and convert all code
    to a natural boolean form. It's much more readable:

        -       enum cpu_idle_type idle = this_rq->idle_balance ?
        -                                               CPU_IDLE : CPU_NOT_IDLE;
        +       enum cpu_idle_type idle = this_rq->idle_balance;

        --------------------------------

        -       if (env->idle == CPU_NOT_IDLE || !busiest->sum_nr_running)
        +       if (!env->idle || !busiest->sum_nr_running)

    And gets rid of the double negation in these usages:

        -               if (env->idle != CPU_NOT_IDLE && env->src_rq->nr_running <= 1)
        +               if (env->idle && env->src_rq->nr_running <= 1)

    Furthermore, this makes code much more obvious where there's
    differentiation between CPU_IDLE and CPU_NEWLY_IDLE.

    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Valentin Schneider <vschneid@redhat.com>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Cc: "Gautham R. Shenoy" <gautham.shenoy@amd.com>
    Link: https://lore.kernel.org/r/20240308105901.1096078-4-mingo@kernel.org
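    [Editorial illustration] The likely shape of the reordered enum,
    reconstructed from the description above (the internal placeholder
    name is illustrative):

        enum cpu_idle_type {
                __CPU_NOT_IDLE = 0,     /* evaluates as boolean 'false' */
                CPU_IDLE,
                CPU_NEWLY_IDLE,
                CPU_MAX_IDLE_TYPES
        };

    With 'not idle' at value zero, rq->idle_balance can be assigned and
    tested directly as a boolean, which is what enables the simplified
    conditions shown in the diffs above.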
2024-03-12  sched/balancing: Remove reliance on 'enum cpu_idle_type' ordering when iterating [CPU_MAX_IDLE_TYPES] arrays in show_schedstat()  (Shrikanth Hegde)

    show_schedstat() output breaks and doesn't print all entries if the
    ordering of the definitions in 'enum cpu_idle_type' is changed, because
    show_schedstat() assumes that 'CPU_IDLE' is 0.

    Fix it before we change the definition order & values.

    [ mingo: Added changelog. ]

    Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
    Link: https://lore.kernel.org/r/20240308105901.1096078-3-mingo@kernel.org
2024-03-12  sched/balancing: Switch the 'DEFINE_SPINLOCK(balancing)' spinlock into an 'atomic_t sched_balance_running' flag  (Ingo Molnar)

    The 'balancing' spinlock added in:

        08c183f31bdb ("[PATCH] sched: add option to serialize load balancing")

    ... is taken when the SD_SERIALIZE flag is set in a domain, but in
    reality it is a glorified global atomic flag serializing the
    load-balancing of those domains. It doesn't have any explicit locking
    semantics per se: we just spin_trylock() it.

    Turn it into a ... global atomic flag. This makes it more clear what
    is going on here, and reduces overhead and code size a bit:

        # kernel/sched/fair.o: [x86-64 defconfig]

           text    data     bss     dec     hex filename
          60730    2721     104   63555    f843 fair.o.before
          60718    2721     104   63543    f837 fair.o.after

    Also document the flag a bit.

    No change in functionality intended.

    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Reviewed-by: Valentin Schneider <vschneid@redhat.com>
    Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
    Link: https://lore.kernel.org/r/20240308105901.1096078-2-mingo@kernel.org
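    [Editorial illustration] A sketch of the resulting trylock-style usage,
    assuming acquire/release atomics provide the ordering; simplified, not
    the verbatim scheduler code:

        static atomic_t sched_balance_running = ATOMIC_INIT(0);

        static void balance_serialized_domain(struct sched_domain *sd)
        {
                int need_serialize = sd->flags & SD_SERIALIZE;

                if (need_serialize) {
                        /* trylock semantics: 0 -> 1, or bail out */
                        if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
                                return;
                }

                /* ... run the load-balancing pass for this domain ... */

                if (need_serialize)
                        atomic_set_release(&sched_balance_running, 0);
        }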
2024-03-11  Merge tag 'x86-core-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)

    Pull core x86 updates from Ingo Molnar:

     - The biggest change is the rework of the percpu code, to support the
       'Named Address Spaces' GCC feature, by Uros Bizjak:

        - This allows C code to access GS and FS segment relative memory
          via variables declared with such attributes, which allows the
          compiler to better optimize those accesses than the previous
          inline assembly code.

        - The series also includes a number of micro-optimizations for
          various percpu access methods, plus a number of cleanups of %gs
          accesses in assembly code.

        - These changes have been exposed to linux-next testing for the
          last ~5 months, with no known regressions in this area.

     - Fix/clean up __switch_to()'s broken but accidentally working
       handling of FPU switching - which also generates better code

     - Propagate more RIP-relative addressing in assembly code, to
       generate slightly better code

     - Rework the CPU mitigations Kconfig space to be less idiosyncratic,
       to make it easier for distros to follow & maintain these options

     - Rework the x86 idle code to cure RCU violations and to clean up
       the logic

     - Clean up the vDSO Makefile logic

     - Misc cleanups and fixes

    * tag 'x86-core-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (52 commits)
      x86/idle: Select idle routine only once
      x86/idle: Let prefer_mwait_c1_over_halt() return bool
      x86/idle: Cleanup idle_setup()
      x86/idle: Clean up idle selection
      x86/idle: Sanitize X86_BUG_AMD_E400 handling
      sched/idle: Conditionally handle tick broadcast in default_idle_call()
      x86: Increase brk randomness entropy for 64-bit systems
      x86/vdso: Move vDSO to mmap region
      x86/vdso/kbuild: Group non-standard build attributes and primary object file rules together
      x86/vdso: Fix rethunk patching for vdso-image-{32,64}.o
      x86/retpoline: Ensure default return thunk isn't used at runtime
      x86/vdso: Use CONFIG_COMPAT_32 to specify vdso32
      x86/vdso: Use $(addprefix ) instead of $(foreach )
      x86/vdso: Simplify obj-y addition
      x86/vdso: Consolidate targets and clean-files
      x86/bugs: Rename CONFIG_RETHUNK => CONFIG_MITIGATION_RETHUNK
      x86/bugs: Rename CONFIG_CPU_SRSO => CONFIG_MITIGATION_SRSO
      x86/bugs: Rename CONFIG_CPU_IBRS_ENTRY => CONFIG_MITIGATION_IBRS_ENTRY
      x86/bugs: Rename CONFIG_CPU_UNRET_ENTRY => CONFIG_MITIGATION_UNRET_ENTRY
      x86/bugs: Rename CONFIG_SLS => CONFIG_MITIGATION_SLS
      ...
2024-03-11  Merge tag 'sched-core-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)

    Pull scheduler updates from Ingo Molnar:

     - Fix inconsistency in misfit task load-balancing

     - Fix CPU isolation bugs in the task-wakeup logic

     - Rework and unify the sched_use_asym_prio() and sched_asym_prefer()
       logic

     - Clean up and simplify ->avg_* accesses

     - Misc cleanups and fixes

    * tag 'sched-core-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
      sched/topology: Rename SD_SHARE_PKG_RESOURCES to SD_SHARE_LLC
      sched/fair: Check the SD_ASYM_PACKING flag in sched_use_asym_prio()
      sched/fair: Rework sched_use_asym_prio() and sched_asym_prefer()
      sched/fair: Remove unused parameter from sched_asym()
      sched/topology: Remove duplicate descriptions from TOPOLOGY_SD_FLAGS
      sched/fair: Simplify the update_sd_pick_busiest() logic
      sched/fair: Do strict inequality check for busiest misfit task group
      sched/fair: Remove unnecessary goto in update_sd_lb_stats()
      sched/fair: Take the scheduling domain into account in select_idle_core()
      sched/fair: Take the scheduling domain into account in select_idle_smt()
      sched/fair: Add READ_ONCE() and use existing helper function to access ->avg_irq
      sched/fair: Use existing helper functions to access ->avg_rt and ->avg_dl
      sched/core: Simplify code by removing duplicate #ifdefs
2024-03-11  Merge tag 'locking-core-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)

    Pull locking updates from Ingo Molnar:

     - Micro-optimize local_xchg() and the rtmutex code on x86

     - Fix percpu-rwsem contention tracepoints

     - Simplify debugging Kconfig dependencies

     - Update/clarify the documentation of atomic primitives

     - Misc cleanups

    * tag 'locking-core-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
      locking/rtmutex: Use try_cmpxchg_relaxed() in mark_rt_mutex_waiters()
      locking/x86: Implement local_xchg() using CMPXCHG without the LOCK prefix
      locking/percpu-rwsem: Trigger contention tracepoints only if contended
      locking/rwsem: Make DEBUG_RWSEMS and PREEMPT_RT mutually exclusive
      locking/rwsem: Clarify that RWSEM_READER_OWNED is just a hint
      locking/mutex: Simplify <linux/mutex.h>
      locking/qspinlock: Fix 'wait_early' set but not used warning
      locking/atomic: scripts: Clarify ordering of conditional atomics
2024-03-11  Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next  (Jakub Kicinski)

    Alexei Starovoitov says:

    ====================
    pull-request: bpf-next 2024-03-11

    We've added 59 non-merge commits during the last 9 day(s) which contain
    a total of 88 files changed, 4181 insertions(+), 590 deletions(-).

    The main changes are:

    1) Enforce VM_IOREMAP flag and range in ioremap_page_range and
       introduce VM_SPARSE kind and vm_area_[un]map_pages to be used in
       bpf_arena, from Alexei.

    2) Introduce bpf_arena, which is a sparse shared memory region between
       the bpf program and user space, where structures inside the arena
       can have pointers to other areas of the arena, and pointers work
       seamlessly for both user-space programs and bpf programs, from
       Alexei and Andrii.

    3) Introduce the may_goto instruction that is a contract between the
       verifier and the program. The verifier allows the program to loop
       assuming it's behaving well, but reserves the right to terminate
       it, from Alexei.

    4) Use IETF format for field definitions in the BPF standard document,
       from Dave.

    5) Extend struct_ops libbpf APIs to allow specifying version suffixes
       for struct_ops map types, share the same BPF program between
       several map definitions, and other improvements, from Eduard.

    6) Enable struct_ops support for more than one page in trampolines,
       from Kui-Feng.

    7) Support kCFI + BPF on riscv64, from Puranjay.

    8) Use bpf_prog_pack for arm64 bpf trampoline, from Puranjay.

    9) Fix roundup_pow_of_two undefined behavior on 32-bit archs, from
       Toke.
    ====================

    Link: https://lore.kernel.org/r/20240312003646.8692-1-alexei.starovoitov@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-11  bpf: move sleepable flag from bpf_prog_aux to bpf_prog  (Andrii Nakryiko)

    prog->aux->sleepable is checked very frequently as part of (some) BPF
    program run hot paths. So this extra aux indirection seems wasteful and
    on busy systems might cause unnecessary memory cache misses.

    Let's move the sleepable flag into prog itself to eliminate the
    unnecessary pointer dereference.

    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Acked-by: Jiri Olsa <jolsa@kernel.org>
    Message-ID: <20240309004739.2961431-1-andrii@kernel.org>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
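    [Editorial illustration] The hot-path effect can be shown with a
    before/after sketch; the surrounding call is illustrative, not taken
    from the commit:

        /* Before: two dependent loads on the program-run hot path. */
        if (prog->aux->sleepable)
                rcu_read_lock_trace();

        /* After: a single load from the bpf_prog struct itself. */
        if (prog->sleepable)
                rcu_read_lock_trace();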
2024-03-11  bpf: hardcode BPF_PROG_PACK_SIZE to 2MB * num_possible_nodes()  (Puranjay Mohan)

    On some architectures, like ARM64, PMD_SIZE can be really large in some
    configurations: with CONFIG_ARM64_64K_PAGES=y, PMD_SIZE is 512MB.

    Use 2MB * num_possible_nodes() as the size for allocations done through
    the prog pack allocator. On most architectures, PMD_SIZE will be equal
    to 2MB in case of 4KB pages and will be greater than 2MB for bigger
    page sizes.

    Fixes: ea2babac63d4 ("bpf: Simplify bpf_prog_pack_[size|mask]")
    Reported-by: "kernelci.org bot" <bot@kernelci.org>
    Closes: https://lore.kernel.org/all/7e216c88-77ee-47b8-becc-a0f780868d3c@sirena.org.uk/
    Reported-by: kernel test robot <lkp@intel.com>
    Closes: https://lore.kernel.org/oe-kbuild-all/202403092219.dhgcuz2G-lkp@intel.com/
    Suggested-by: Song Liu <song@kernel.org>
    Signed-off-by: Puranjay Mohan <puranjay12@gmail.com>
    Message-ID: <20240311122722.86232-1-puranjay12@gmail.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
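    [Editorial illustration] The change described above most plausibly
    boils down to a single define; a sketch, assuming the kernel's SZ_2M
    and num_possible_nodes() helpers:

        #include <linux/sizes.h>
        #include <linux/nodemask.h>

        /* Sketch: fixed pack size, no longer derived from PMD_SIZE. */
        #define BPF_PROG_PACK_SIZE      (SZ_2M * num_possible_nodes())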
2024-03-11  Merge tag 'x86-apic-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)

    Pull x86 APIC updates from Thomas Gleixner:
     "Rework of APIC enumeration and topology evaluation.

      The current implementation has a couple of shortcomings:

      - It fails to handle hybrid systems correctly.

      - The APIC registration code which handles CPU number assignments is
        in the middle of the APIC code and detached from the topology
        evaluation.

      - The various mechanisms which enumerate APICs, ACPI, MPPARSE and
        guest specific ones, tweak global variables as they see fit or in
        case of XENPV just hack around the generic mechanisms completely.

      - The CPUID topology evaluation code is sprinkled all over the
        vendor code and reevaluates global variables on every hotplug
        operation.

      - There is no way to analyze topology on the boot CPU before
        bringing up the APs. This causes problems for infrastructure like
        PERF which needs to size certain aspects upfront, or could be
        simplified if that would be possible.

      - The APIC admission and CPU number association logic is
        incomprehensible and overly complex, and needs to be kept around
        after boot instead of completing this right after the APIC
        enumeration.

      This update addresses these shortcomings with the following changes:

      - Rework the CPUID evaluation code so it is common for all vendors
        and provides information about the APIC ID segments in a uniform
        way independent of the number of segments (Thread, Core, Module,
        ..., Die, Package), so that this information can be computed
        instead of rewriting global variables of dubious value over and
        over.

      - A few cleanups and simplifications of the APIC, IO/APIC and
        related interfaces to prepare for the topology evaluation changes.

      - Separation of the parser stages, so the early evaluation which
        tries to find the APIC address can be separately overridden from
        the late evaluation which enumerates and registers the local APIC,
        as further preparation for sanitizing the topology evaluation.

      - A new registration and admission logic which

          - encapsulates the inner workings so that parsers and guest
            logic can no longer fiddle in it

          - uses the APIC ID segments to build topology bitmaps at
            registration time

          - provides a sane admission logic

          - allows to detect the crash kernel case, where CPU0 does not
            run on the real BSP, automatically. This is required to
            prevent sending INIT/SIPI sequences to the real BSP which
            would reset the whole machine. This was so far handled by a
            tedious command line parameter, which does not even work in
            nested crash scenarios.

          - associates CPU number after the enumeration completed and
            prevents the late registration of APICs, which was somehow
            tolerated before.

      - Converting all parsers and guest enumeration mechanisms over to
        the new interfaces. This allows to get rid of all global variable
        tweaking from the parsers and enumeration mechanisms and sanitizes
        the XEN[PV] handling so it can use CPUID evaluation for the first
        time.

      - Mopping up existing sins by taking the information from the APIC
        ID segment bitmaps. This evaluates hybrid systems correctly on the
        boot CPU and allows for cleanups and fixes in the related drivers,
        e.g. PERF.

      The series has been extensively tested and the minimal late fallout
      due to a broken ACPI/MADT table has been addressed by tightening the
      admission logic further"

    * tag 'x86-apic-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (76 commits)
      x86/topology: Ignore non-present APIC IDs in a present package
      x86/apic: Build the x86 topology enumeration functions on UP APIC builds too
      smp: Provide 'setup_max_cpus' definition on UP too
      smp: Avoid 'setup_max_cpus' namespace collision/shadowing
      x86/bugs: Use fixed addressing for VERW operand
      x86/cpu/topology: Get rid of cpuinfo::x86_max_cores
      x86/cpu/topology: Provide __num_[cores|threads]_per_package
      x86/cpu/topology: Rename topology_max_die_per_package()
      x86/cpu/topology: Rename smp_num_siblings
      x86/cpu/topology: Retrieve cores per package from topology bitmaps
      x86/cpu/topology: Use topology logical mapping mechanism
      x86/cpu/topology: Provide logical pkg/die mapping
      x86/cpu/topology: Simplify cpu_mark_primary_thread()
      x86/cpu/topology: Mop up primary thread mask handling
      x86/cpu/topology: Use topology bitmaps for sizing
      x86/cpu/topology: Let XEN/PV use topology from CPUID/MADT
      x86/xen/smp_pv: Count number of vCPUs early
      x86/cpu/topology: Assign hotpluggable CPUIDs during init
      x86/cpu/topology: Reject unknown APIC IDs on ACPI hotplug
      x86/topology: Add a mechanism to track topology via APIC IDs
      ...
2024-03-11  bpf: Recognize btf_decl_tag("arg:arena") as PTR_TO_ARENA.  (Alexei Starovoitov)

    In global bpf functions recognize btf_decl_tag("arg:arena") as
    PTR_TO_ARENA.

    Note, when the verifier sees:

        __weak void foo(struct bar *p)

    it recognizes 'p' as PTR_TO_MEM, and 'struct bar' has to be a struct
    with scalars. Hence the only way to use arena pointers in global
    functions is to tag them with "arg:arena".

    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Link: https://lore.kernel.org/bpf/20240308010812.89848-7-alexei.starovoitov@gmail.com
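    [Editorial illustration] A hypothetical usage sketch in BPF C; the
    __arg_arena macro name and its placement here are illustrative
    assumptions, not taken from the commit:

        /* Hypothetical helper macro wrapping the decl tag. */
        #define __arg_arena __attribute__((btf_decl_tag("arg:arena")))

        /* Global (__weak) function whose argument is an arena pointer. */
        __weak int touch_first_byte(char *p __arg_arena)
        {
                if (!p)
                        return 0;
                return p[0];    /* verified as PTR_TO_ARENA */
        }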
2024-03-11  bpf: Recognize addr_space_cast instruction in the verifier.  (Alexei Starovoitov)

    rY = addr_space_cast(rX, 0, 1) tells the verifier that rY->type =
    PTR_TO_ARENA. Any further operations on PTR_TO_ARENA register have to
    be in the 32-bit domain. The verifier will mark load/store through
    PTR_TO_ARENA with PROBE_MEM32. JIT will generate them as
    kern_vm_start + 32bit_addr memory accesses.

    rY = addr_space_cast(rX, 1, 0) tells the verifier that rY->type =
    unknown scalar. If arena->map_flags has BPF_F_NO_USER_CONV set, then
    convert cast_user to mov32 as well. Otherwise JIT will convert it to:

        rY = (u32)rX;
        if (rY)
            rY |= arena->user_vm_start & ~(u64)~0U;

    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Link: https://lore.kernel.org/bpf/20240308010812.89848-6-alexei.starovoitov@gmail.com
2024-03-11  bpf: Add x86-64 JIT support for bpf_addr_space_cast instruction.  (Alexei Starovoitov)

    LLVM generates the bpf_addr_space_cast instruction while translating
    pointers between the native (zero) address space and
    __attribute__((address_space(N))). The addr_space=1 is reserved as the
    bpf_arena address space.

    rY = addr_space_cast(rX, 0, 1) is processed by the verifier and
    converted to a normal 32-bit move: wX = wY.

    rY = addr_space_cast(rX, 1, 0) has to be converted by JIT:

        aux_reg = upper_32_bits of arena->user_vm_start
        aux_reg <<= 32
        wX = wY              // clear upper 32 bits of dst register
        if (wX)              // if not zero add upper bits of user_vm_start
            wX |= aux_reg

    JIT can do it more efficiently:

        mov dst_reg32, src_reg32    // 32-bit move
        shl dst_reg, 32
        or dst_reg, user_vm_start
        rol dst_reg, 32
        xor r11, r11
        test dst_reg32, dst_reg32   // check if lower 32-bit are zero
        cmove r11, dst_reg          // if so, set dst_reg to zero
                                    // Intel swapped src/dst register encoding in CMOVcc

    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Acked-by: Eduard Zingerman <eddyz87@gmail.com>
    Link: https://lore.kernel.org/bpf/20240308010812.89848-5-alexei.starovoitov@gmail.com
2024-03-11  bpf: Disasm support for addr_space_cast instruction.  (Alexei Starovoitov)

    LLVM generates

        rX = addr_space_cast(rY, dst_addr_space, src_addr_space)

    instructions when pointers in a non-zero address space are used by the
    bpf program. Recognize this insn in uapi and in the bpf disassembler.

    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Link: https://lore.kernel.org/bpf/20240308010812.89848-3-alexei.starovoitov@gmail.com
2024-03-11  bpf: Introduce bpf_arena.  (Alexei Starovoitov)

    Introduce bpf_arena, which is a sparse shared memory region between the
    bpf program and user space.

    Use cases:

    1. User space mmap-s bpf_arena and uses it as a traditional mmap-ed
       anonymous region, like memcached or any key/value storage. The bpf
       program implements an in-kernel accelerator. XDP prog can search
       for a key in bpf_arena and return a value without going to user
       space.

    2. The bpf program builds arbitrary data structures in bpf_arena (hash
       tables, rb-trees, sparse arrays), while user space consumes it.

    3. bpf_arena is a "heap" of memory from the bpf program's point of
       view. The user space may mmap it, but bpf program will not convert
       pointers to user base at run-time to improve bpf program speed.

    Initially, the kernel vm_area and user vma are not populated. User
    space can fault in pages within the range. While servicing a page
    fault, bpf_arena logic will insert a new page into the kernel and user
    vmas. The bpf program can allocate pages from that region via
    bpf_arena_alloc_pages(). This kernel function will insert pages into
    the kernel vm_area. The subsequent fault-in from user space will
    populate that page into the user vma. The BPF_F_SEGV_ON_FAULT flag at
    arena creation time can be used to prevent fault-in from user space.
    In such a case, if a page is not allocated by the bpf program and not
    present in the kernel vm_area, the user process will segfault. This is
    useful for use cases 2 and 3 above.

    bpf_arena_alloc_pages() is similar to user space mmap(). It allocates
    pages either at a specific address within the arena or allocates a
    range with the maple tree. bpf_arena_free_pages() is analogous to
    munmap(), which frees pages and removes the range from the kernel
    vm_area and from user process vmas.

    bpf_arena can be used as a bpf program "heap" of up to 4GB. The speed
    of bpf program is more important than ease of sharing with user space.
    This is use case 3. In such a case, the BPF_F_NO_USER_CONV flag is
    recommended. It will tell the verifier to treat the
    rX = bpf_arena_cast_user(rY) instruction as a 32-bit move wX = wY,
    which will improve bpf prog performance. Otherwise, bpf_arena_cast_user
    is translated by JIT to conditionally add the upper 32 bits of user
    vm_start (if the pointer is not NULL) to arena pointers before they
    are stored into memory. This way, user space sees them as valid 64-bit
    pointers.

    Diff https://github.com/llvm/llvm-project/pull/84410 enables the LLVM
    BPF backend to generate the bpf_addr_space_cast() instruction to cast
    pointers between address_space(1), which is reserved for bpf_arena
    pointers, and the default address space zero. All arena pointers in a
    bpf program written in C language are tagged as
    __attribute__((address_space(1))). Hence, clang provides helpful
    diagnostics when pointers cross address space. Libbpf and the kernel
    support only address_space == 1. All other address space identifiers
    are reserved.

    rX = bpf_addr_space_cast(rY, /* dst_as */ 1, /* src_as */ 0) tells the
    verifier that rX->type = PTR_TO_ARENA. Any further operations on
    PTR_TO_ARENA register have to be in the 32-bit domain. The verifier
    will mark load/store through PTR_TO_ARENA with PROBE_MEM32. JIT will
    generate them as kern_vm_start + 32bit_addr memory accesses. The
    behavior is similar to copy_from_kernel_nofault() except that no
    address checks are necessary. The address is guaranteed to be in the
    4GB range. If the page is not present, the destination register is
    zeroed on read, and the operation is ignored on write.

    rX = bpf_addr_space_cast(rY, 0, 1) tells the verifier that rX->type =
    unknown scalar. If arena->map_flags has BPF_F_NO_USER_CONV set, then
    the verifier converts such cast instructions to mov32. Otherwise, JIT
    will emit native code equivalent to:

        rX = (u32)rY;
        if (rY)
            rX |= clear_lo32_bits(arena->user_vm_start); /* replace hi32 bits in rX */

    After such conversion, the pointer becomes a valid user pointer within
    bpf_arena range. The user process can access data structures created
    in bpf_arena without any additional computations. For example, a
    linked list built by a bpf program can be walked natively by user
    space.

    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Reviewed-by: Barret Rhoden <brho@google.com>
    Link: https://lore.kernel.org/bpf/20240308010812.89848-2-alexei.starovoitov@gmail.com
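    [Editorial illustration] For context, a hypothetical libbpf-style
    arena map definition, as a user of this feature might write it; names
    and sizes are illustrative:

        struct {
                __uint(type, BPF_MAP_TYPE_ARENA);
                __uint(map_flags, BPF_F_MMAPABLE);
                __uint(max_entries, 100);       /* number of arena pages */
        } arena SEC(".maps");

    User space would then mmap() the map fd to get a view of the same
    pages the bpf program allocates via bpf_arena_alloc_pages().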
2024-03-11  Merge tag 'timers-core-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)

    Pull timer updates from Thomas Gleixner:
     "A large set of updates and features for timers and timekeeping:

      - The hierarchical timer pull model

        When timer wheel timers are armed they are placed into the timer
        wheel of a CPU which is likely to be busy at the time of expiry.
        This is done to avoid wakeups on potentially idle CPUs.

        This is wrong in several aspects:

         1) The heuristics to select the target CPU are wrong by
            definition, as the chance to get the prediction right is
            close to zero.

         2) Due to #1 it is possible that timers are accumulated on a
            single target CPU.

         3) The required computation in the enqueue path is just overhead
            for dubious value, especially under the consideration that
            the vast majority of timer wheel timers are either canceled
            or rearmed before they expire.

        The timer pull model avoids the above by removing the target
        computation on enqueue and queueing timers always on the CPU on
        which they get armed. This is achieved by having separate wheels
        for CPU-pinned timers and global timers which do not care about
        where they expire.

        As long as a CPU is busy it handles both the pinned and the
        global timers which are queued on the CPU local timer wheels.

        When a CPU goes idle it evaluates its own timer wheels:

        - If the first expiring timer is a pinned timer, then the global
          timers can be ignored as the CPU will wake up before they
          expire.

        - If the first expiring timer is a global timer, then the expiry
          time is propagated into the timer pull hierarchy and the CPU
          makes sure to wake up for the first pinned timer.

        The timer pull hierarchy organizes CPUs in groups of eight at the
        lowest level, and at the next levels groups of eight groups, up
        to the point where no further aggregation of groups is required,
        i.e. the number of levels is log8(NR_CPUS). The magic number of
        eight has been established by experimentation, but can be
        adjusted if needed.

        In each group one busy CPU acts as the migrator. It's only one
        CPU to avoid lock contention on remote timer wheels. The migrator
        CPU checks in its own timer wheel handling whether there are
        other CPUs in the group which have gone idle and have global
        timers to expire. If there are global timers to expire, the
        migrator locks the remote CPU timer wheel and handles the expiry.
        Depending on the group level in the hierarchy, this handling can
        require to walk the hierarchy downwards to the CPU level.

        Special care is taken when the last CPU goes idle. At this point
        the CPU is the systemwide migrator at the top of the hierarchy
        and it therefore cannot delegate to the hierarchy. It needs to
        arm its own timer device to expire either at the first expiring
        timer in the hierarchy or at the first CPU local timer, whichever
        expires first.

        This completely removes the overhead from the enqueue path, which
        is e.g. for networking a true hotpath, and trades it for a
        slightly more complex idle path.

        This has been in development for a couple of years and the final
        series has been extensively tested by various teams from silicon
        vendors and ran through extensive CI.

        There have been slight performance improvements observed on
        network centric workloads and an Intel team confirmed that this
        allows them to power down a die completely on a multi-die socket
        for the first time in a mostly idle scenario.

        There is only one outstanding ~1.5% regression on a specific
        overloaded netperf test which is currently investigated, but the
        rest is either positive or neutral performance wise and positive
        on the power management side.

      - Fixes for the timekeeping interpolation code for cross-timestamps:
        cross-timestamps are used for PTP to get snapshots from hardware
        timers and interpolate them back to clock MONOTONIC. The changes
        address a few corner cases in the interpolation code which got the
        math and logic wrong.

      - Simplification of the clocksource watchdog retry logic to
        automatically adjust to handle larger systems correctly instead of
        having more incomprehensible command line parameters.

      - Treewide consolidation of the VDSO data structures.

      - The usual small improvements and cleanups all over the place"

    * tag 'timers-core-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (62 commits)
      timer/migration: Fix quick check reporting late expiry
      tick/sched: Fix build failure for CONFIG_NO_HZ_COMMON=n
      vdso/datapage: Quick fix - use asm/page-def.h for ARM64
      timers: Assert no next dyntick timer look-up while CPU is offline
      tick: Assume timekeeping is correctly handed over upon last offline idle call
      tick: Shut down low-res tick from dying CPU
      tick: Split nohz and highres features from nohz_mode
      tick: Move individual bit features to debuggable mask accesses
      tick: Move got_idle_tick away from common flags
      tick: Assume the tick can't be stopped in NOHZ_MODE_INACTIVE mode
      tick: Move broadcast cancellation up to CPUHP_AP_TICK_DYING
      tick: Move tick cancellation up to CPUHP_AP_TICK_DYING
      tick: Start centralizing tick related CPU hotplug operations
      tick/sched: Don't clear ts::next_tick again in can_stop_idle_tick()
      tick/sched: Rename tick_nohz_stop_sched_tick() to tick_nohz_full_stop_tick()
      tick: Use IS_ENABLED() whenever possible
      tick/sched: Remove useless oneshot ifdeffery
      tick/nohz: Remove duplicate between lowres and highres handlers
      tick/nohz: Remove duplicate between tick_nohz_switch_to_nohz() and tick_setup_sched_timer()
      hrtimer: Select housekeeping CPU during migration
      ...
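    [Editorial illustration] The groups-of-eight fanout above makes the
    hierarchy depth roughly ceil(log8(NR_CPUS)). A small illustrative
    helper; the function is hypothetical, not kernel code:

        /* Hypothetical: hierarchy depth for a groups-of-eight fanout. */
        static unsigned int tmigr_levels(unsigned int nr_cpus)
        {
                unsigned int levels = 1;
                unsigned long span = 8;    /* CPUs covered at the first level */

                while (span < nr_cpus) {
                        span *= 8;         /* each level groups eight groups */
                        levels++;
                }
                return levels;             /* e.g. 256 CPUs -> 3 levels */
        }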
2024-03-11  Merge tag 'timers-ptp-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)

    Pull clocksource updates from Thomas Gleixner:
     "Updates for timekeeping and PTP core.

      The cross-timestamp mechanism, which allows to correlate hardware
      clocks, uses clocksource pointers for describing the correlation.

      That's suboptimal as drivers need to obtain the pointer, which
      requires needless exports and exposing internals. This can all be
      completely avoided by assigning clocksource IDs and using them for
      describing the correlated clock source.

      So this adds clocksource IDs to all clocksources in the tree which
      can be exposed to this mechanism, and removes the pointer and now
      needless exports.

      A related improvement for the core and the correlation handling has
      not made it this time, but is expected to get ready for the next
      round"

    * tag 'timers-ptp-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
      kvmclock: Unexport kvmclock clocksource
      treewide: Remove system_counterval_t.cs, which is never read
      timekeeping: Evaluate system_counterval_t.cs_id instead of .cs
      ptp/kvm, arm_arch_timer: Set system_counterval_t.cs_id to constant
      x86/kvm, ptp/kvm: Add clocksource ID, set system_counterval_t.cs_id
      x86/tsc: Add clocksource ID, set system_counterval_t.cs_id
      timekeeping: Add clocksource ID to struct system_counterval_t
      x86/tsc: Correct kernel-doc notation
2024-03-11  Merge tag 'smp-core-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)

    Pull cpu core updates from Thomas Gleixner:
     "A small boring set of cleanups for the SMP and CPU hotplug code"

    * tag 'smp-core-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
      cpu: Remove stray semicolon
      smp: Make __smp_processor_id() 0-argument macro
      cpu: Mark cpu_possible_mask as __ro_after_init
      kernel/cpu: Convert snprintf() to sysfs_emit()
      cpu/hotplug: Delete an extraneous kernel-doc description
2024-03-11  Merge tag 'irq-msi-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)

    Pull MSI updates from Thomas Gleixner:
     "Updates for the MSI interrupt subsystem and initial RISC-V MSI
      support.

      The core changes have been adopted from previous work which converted
      ARM[64] to the new per-device MSI domain model, which was merged to
      support multiple MSI domains per device. The ARM[64] changes are
      being worked on too, but are not ready yet. The core and platform-MSI
      changes have been split out, to not hold up RISC-V and to avoid
      having RISC-V build on interfaces which are scheduled for removal.

      The core support provides new interfaces to handle wire-to-MSI
      bridges in a straightforward way, and introduces new platform-MSI
      interfaces which are built on top of the per-device MSI domain model.
      Once ARM[64] is converted over, the old platform-MSI interfaces and
      the related ugliness in the MSI core code will be removed.

      The actual MSI parts for RISC-V were finalized late and have been
      postponed to the next merge window.

      Drivers:

      - Add a new driver for the Andes hart-level interrupt controller

      - Rework the SiFive PLIC driver to prepare for MSI support

      - Expand the RISC-V INTC driver to support the new RISC-V AIA
        controller which provides the basis for MSI on RISC-V

      - A few fixes for the fallout of the core changes"

    * tag 'irq-msi-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (29 commits)
      irqchip/riscv-intc: Fix low-level interrupt handler setup for AIA
      x86/apic/msi: Use DOMAIN_BUS_GENERIC_MSI for HPET/IO-APIC domain search
      genirq/matrix: Dynamic bitmap allocation
      irqchip/riscv-intc: Add support for RISC-V AIA
      irqchip/sifive-plic: Improve locking safety by using irqsave/irqrestore
      irqchip/sifive-plic: Parse number of interrupts and contexts early in plic_probe()
      irqchip/sifive-plic: Cleanup PLIC contexts upon irqdomain creation failure
      irqchip/sifive-plic: Use riscv_get_intc_hwnode() to get parent fwnode
      irqchip/sifive-plic: Use devm_xyz() for managed allocation
      irqchip/sifive-plic: Use dev_xyz() in-place of pr_xyz()
      irqchip/sifive-plic: Convert PLIC driver into a platform driver
      irqchip/riscv-intc: Introduce Andes hart-level interrupt controller
      irqchip/riscv-intc: Allow large non-standard interrupt number
      genirq/irqdomain: Don't call ops->select for DOMAIN_BUS_ANY tokens
      irqchip/imx-intmux: Handle pure domain searches correctly
      genirq/msi: Provide MSI_FLAG_PARENT_PM_DEV
      genirq/irqdomain: Reroute device MSI create_mapping
      genirq/msi: Provide allocation/free functions for "wired" MSI interrupts
      genirq/msi: Optionally use dev->fwnode for device domain
      genirq/msi: Provide DOMAIN_BUS_WIRED_TO_MSI
      ...
2024-03-11  Merge tag 'irq-core-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)

    Pull irq updates from Thomas Gleixner:
     "Core:

      - Make affinity changes take effect immediately for interrupt
        threads. This reduces the impact on isolated CPUs as it pulls over
        the thread right away, instead of doing it after the next hardware
        interrupt arrived.

      - Cleanup and improvements for the interrupt chip simulator

      - Deduplication of the interrupt descriptor initialization code so
        the sparse and non-sparse modes share more code.

      Drivers:

      - A set of conversions to platform_driver::remove_new() which gets
        rid of the pointless return value.

      - A new driver for the Starfive JH8100 SoC

      - Support for Amlogic-T7 SoCs

      - Improvement for the interrupt handling and EOI management for the
        loongson interrupt controller.

      - The usual fixes and improvements all over the place"

    * tag 'irq-core-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
      irqchip/ts4800: Convert to platform_driver::remove_new() callback
      irqchip/stm32-exti: Convert to platform_driver::remove_new() callback
      irqchip/renesas-rza1: Convert to platform_driver::remove_new() callback
      irqchip/renesas-irqc: Convert to platform_driver::remove_new() callback
      irqchip/renesas-intc-irqpin: Convert to platform_driver::remove_new() callback
      irqchip/pruss-intc: Convert to platform_driver::remove_new() callback
      irqchip/mvebu-pic: Convert to platform_driver::remove_new() callback
      irqchip/madera: Convert to platform_driver::remove_new() callback
      irqchip/ls-scfg-msi: Convert to platform_driver::remove_new() callback
      irqchip/keystone: Convert to platform_driver::remove_new() callback
      irqchip/imx-irqsteer: Convert to platform_driver::remove_new() callback
      irqchip/imx-intmux: Convert to platform_driver::remove_new() callback
      irqchip/imgpdc: Convert to platform_driver::remove_new() callback
      irqchip: Add StarFive external interrupt controller
      dt-bindings: interrupt-controller: Add starfive,jh8100-intc
      arm64: dts: Add gpio_intc node for Amlogic-T7 SoCs
      irqchip/meson-gpio: Add support for Amlogic-T7 SoCs
      dt-bindings: interrupt-controller: Add support for Amlogic-T7 SoCs
      irqchip/vic: Fix a kernel-doc warning
      genirq: Wake interrupt threads immediately when changing affinity
      ...
2024-03-11  Merge tag 'cgroup-for-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup  (Linus Torvalds)

    Pull cgroup updates from Tejun Heo:
     "A quiet cycle. One trivial doc update patch. Two patches to drop the
      now-defunct memory_spread_slab feature from cgroup1 cpuset"

    * tag 'cgroup-for-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
      cgroup/cpuset: Mark memory_spread_slab as obsolete
      cgroup/cpuset: Remove cpuset_do_slab_mem_spread()
      docs: cgroup-v1: add missing code-block tags
2024-03-11  Merge tag 'wq-for-6.9-bh-conversions' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq  (Linus Torvalds)

    Pull workqueue BH conversions from Tejun Heo:
     "This contains two patches that convert tasklet users to BH
      workqueues: backtracetest and usb hcd.

      DM conversions are being routed through the respective subsystem
      tree. Hopefully, the next cycle will see a lot more conversions"

    * tag 'wq-for-6.9-bh-conversions' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
      usb: core: hcd: Convert from tasklet to BH workqueue
      backtracetest: Convert from tasklet to BH workqueue