path: root/kernel
2024-12-17  Merge tag 'xsa465+xsa466-6.13-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip (Linus Torvalds)

Pull xen fixes from Juergen Gross:
 "Fix xen netfront crash (XSA-465) and avoid using the hypercall page
  that doesn't do speculation mitigations (XSA-466)"

* tag 'xsa465+xsa466-6.13-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  x86/xen: remove hypercall page
  x86/xen: use new hypercall functions instead of hypercall page
  x86/xen: add central hypercall functions
  x86/xen: don't do PV iret hypercall through hypercall page
  x86/static-call: provide a way to do very early static-call updates
  objtool/x86: allow syscall instruction
  x86: make get_cpu_vendor() accessible from Xen code
  xen/netfront: fix crash when removing device
2024-12-17  pidfs: lookup pid through rbtree (Christian Brauner)
The new pid inode number allocation scheme is neat but I overlooked a possible, even though unlikely, attack that can be used to trigger an overflow on both 32bit and 64bit.

A unique 64 bit identifier was constructed for each struct pid by combining a 32 bit idr number with a 32 bit generation number. The 32 bit number was allocated using the idr_alloc_cyclic() infrastructure. When the idr wrapped around, a 32 bit wraparound counter was incremented. The 32 bit wraparound counter served as the upper 32 bits and the allocated idr number as the lower 32 bits.

Since the idr can only allocate up to INT_MAX entries, every time a wraparound happens INT_MAX - 1 entries are lost (ignoring that numbering always starts at 2 to avoid theoretical collisions with the root inode number). If userspace fully populates the idr and puts itself in control of two entries, one somewhere in the middle and the other at the INT_MAX slot, then it is possible to overflow the wraparound counter. That is probably difficult to pull off but the mere possibility is annoying.

The problem could be contained to 32 bit by switching to a data structure such as the maple tree that allows allocating 64 bit numbers on 64 bit machines. That would leave 32 bit in the lurch but that probably doesn't matter that much. The other problem is that removing entries from the maple tree is somewhat non-trivial because the removal code can be called under the irq write lock of tasklist_lock and in irq{save,restore} code.

Instead, allocate unique identifiers for struct pid by simply incrementing a 64 bit counter and insert each struct pid into the rbtree so it can be looked up to decode file handles without leaking actual pids across pid namespaces in file handles.

On both 64 bit and 32 bit the same 64 bit identifier is used to look up struct pid in the rbtree. On 64 bit the unique identifier for struct pid simply becomes the inode number. Comparing two pidfds continues to be as simple as comparing inode numbers.

On 32 bit the 64 bit number assigned to struct pid is split into two 32 bit numbers. The lower 32 bits are used as the inode number and the upper 32 bits are used as the inode generation number. Whenever a wraparound happens on 32 bit the 64 bit number will be incremented by 2 so inode numbering starts at 2 again.

When a wraparound happens on 32 bit, multiple pidfds with the same inode number are likely to exist. This isn't a problem since before pidfs pidfds used the anonymous inode, meaning all pidfds had the same inode number. On 32 bit userspace can thus reconstruct the 64 bit identifier by retrieving both the inode number and the inode generation number to compare, or use file handles. This gives the same guarantees on both 32 bit and 64 bit.

Link: https://lore.kernel.org/r/20241214-gekoppelt-erdarbeiten-a1f9a982a5a6@brauner
Signed-off-by: Christian Brauner <brauner@kernel.org>
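For illustration, the 32 bit split described above amounts to something like the following sketch (variable names are hypothetical, not the actual pidfs code):

    /* Sketch: decompose the 64 bit identifier on 32 bit. */
    u64 id  = pid_counter;        /* monotonic; bumped past multiples of 2^32 so low word restarts at 2 */
    u32 ino = (u32)id;            /* lower 32 bits -> inode number */
    u32 gen = (u32)(id >> 32);    /* upper 32 bits -> inode generation */

Userspace on 32 bit then compares (inode number, generation) pairs, or file handles, to recover the 64 bit comparison described above.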
2024-12-16  exec: Make sure task->comm is always NUL-terminated (Kees Cook)
Using strscpy() meant that the final character in task->comm may be non-NUL for a moment before the "string too long" truncation happens.

Instead of adding a new use of the ambiguous strncpy(), we'd want to use memtostr_pad(), which enforces being able to check at compile time that sizes are sensible, but this requires being able to see string buffer lengths.

Instead of trying to inline __set_task_comm() (which needs to call trace and perf functions), just open-code it. But to make sure we're always safe, add compile-time checking like we already do for get_task_comm().

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Suggested-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Kees Cook <kees@kernel.org>
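As a rough sketch of the compile-time checking being referred to (illustrative only, not the actual fs/exec.c change; "src_comm" is a placeholder for a fixed-size char array):

    char comm[TASK_COMM_LEN];

    /* memtostr_pad() needs both array sizes visible at compile time;
     * the result is always NUL-terminated and NUL-padded. */
    memtostr_pad(comm, src_comm);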
2024-12-16  ftrace: Do not find "true_parent" if HAVE_DYNAMIC_FTRACE_WITH_ARGS is not set (Steven Rostedt)
When function tracing and function graph tracing are both enabled (in different instances) the "parent" of some of the function tracing events is "return_to_handler", which is the trampoline used by function graph tracing. To fix this, ftrace_get_true_parent_ip() was introduced to return the "true" parent ip instead of the trampoline. To do this, ftrace_regs_get_stack_pointer() is used, which uses kernel_stack_pointer().

The problem is that microblaze does not implement kernel_stack_pointer(), so when function graph tracing is enabled, the build fails. But microblaze also does not enable HAVE_DYNAMIC_FTRACE_WITH_ARGS. That option has to be enabled by the architecture to reliably get the values from the fregs parameter passed in. When that config is not set, the architecture can also pass in NULL, which is not tested for in that function and could cause the kernel to crash.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Jeff Xie <jeff.xie@linux.dev>
Link: https://lore.kernel.org/20241216164633.6df18e87@gandalf.local.home
Fixes: 60b1f578b578 ("ftrace: Get the true parent ip for function tracer")
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
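The shape of the fix is presumably a config-guarded helper along these lines (a sketch, not the exact patch; true_parent_from_stack() is a hypothetical stand-in for the stack walk):

    #ifdef CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS
    static unsigned long get_true_parent_ip(unsigned long parent_ip,
                                            struct ftrace_regs *fregs)
    {
            /* fregs is reliable when the arch selects this option. */
            unsigned long sp = ftrace_regs_get_stack_pointer(fregs);

            /* Walk past the return_to_handler trampoline. */
            return true_parent_from_stack(parent_ip, sp);
    }
    #else
    static inline unsigned long get_true_parent_ip(unsigned long parent_ip,
                                                   struct ftrace_regs *fregs)
    {
            /* fregs may be NULL here, so never dereference it. */
            return parent_ip;
    }
    #endif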
2024-12-16  fgraph: Still initialize idle shadow stacks when starting (Steven Rostedt)
A bug was discovered where the idle shadow stacks were not initialized for offline CPUs when starting the function graph tracer, and when they came online they were not traced due to the missing shadow stack. To fix this, the idle task shadow stack initialization was moved to using the CPU hotplug callbacks. But that removed the initialization that happened when function graph was enabled.

The problem here is that the hotplug callbacks are called when the CPUs come online, but the idle shadow stack initialization only happens if function graph is currently active. This caused the already-online CPUs to not get their shadow stacks initialized.

The idle shadow stack initialization still needs to be done when function graph is registered, as the stacks will not be allocated if function graph is not registered.

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20241211135335.094ba282@batman.local.home
Fixes: 2c02f7375e65 ("fgraph: Use CPU hotplug mechanism to initialize idle shadow stacks")
Reported-by: Linus Walleij <linus.walleij@linaro.org>
Tested-by: Linus Walleij <linus.walleij@linaro.org>
Closes: https://lore.kernel.org/all/CACRpkdaTBrHwRbbrphVy-=SeDz6MSsXhTKypOtLrTQ+DgGAOcQ@mail.gmail.com/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-12-16  Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf (Alexei Starovoitov)
Cross-merge bpf fixes after downstream PR. No conflicts.

Adjacent changes in:

Auto-merging include/linux/bpf.h
Auto-merging include/linux/bpf_verifier.h
Auto-merging kernel/bpf/btf.c
Auto-merging kernel/bpf/verifier.c
Auto-merging kernel/trace/bpf_trace.c
Auto-merging tools/testing/selftests/bpf/progs/test_tp_btf_nullable.c

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-16  printk: Defer legacy printing when holding printk_cpu_sync (John Ogness)
The documentation of printk_cpu_sync_get() clearly states that the owner must never perform any activities where it waits for a CPU. For legacy printing there can be spinning on the console_lock and on the port lock. Therefore legacy printing must be deferred when holding the printk_cpu_sync. Note that in the case of emergency states, atomic consoles are not prevented from printing when printk is deferred. This is appropriate because they do not spin-wait indefinitely for other CPUs. Reported-by: Rik van Riel <riel@surriel.com> Closes: https://lore.kernel.org/r/20240715232052.73eb7fb1@imladris.surriel.com Signed-off-by: John Ogness <john.ogness@linutronix.de> Fixes: 55d6af1d6688 ("lib/nmi_backtrace: explicitly serialize banner and regs") Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20241209111746.192559-3-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-12-16  printk: Remove redundant deferred check in vprintk() (John Ogness)
The helper printk_get_console_flush_type() is already calling is_printk_legacy_deferred() to determine if legacy printing is to be offloaded. Therefore there is no need for vprintk() to perform this check as well. Remove the redundant check from vprintk(). Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20241209111746.192559-2-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-12-15  lockdep: Document MAX_LOCKDEP_CHAIN_HLOCKS calculation (Carlos Llamas)
Define a macro AVG_LOCKDEP_CHAIN_DEPTH to document the magic number '5' used in the calculation of MAX_LOCKDEP_CHAIN_HLOCKS. The number represents the estimated average depth (number of locks held) of a lock chain. The calculation of MAX_LOCKDEP_CHAIN_HLOCKS was first added in commit 443cd507ce7f ("lockdep: add lock_class information to lock_chain and output it"). Suggested-by: Waiman Long <longman@redhat.com> Cc: Huang Ying <ying.huang@intel.com> Cc: J. R. Okajima <hooanon05g@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Will Deacon <will@kernel.org> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Carlos Llamas <cmllamas@google.com> Acked-by: "Huang, Ying" <ying.huang@intel.com> Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Link: https://lore.kernel.org/r/20241024183631.643450-4-cmllamas@google.com
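Based on that description, the documented calculation presumably becomes (a sketch):

    #define AVG_LOCKDEP_CHAIN_DEPTH   5
    #define MAX_LOCKDEP_CHAIN_HLOCKS  (MAX_LOCKDEP_CHAINS * AVG_LOCKDEP_CHAIN_DEPTH)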
2024-12-15  locking/ww_mutex/test: Use swap() macro (Thorsten Blum)
Fixes the following Coccinelle/coccicheck warning reported by swap.cocci:

  WARNING opportunity for swap()

Compile-tested only.

[Boqun: Add the report tags from Jiapeng and Abaci Robot [1].]

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Reported-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=11531
Link: https://lore.kernel.org/r/20241025081455.55089-1-jiapeng.chong@linux.alibaba.com [1]
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Link: https://lore.kernel.org/r/20240731135850.81018-2-thorsten.blum@toblux.com
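The transformation swap.cocci proposes is mechanical; in general form (illustrative):

    /* before */
    tmp = a;
    a = b;
    b = tmp;

    /* after, using swap() from <linux/minmax.h> */
    swap(a, b);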
2024-12-15  Merge tag 'sched_urgent_for_v6.13_rc3-p2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip (Linus Torvalds)

Pull scheduler fixes from Borislav Petkov:

 - Prevent incorrect dequeueing of the deadline dlserver helper task and fix its time accounting

 - Properly track the CFS runqueue runnable stats

 - Check the total number of all queued tasks in a sched fair's runqueue hierarchy before deciding to stop the tick

 - Fix the scheduling of the task that got woken last (NEXT_BUDDY) by preventing those from being delayed

* tag 'sched_urgent_for_v6.13_rc3-p2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/dlserver: Fix dlserver time accounting
  sched/dlserver: Fix dlserver double enqueue
  sched/eevdf: More PELT vs DELAYED_DEQUEUE
  sched/fair: Fix sched_can_stop_tick() for fair tasks
  sched/fair: Fix NEXT_BUDDY
2024-12-14  Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf (Linus Torvalds)
Pull bpf fixes from Daniel Borkmann:

 - Fix a bug in the BPF verifier to track changes to packet data property for global functions (Eduard Zingerman)

 - Fix a theoretical BPF prog_array use-after-free in RCU handling of __uprobe_perf_func (Jann Horn)

 - Fix BPF tracing to have an explicit list of tracepoints and their arguments which need to be annotated as PTR_MAYBE_NULL (Kumar Kartikeya Dwivedi)

 - Fix a logic bug in the bpf_remove_insns code where a potential error would have been wrongly propagated (Anton Protopopov)

 - Avoid deadlock scenarios caused by nested kprobe and fentry BPF programs (Priya Bala Govindasamy)

 - Fix a bug in BPF verifier which was missing a size check for BTF-based context access (Kumar Kartikeya Dwivedi)

 - Fix a crash found by syzbot through an invalid BPF prog_array access in perf_event_detach_bpf_prog (Jiri Olsa)

 - Fix several BPF sockmap bugs including a race causing a refcount imbalance upon element replace (Michal Luczaj)

 - Fix a use-after-free from mismatching BPF program/attachment RCU flavors (Jann Horn)

* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: (23 commits)
  bpf: Avoid deadlock caused by nested kprobe and fentry bpf programs
  selftests/bpf: Add tests for raw_tp NULL args
  bpf: Augment raw_tp arguments with PTR_MAYBE_NULL
  bpf: Revert "bpf: Mark raw_tp arguments with PTR_MAYBE_NULL"
  selftests/bpf: Add test for narrow ctx load for pointer args
  bpf: Check size for BTF-based ctx access of pointer members
  selftests/bpf: extend changes_pkt_data with cases w/o subprograms
  bpf: fix null dereference when computing changes_pkt_data of prog w/o subprogs
  bpf: Fix theoretical prog_array UAF in __uprobe_perf_func()
  bpf: fix potential error return
  selftests/bpf: validate that tail call invalidates packet pointers
  bpf: consider that tail calls invalidate packet pointers
  selftests/bpf: freplace tests for tracking of changes_packet_data
  bpf: check changes_pkt_data property for extension programs
  selftests/bpf: test for changing packet data from global functions
  bpf: track changes_pkt_data property for global functions
  bpf: refactor bpf_helper_changes_pkt_data to use helper number
  bpf: add find_containing_subprog() utility function
  bpf,perf: Fix invalid prog_array access in perf_event_detach_bpf_prog
  bpf: Fix UAF via mismatching bpf_prog/attachment RCU flavors
  ...
2024-12-14  bpf: Avoid deadlock caused by nested kprobe and fentry bpf programs (Priya Bala Govindasamy)
BPF program types like kprobe and fentry can cause deadlocks in certain situations. If a function takes a lock and one of these bpf programs is hooked to some point in the function's critical section, and if the bpf program tries to call the same function and take the same lock it will lead to deadlock. These situations have been reported in the following bug reports.

In percpu_freelist -
Link: https://lore.kernel.org/bpf/CAADnVQLAHwsa+2C6j9+UC6ScrDaN9Fjqv1WjB1pP9AzJLhKuLQ@mail.gmail.com/T/
Link: https://lore.kernel.org/bpf/CAPPBnEYm+9zduStsZaDnq93q1jPLqO-PiKX9jy0MuL8LCXmCrQ@mail.gmail.com/T/

In bpf_lru_list -
Link: https://lore.kernel.org/bpf/CAPPBnEajj+DMfiR_WRWU5=6A7KKULdB5Rob_NJopFLWF+i9gCA@mail.gmail.com/T/
Link: https://lore.kernel.org/bpf/CAPPBnEZQDVN6VqnQXvVqGoB+ukOtHGZ9b9U0OLJJYvRoSsMY_g@mail.gmail.com/T/
Link: https://lore.kernel.org/bpf/CAPPBnEaCB1rFAYU7Wf8UxqcqOWKmRPU1Nuzk3_oLk6qXR7LBOA@mail.gmail.com/T/

Similar bugs have been reported by syzbot.

In queue_stack_maps -
Link: https://lore.kernel.org/lkml/0000000000004c3fc90615f37756@google.com/
Link: https://lore.kernel.org/all/20240418230932.2689-1-hdanton@sina.com/T/

In lpm_trie -
Link: https://lore.kernel.org/linux-kernel/00000000000035168a061a47fa38@google.com/T/

In ringbuf -
Link: https://lore.kernel.org/bpf/20240313121345.2292-1-hdanton@sina.com/T/

Prevent kprobe and fentry bpf programs from attaching to these critical sections by removing CC_FLAGS_FTRACE for the percpu_freelist.o, bpf_lru_list.o, queue_stack_maps.o, lpm_trie.o, and ringbuf.o files.

The bugs reported by syzbot are due to tracepoint bpf programs being called in the critical sections. This patch does not aim to fix deadlocks caused by tracepoint programs. However, it does prevent deadlocks from occurring in similar situations due to kprobe and fentry programs.

Signed-off-by: Priya Bala Govindasamy <pgovind2@uci.edu>
Link: https://lore.kernel.org/r/CAPPBnEZpjGnsuA26Mf9kYibSaGLm=oF6=12L21X1GEQdqjLnzQ@mail.gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
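The usual kernel idiom for stripping ftrace instrumentation from individual objects, sketched for kernel/bpf/Makefile (file list per the description above; exact placement is an assumption):

    CFLAGS_REMOVE_percpu_freelist.o = $(CC_FLAGS_FTRACE)
    CFLAGS_REMOVE_bpf_lru_list.o = $(CC_FLAGS_FTRACE)
    CFLAGS_REMOVE_queue_stack_maps.o = $(CC_FLAGS_FTRACE)
    CFLAGS_REMOVE_lpm_trie.o = $(CC_FLAGS_FTRACE)
    CFLAGS_REMOVE_ringbuf.o = $(CC_FLAGS_FTRACE)

Without mcount/fentry call sites in these objects, there is no attachment point for kprobe or fentry programs to reach the critical sections.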
2024-12-14  Merge branches 'fixes.2024.12.14a', 'rcutorture.2024.12.14a', 'srcu.2024.12.14a' and 'torture-test.2024.12.14a' into rcu-merge.2024.12.14a (Uladzislau Rezki (Sony))

fixes.2024.12.14a: RCU fixes
rcutorture.2024.12.14a: Torture-test updates
srcu.2024.12.14a: SRCU updates
torture-test.2024.12.14a: Adding an extra test, fixes
2024-12-14  srcu: Remove redundant GP sequence checks in srcu_funnel_gp_start (Feng Lee)
We will perform GP sequence checking at the beginning of srcu_gp_start, thus making it safe to remove duplicate GP sequence checks prior to calling srcu_gp_start. Signed-off-by: Feng Lee <379943137@qq.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
2024-12-14  srcu: Guarantee non-negative return value from srcu_read_lock() (Paul E. McKenney)
For almost 20 years, the int return value from srcu_read_lock() has been always either zero or one. This commit therefore documents the fact that it will be non-negative, and does the same for the underlying __srcu_read_lock(). [ paulmck: Apply Andrii Nakryiko feedback. ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
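The guarantee lets callers rely on the index where negative values are reserved for errors, e.g. (illustrative usage; my_srcu is a placeholder srcu_struct):

    int idx;

    idx = srcu_read_lock(&my_srcu);    /* now documented to be >= 0 */
    /* SRCU read-side critical section */
    srcu_read_unlock(&my_srcu, idx);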
2024-12-14  rcu: Add lockdep_assert_irqs_disabled() to rcu_exp_need_qs() (Paul E. McKenney)
Callers to rcu_exp_need_qs() are supposed to disable interrupts, so this commit enlists lockdep's aid in checking this. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
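The assertion presumably sits at the top of the function, along these lines (a sketch, not the exact patch):

    static void rcu_exp_need_qs(void)
    {
            lockdep_assert_irqs_disabled();  /* splat if a caller left irqs enabled */
            /* ... record the need for an expedited quiescent state ... */
    }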
2024-12-14  rcu: Add KCSAN exclusive-writer assertions for rdp->cpu_no_qs.b.exp (Paul E. McKenney)
The value of rdp->cpu_no_qs.b.exp may be changed only by the corresponding CPU, and that CPU is not even allowed to race with itself, for example, via interrupt handlers. This commit therefore adds KCSAN exclusive-writer assertions to check this constraint. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
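With KCSAN, an exclusive-writer assertion on the field looks like this (sketch; exact placement in the RCU code is an assumption):

    /* KCSAN splats if any other CPU or thread writes the field concurrently. */
    ASSERT_EXCLUSIVE_WRITER(rdp->cpu_no_qs.b.exp);
    WRITE_ONCE(rdp->cpu_no_qs.b.exp, false);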
2024-12-14  rcu: Make preemptible rcu_exp_handler() check idempotency (Paul E. McKenney)
Although the non-preemptible implementation of rcu_exp_handler() contains checks to enforce idempotency, the preemptible version does not. The reason for this omission is that in preemptible kernels, there is no reporting of quiescent states from CPU hotplug notifiers, and thus no need for idempotency. In theory, anyway. In practice, accidents happen. This commit therefore adds checks under WARN_ON_ONCE() to catch any such accidents. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
2024-12-14  rcu: Replace open-coded rcu_exp_need_qs() from rcu_exp_handler() with call (Paul E. McKenney)
Currently, the preemptible implementation of rcu_exp_handler() almost open-codes rcu_exp_need_qs(). A call to that function would be shorter and would improve expediting in cases where rcu_exp_handler() interrupted a preemption-disabled or bh-disabled region of code. This commit therefore moves rcu_exp_need_qs() out of the non-preemptible leg of the enclosing #ifdef and replaces the open coding in preemptible rcu_exp_handler() with a call to rcu_exp_need_qs(). Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
2024-12-14  rcu: Move rcu_report_exp_rdp() setting of ->cpu_no_qs.b.exp under lock (Paul E. McKenney)
This commit reduces the state space of rcu_report_exp_rdp() by moving the setting of ->cpu_no_qs.b.exp under the rcu_node structure's ->lock. The lock isn't really all that important here, given that this per-CPU field is supposed to be written only by its CPU, but the disabling of interrupts excludes things like rcu_exp_handler(), which also can write to this same field. Avoiding this sort of interleaved access reduces the state space. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
2024-12-14  rcu: Make rcu_report_exp_cpu_mult() caller acquire lock (Paul E. McKenney)
There is a hard-to-trigger bug in the expedited grace-period computation whose fix requires that the __sync_rcu_exp_select_node_cpus() function check that the grace-period sequence number has not changed before invoking rcu_report_exp_cpu_mult(). However, this check must be done while holding the leaf rcu_node structure's ->lock. This commit therefore prepares for that fix by moving this lock's acquisition from rcu_report_exp_cpu_mult() to its callers (all two of them). Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
2024-12-14  rcu: Report callbacks enqueued on offline CPU blind spot (Frederic Weisbecker)
Callbacks enqueued after rcutree_report_cpu_dead() fall into the RCU barrier blind spot. Report any potential misuse. Reported-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
2024-12-14  rcutorture: Use symbols for SRCU reader flavors (Paul E. McKenney)
This commit converts rcutorture.c values for the reader_flavor module parameter from hexadecimal to the SRCU_READ_FLAVOR_* C-preprocessor macros. The actual modprobe or kernel-boot-parameter values for reader_flavor must still be entered in hexadecimal. Link: https://lore.kernel.org/all/c48c9dca-fe07-4833-acaa-28c827e5a79e@amd.com/ Suggested-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
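In rcutorture.c this presumably reads like the following sketch (the default value shown is an assumption, as is the exact help string):

    /* Symbolic flavor instead of a bare hexadecimal literal. */
    torture_param(int, reader_flavor, SRCU_READ_FLAVOR_NORMAL,
                  "Reader flavors to use, one per bit.");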
2024-12-14  rcutorture: Add per-reader-segment preemption diagnostics (Paul E. McKenney)
For preemptible RCU, this commit adds an indication, for each reader segment, of whether the rcu_torture_reader() task was on the ->blkd_tasks lists, though only in kernels built with CONFIG_RCU_TORTURE_TEST_LOG_CPU=y. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
2024-12-14  rcutorture: Read CPU ID for decoration protected by both reader types (Paul E. McKenney)
Currently, rcutorture_one_extend() reads the CPU ID before making any change to the type of RCU reader. This can be confusing because the properties of the code from which the CPU ID is read are not those of the reader segment that this same CPU ID is listed with. This commit therefore causes rcutorture_one_extend() to read the CPU ID just after the new protections have been added, but before the old protections have been removed. With this change in place, all of the protections of a given reader segment apply from the reading of one CPU ID to the reading of the next. This change therefore also allows a single read of the CPU ID to work for both the old and the new reader segment. And this dual use of a single read of the CPU ID avoids inflicting any additional heisenbugs. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
2024-12-14  rcutorture: Add preempt_count() to rcutorture_one_extend_check() diagnostics (Paul E. McKenney)
This commit adds the value of preempt_count() to the diagnostics produced by rcutorture_one_extend_check() to improve debugging. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
2024-12-14  rcutorture: Add parameters to control polled/conditional wait interval (Paul E. McKenney)
This commit adds rcutorture module parameters gp_cond_wi, gp_cond_wi_exp, gp_poll_wi, and gp_poll_wi_exp to control the wait interval for conditional, conditional expedited, polled, and polled expedited grace periods, respectively. When rcu_torture_writer() is testing these types of grace periods, hrtimers are used to randomly wait up to the specified number of microseconds, but with nanosecond granularity. In the case of conditional grace periods (get_state_synchronize_rcu() and cond_synchronize_rcu(), for example) there is just one wait. For polled grace periods (start_poll_synchronize_rcu() and poll_state_synchronize_rcu(), for example), there is a repeated series of waits until the grace period ends. For normal grace periods, the default is 16 jiffies (for example, 16,000 microseconds on a HZ=1000 system) and for expedited grace periods the default is 128 microseconds. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
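An illustrative kernel boot line exercising the new parameters (the values are examples only, in microseconds per the description above):

    rcutorture.gp_cond_wi=16000 rcutorture.gp_cond_wi_exp=128 \
    rcutorture.gp_poll_wi=16000 rcutorture.gp_poll_wi_exp=128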
2024-12-14  rcutorture: Ignore attempts to test preemption and forward progress (Paul E. McKenney)
Use of the rcutorture preempt_duration and the default-on fwd_progress kernel parameters can result in preemption of callback processing during forward-progress testing, which is an excellent way to OOM your test if your kernel offloads RCU callbacks. This commit therefore treats preempt_duration in the same way as stall_cpu in CONFIG_RCU_NOCB_CPU=y kernels, prohibiting fwd_progress testing and splatting when rcutorture is built in (as opposed to being a loadable module). Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
2024-12-14  rcutorture: Make rcutorture_one_extend() check reader state (Paul E. McKenney)
This commit adds reader-state debugging checks to a new function named rcutorture_one_extend_check(), which is invoked before and after setting new reader states by the existing rcutorture_one_extend() function. These checks have proven to be rather heavyweight, reducing reproduction rate of some failures by a factor of two. They are therefore hidden behind a new RCU_TORTURE_TEST_CHK_RDR_STATE Kconfig option. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Frederic Weisbecker <frederic@kernel.org> Tested-by: kernel test robot <oliver.sang@intel.com> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
2024-12-14  rcutorture: Pretty-print rcutorture reader segments (Paul E. McKenney)
The current "Failure/close-call rcutorture reader segments" output is good and sufficient, but annoying when you have to interpret several tens of them after an all-night rcutorture run. This commit therefore makes them a bit more human-readable. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
2024-12-14  rcutorture: Add full read-side contexts to "busted" torture type (Paul E. McKenney)
The purpose of the "busted" torture type is to test rcutorture code paths used only when a too-short grace period is detected. Currently, "busted" only uses normal rcu_read_lock()-style readers, which fails to exercise much of the "Failure/close-call rcutorture reader segments" functionality. This commit therefore sets the .extendables field of rcu_busted_ops to RCUTORTURE_MAX_EXTEND in order to more fully exercise the reporting. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
2024-12-14  rcutorture: Decorate failing reader segments with last CPU ID (Paul E. McKenney)
In kernels built with CONFIG_RCU_TORTURE_TEST_LOG_CPU=y, the CPU is logged at the beginning of each reader segment. This commit further logs it at the end of the full set of reader segments in order to show any migration that might have occurred during the last reader segment. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
2024-12-14  rcutorture: Check preemption for failing reader (Paul E. McKenney)
This commit checks to see if the RCU reader has been preempted within its read-side critical section for RCU flavors supporting this notion (currently only preemptible RCU). If such a preemption occurred, then this is printed at the end of the "Failure/close-call rcutorture reader segments" list at the end of the rcutorture run. [ paulmck: Apply kernel test robot feedback. ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Frederic Weisbecker <frederic@kernel.org> Tested-by: kernel test robot <oliver.sang@intel.com> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
2024-12-14  rcutorture: Add ->cond_sync_exp_full function to rcu_ops structure (Paul E. McKenney)
The rcu_ops structure currently lacks a ->cond_sync_exp_full function, which prevents testing of conditional full-state polled grace periods. This commit therefore adds it, enabling testing of this option. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
2024-12-14  rcutorture: Use finer-grained timeouts for rcu_torture_writer() polling (Paul E. McKenney)
The rcu_torture_writer() polling currently uses timeouts ranging from zero to 16 milliseconds to wait for the polled grace period to end. This works, but it would be better to have a higher probability of exercising races with the code that cleans up after a grace period. This commit therefore switches from these millisecond-scale timeouts to timeouts ranging from zero to 128 microseconds, and with a full microsecond's worth of timeout fuzz. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
2024-12-14  rcutorture: Decorate failing reader segments with CPU ID (Paul E. McKenney)
This commit adds CPU number to the "Failure/close-call rcutorture reader segments" list printed at the end of an rcutorture run that had too-short grace periods. This information can help debugging interactions with migration and CPU hotplug. However, experience indicates that sampling the CPU number in rcutorture's read-side code can reduce the probability of too-short bugs by a small integer factor. And small integer factors are crucial to RCU bug hunting, so this commit also introduces a default-off RCU_TORTURE_TEST_LOG_CPU Kconfig option to enable this CPU-number-logging functionality at build time. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
2024-12-14  rcutorture: Add random real-time preemption (Paul E. McKenney)
This commit adds the rcutorture.preempt_duration kernel module parameter, which gives the real-time preemption duration in milliseconds (zero to disable, which is the default) and also the rcutorture.preempt_interval module parameter, which gives the interval between successive preemptions, also in milliseconds, defaulting to one second. The CPU to preempt is chosen at random from those online at that time. Races between preempting a given CPU and that CPU going offline are ignored, and preemption is forgone when this occurs. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
2024-12-14  torture: Add dowarn argument to torture_sched_setaffinity() (Paul E. McKenney)
Current use cases of torture_sched_setaffinity() are well served by its unconditional warning on error. However, an upcoming use case for a preemption kthread needs to avoid warnings that might otherwise arise when that kthread attempts to bind itself to a CPU that is on its way offline. This commit therefore adds a dowarn argument that, when false, suppresses the warning. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
2024-12-14  refscale: Add test for sched_clock() (Paul E. McKenney)
This commit adds a "sched-clock" test for the sched_clock() function. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
2024-12-14  pidfs: rework inode number allocation (Christian Brauner)
Recently we received a patchset that aims to enable file handle encoding and decoding via name_to_handle_at(2) and open_by_handle_at(2).

A crucial step in the patch series is how to go from inode number to struct pid without leaking information into unprivileged contexts. The issue is that in order to find a struct pid the pid number in the initial pid namespace must be encoded into the file handle via name_to_handle_at(2). This can be used by containers using a separate pid namespace to learn what the pid number of a given process in the initial pid namespace is. While this is a weak information leak it could be used in various exploits and in general is an ugly wart in the design.

To solve this problem a new way is needed to look up a struct pid based on the inode number allocated for that struct pid. The other part is to remove the custom inode number allocation on 32bit systems, which is also an ugly wart that should go away.

So, a new scheme is used that I was discussing with Tejun some time back. A cyclic ida is used for the lower 32 bits and the high 32 bits are used for the generation number. This gives a 64 bit inode number that is unique on both 32 bit and 64 bit. The lower 32 bit number is recycled slowly and can be used to look up struct pids.

Link: https://lore.kernel.org/r/20241129-work-pidfs-v2-1-61043d66fbce@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
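A hedged sketch of that scheme (simplified; locking and edge cases are omitted, and the variable names are illustrative, not the actual pidfs code):

    /* Lower 32 bits: slowly recycled cyclic allocation, starting at 2. */
    id = idr_alloc_cyclic(&pidfs_idr, pid, 2, INT_MAX, GFP_ATOMIC);
    if (id >= 0 && id < last_id)      /* allocation wrapped around */
            pidfs_gen++;
    last_id = id;

    /* 64 bit inode number, unique on both 32 bit and 64 bit. */
    ino = ((u64)pidfs_gen << 32) | id;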
2024-12-13  bpf: Augment raw_tp arguments with PTR_MAYBE_NULL (Kumar Kartikeya Dwivedi)
Arguments to a raw tracepoint are tagged as trusted, which carries the semantics that the pointer will be non-NULL. However, in certain cases, a raw tracepoint argument may end up being NULL. More context about this issue is available in [0].

Thus, there is a discrepancy between the reality, that raw_tp arguments can actually be NULL, and the verifier's knowledge, that they are never NULL, causing the explicit NULL check branch to be dead code eliminated.

A previous attempt [1], i.e. the second fixed commit, was made to simulate symbolic execution as if in most accesses, the argument is a non-NULL raw_tp, except for conditional jumps. This tried to suppress branch prediction while preserving compatibility, but surfaced issues with production programs that were difficult to solve without increasing verifier complexity. A more complete discussion of issues and fixes is available at [2].

Fix this by maintaining an explicit list of tracepoints where the arguments are known to be NULL, and mark the positional arguments as PTR_MAYBE_NULL. Additionally, capture the tracepoints where arguments are known to be ERR_PTR, and mark these arguments as scalar values to prevent potential dereference.

Each hex digit is used to encode NULL-ness (0x1) or ERR_PTR-ness (0x2), shifted by the zero-indexed argument number x 4. This can be represented as follows:

  1st arg: 0x1
  2nd arg: 0x10
  3rd arg: 0x100
  ... and so on (likewise for the ERR_PTR case).

In the future, an automated pass will be used to produce such a list, or insert __nullable annotations automatically for tracepoints. Each compilation unit will be analyzed and results will be collated to find whether a tracepoint pointer is definitely not null, maybe null, or in an unknown state, where the verifier conservatively marks it PTR_MAYBE_NULL.  A proof of concept of this tool from Eduard is available at [3].

Note that in case we don't find a specification in the raw_tp_null_args array and the tracepoint belongs to a kernel module, we will conservatively mark the arguments as PTR_MAYBE_NULL. This is because unlike for in-tree modules, out-of-tree module tracepoints may pass NULL freely to the tracepoint. We don't protect against such tracepoints passing ERR_PTR (which is uncommon anyway), lest we mark all such arguments as SCALAR_VALUE.

While we are at it, let's adjust the test raw_tp_null to not perform a dereference of skb->mark, as that won't be allowed anymore, and make it more robust by using inline assembly to test the dead code elimination behavior, which should still stay the same.

[0]: https://lore.kernel.org/bpf/ZrCZS6nisraEqehw@jlelli-thinkpadt14gen4.remote.csb
[1]: https://lore.kernel.org/all/20241104171959.2938862-1-memxor@gmail.com
[2]: https://lore.kernel.org/bpf/20241206161053.809580-1-memxor@gmail.com
[3]: https://github.com/eddyz87/llvm-project/tree/nullness-for-tracepoint-params

Reported-by: Juri Lelli <juri.lelli@redhat.com> # original bug
Reported-by: Manu Bretelle <chantra@meta.com> # bugs in masking fix
Fixes: 3f00c5239344 ("bpf: Allow trusted pointers to be passed to KF_TRUSTED_ARGS kfuncs")
Fixes: cb4158ce8ec8 ("bpf: Mark raw_tp arguments with PTR_MAYBE_NULL")
Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
Co-developed-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20241213221929.3495062-3-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
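As a hypothetical example of the encoding (the entry shape and tracepoint name are illustrative, not copied from the actual raw_tp_null_args table): a tracepoint whose first argument may be NULL, second may be ERR_PTR, and fourth may be NULL would be listed as:

    /* 0x1 | (0x2 << 4) | (0x1 << 12) == 0x1021 */
    { "some_tracepoint", 0x1 | (0x2 << 4) | (0x1 << 12) },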
2024-12-13  bpf: Revert "bpf: Mark raw_tp arguments with PTR_MAYBE_NULL" (Kumar Kartikeya Dwivedi)
This patch reverts commit cb4158ce8ec8 ("bpf: Mark raw_tp arguments with PTR_MAYBE_NULL"). The patch was well-intended and meant to be a stop-gap fixing branch prediction when the pointer may actually be NULL at runtime. Eventually, it was supposed to be replaced by an automated script or compiler pass detecting possibly NULL arguments and marking them accordingly.

However, it caused two main issues observed for production programs and failed to preserve backwards compatibility. First, programs relied on the verifier not exploring the == NULL branch when the pointer is not NULL, thus they started failing with a 'dereference of scalar' error. Next, allowing raw_tp arguments to be modified surfaced the warning in the verifier that warns against reg->off when PTR_MAYBE_NULL is set. More information, context, and discussion on both problems is available in [0].

Overall, this approach had several shortcomings, the fixes would further complicate the verifier's logic, and the entire masking scheme would have to be removed eventually anyway. Hence, revert the patch in preparation for a better fix avoiding these issues to replace this commit.

[0]: https://lore.kernel.org/bpf/20241206161053.809580-1-memxor@gmail.com

Reported-by: Manu Bretelle <chantra@meta.com>
Fixes: cb4158ce8ec8 ("bpf: Mark raw_tp arguments with PTR_MAYBE_NULL")
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20241213221929.3495062-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-13  bpf: Fix configuration-dependent BTF function references (Thomas Weißschuh)
These BTF functions are not available unconditionally; only reference them when they are available.

Avoid the following build warnings:

  BTF     .tmp_vmlinux1.btf.o
btf_encoder__tag_kfunc: failed to find kfunc 'bpf_send_signal_task' in BTF
btf_encoder__tag_kfuncs: failed to tag kfunc 'bpf_send_signal_task'
  NM      .tmp_vmlinux1.syms
  KSYMS   .tmp_vmlinux1.kallsyms.S
  AS      .tmp_vmlinux1.kallsyms.o
  LD      .tmp_vmlinux2
  NM      .tmp_vmlinux2.syms
  KSYMS   .tmp_vmlinux2.kallsyms.S
  AS      .tmp_vmlinux2.kallsyms.o
  LD      vmlinux
  BTFIDS  vmlinux
WARN: resolve_btfids: unresolved symbol prog_test_ref_kfunc
WARN: resolve_btfids: unresolved symbol bpf_crypto_ctx
WARN: resolve_btfids: unresolved symbol bpf_send_signal_task
WARN: resolve_btfids: unresolved symbol bpf_modify_return_test_tp
WARN: resolve_btfids: unresolved symbol bpf_dynptr_from_xdp
WARN: resolve_btfids: unresolved symbol bpf_dynptr_from_skb

Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241213-bpf-cond-ids-v1-1-881849997219@weissschuh.net
2024-12-13  bpf: Add fd_array_cnt attribute for prog_load (Anton Protopopov)
The fd_array attribute of the BPF_PROG_LOAD syscall may contain a set of file descriptors: maps or btfs. This field was introduced as a sparse array. Introduce a new attribute, fd_array_cnt, which, if present, indicates that the fd_array is a continuous array of the corresponding length. If fd_array_cnt is non-zero, then every map in the fd_array will be bound to the program, as if it was used by the program. This functionality is similar to the BPF_PROG_BIND_MAP syscall, but such maps can be used by the verifier during the program load. Signed-off-by: Anton Protopopov <aspsk@isovalent.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20241213130934.1087929-5-aspsk@isovalent.com
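A hedged userspace sketch of a load that passes a continuous fd_array (the surrounding setup, insns/insn_cnt and the fd values, is assumed to exist; error handling omitted):

    int fds[] = { map_fd_a, map_fd_b, btf_fd };    /* fds obtained earlier */
    union bpf_attr attr = {};

    attr.prog_type    = BPF_PROG_TYPE_SOCKET_FILTER;
    attr.insns        = (__u64)(unsigned long)insns;
    attr.insn_cnt     = insn_cnt;
    attr.license      = (__u64)(unsigned long)"GPL";
    attr.fd_array     = (__u64)(unsigned long)fds;
    attr.fd_array_cnt = ARRAY_SIZE(fds);    /* continuous array; maps get bound */

    prog_fd = syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));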
2024-12-13  bpf: Refactor check_pseudo_btf_id (Anton Protopopov)
Introduce a helper to add btfs to the env->used_maps array. Use it to simplify the check_pseudo_btf_id() function. This new helper will also be re-used in a subsequent patch. Signed-off-by: Anton Protopopov <aspsk@isovalent.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20241213130934.1087929-4-aspsk@isovalent.com
2024-12-13  bpf: Move map/prog compatibility checks (Anton Protopopov)
Move some inlined map/prog compatibility checks from the resolve_pseudo_ldimm64() function to the dedicated check_map_prog_compatibility() function. Call the latter function from the add_used_map_from_fd() function directly. This simplifies the code and optimizes the logic a bit, as before these changes the check_map_prog_compatibility() function was executed on every map usage, which doesn't make sense, as it doesn't include any per-instruction checks, only map type vs. prog type. (This patch also simplifies a subsequent patch which will call the add_used_map_from_fd() function from another code path.) Signed-off-by: Anton Protopopov <aspsk@isovalent.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20241213130934.1087929-3-aspsk@isovalent.com
2024-12-13  bpf: Add a __btf_get_by_fd helper (Anton Protopopov)
Add a new helper to get a pointer to a struct btf from a file descriptor. This helper doesn't increase a refcnt. Add a comment explaining this and pointing to a corresponding function which does take a reference. Signed-off-by: Anton Protopopov <aspsk@isovalent.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20241213130934.1087929-2-aspsk@isovalent.com
2024-12-13  sched_ext: Use sizeof_field for key_len in dsq_hash_params (Liang Jie)
Update the `dsq_hash_params` initialization to use `sizeof_field` for the `key_len` field instead of a hardcoded value. This improves code readability and ensures the key length dynamically matches the size of the `id` field in the `scx_dispatch_q` structure. Signed-off-by: Liang Jie <liangjie@lixiang.com> Signed-off-by: Tejun Heo <tj@kernel.org>
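The resulting initializer presumably looks like this (a sketch consistent with the description; the key_offset and head_offset fields and the hash_node member name are shown for context and are best-effort assumptions):

    static const struct rhashtable_params dsq_hash_params = {
            .key_len     = sizeof_field(struct scx_dispatch_q, id),
            .key_offset  = offsetof(struct scx_dispatch_q, id),
            .head_offset = offsetof(struct scx_dispatch_q, hash_node),
    };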
2024-12-13  sched/dlserver: Fix dlserver time accounting (Vineeth Pillai (Google))
dlserver time is accounted when:

 - dlserver is active and the dlserver proxies the cfs task.
 - dlserver is active but deferred and the cfs task runs after being picked through the normal fair class pick.

dl_server_update() is called in two places to make sure that both of the above times are accounted for. But it doesn't check if dlserver is active or not. Now that we have the dl_server_active flag, we can consolidate dl_server_update() into one place, and all we need to check is whether dlserver is active or not. When dlserver is active there are only two possible conditions:

 - dlserver is deferred.
 - the cfs task is running on behalf of dlserver.

Fixes: a110a81c52a9 ("sched/deadline: Deferrable dl server")
Signed-off-by: "Vineeth Pillai (Google)" <vineeth@bitbyteword.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> # ROCK 5B
Link: https://lore.kernel.org/r/20241213032244.877029-2-vineeth@bitbyteword.org