path: root/kernel
Age  Commit message  Author
2024-02-04  workqueue: make wq_subsys const  [Ricardo B. Marliere]
Now that the driver core can properly handle constant struct bus_type, move the wq_subsys variable to be a constant structure as well, placing it into read-only memory which can not be modified at runtime. Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Suggested-and-reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ricardo B. Marliere <ricardo@marliere.net> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-02-04  workqueue: Fix pwq->nr_in_flight corruption in try_to_grab_pending()  [Tejun Heo]
dd6c3c544126 ("workqueue: Move pwq_dec_nr_in_flight() to the end of work item handling") relocated pwq_dec_nr_in_flight() after set_work_pool_and_keep_pending(). However, the latter destroys information contained in work->data that's needed by pwq_dec_nr_in_flight() including the flush color. With flush color destroyed, flush_workqueue() can stall easily when mixed with cancel_work*() usages. This is easily triggered by running xfstests generic/001 test on xfs: INFO: task umount:6305 blocked for more than 122 seconds. ... task:umount state:D stack:13008 pid:6305 tgid:6305 ppid:6301 flags:0x00004000 Call Trace: <TASK> __schedule+0x2f6/0xa20 schedule+0x36/0xb0 schedule_timeout+0x20b/0x280 wait_for_completion+0x8a/0x140 __flush_workqueue+0x11a/0x3b0 xfs_inodegc_flush+0x24/0xf0 xfs_unmountfs+0x14/0x180 xfs_fs_put_super+0x3d/0x90 generic_shutdown_super+0x7c/0x160 kill_block_super+0x1b/0x40 xfs_kill_sb+0x12/0x30 deactivate_locked_super+0x35/0x90 deactivate_super+0x42/0x50 cleanup_mnt+0x109/0x170 __cleanup_mnt+0x12/0x20 task_work_run+0x60/0x90 syscall_exit_to_user_mode+0x146/0x150 do_syscall_64+0x5d/0x110 entry_SYSCALL_64_after_hwframe+0x6c/0x74 Fix it by stashing work_data before calling set_work_pool_and_keep_pending() and using the stashed value for pwq_dec_nr_in_flight(). Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Chandan Babu R <chandanbabu@kernel.org> Link: http://lkml.kernel.org/r/87o7cxeehy.fsf@debian-BULLSEYE-live-builder-AMD64 Fixes: dd6c3c544126 ("workqueue: Move pwq_dec_nr_in_flight() to the end of work item handling")
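A minimal sketch of the fix shape described above (signatures simplified, surrounding variables assumed in scope; not the verbatim upstream diff):

    unsigned long work_data = *work_data_bits(work);	/* stash flush color etc. */

    /* this overwrites work->data, destroying the information stashed above */
    set_work_pool_and_keep_pending(work, pool->id);

    /* account the grabbed item using the value captured before it was clobbered */
    pwq_dec_nr_in_flight(pwq, work_data);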
2024-02-02  bpf: don't emit warnings intended for global subprogs for static subprogs  [Andrii Nakryiko]
When btf_prepare_func_args() was generalized to handle both static and global subprogs, a few warnings/errors that are meant only for global subprog cases started to be emitted for static subprogs, where they are sort of expected and irrelevant. Stop polluting verifier logs with irrelevant scary-looking messages. Fixes: e26080d0da87 ("bpf: prepare btf_prepare_func_args() for handling static subprogs") Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240202190529.2374377-4-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-02-02  bpf: handle trusted PTR_TO_BTF_ID_OR_NULL in argument check logic  [Andrii Nakryiko]
Add PTR_TRUSTED | PTR_MAYBE_NULL modifiers for PTR_TO_BTF_ID to check_reg_type() to support passing trusted nullable PTR_TO_BTF_ID registers into global functions accepting `__arg_trusted __arg_nullable` arguments. This hasn't been caught earlier because tests were either passing known non-NULL PTR_TO_BTF_ID registers or known NULL (SCALAR) registers. When utilizing this functionality in complicated real-world BPF application that passes around PTR_TO_BTF_ID_OR_NULL, it became apparent that verifier rejects valid case because check_reg_type() doesn't handle this case explicitly. Existing check_reg_type() logic is already anticipating this combination, so we just need to explicitly list this combo in the switch statement. Fixes: e2b3c4ff5d18 ("bpf: add __arg_trusted global func arg tag") Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240202190529.2374377-2-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
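As a rough illustration of the case being enabled (BPF-side C using the __arg_trusted/__arg_nullable tags mentioned above; assumes vmlinux.h and bpf_helpers.h, and the subprog name is made up):

    /* Global subprog taking a trusted, possibly-NULL task pointer. */
    __noinline int subprog_check_task(struct task_struct *task
    				      __arg_trusted __arg_nullable)
    {
    	if (!task)		/* the verifier forces this NULL check */
    		return 0;
    	return task->pid;
    }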
2024-02-02  Merge tag 'trace-v6.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace  [Linus Torvalds]
Pull tracing and eventfs fixes from Steven Rostedt: - Fix the return code for ring_buffer_poll_wait() It was returning -EINVAL instead of EPOLLERR. - Zero out the tracefs_inode so that all fields are initialized. The ti->private could have had stale data, but instead of just initializing it to NULL, clear out the entire structure when it is allocated. - Fix a crash in timerlat The hrtimer was initialized at read and not open, but is canceled at close. If the file was opened and never read, the close will pass a NULL pointer to hrtimer_cancel(). - Rewrite of eventfs. Linus wrote a patch series to remove the dentry references in the eventfs_inode and to use ref counting and more of proper VFS interfaces to make it work. - Add warning to put_ei() if ei is not set to free. That means something is about to free it when it shouldn't. - Restructure the eventfs_inode to make it more compact, and remove the unused llist field. - Remove the fsnotify*() functions for when the inodes were being created in the lookup code. It doesn't make sense to notify about creation just because something is being looked up. - The inode hard link count was not accurate. It was being updated when a file was looked up. The inodes of directories were updating their parent inode hard link count every time the inode was created. That means if memory reclaim cleaned a stale directory inode and the inode was looked up again, it would increment the parent inode again as well. Al Viro said to just have all eventfs directories have a hard link count of 1. That tells user space not to trust it. * tag 'trace-v6.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: eventfs: Keep all directory links at 1 eventfs: Remove fsnotify*() functions from lookup() eventfs: Restructure eventfs_inode structure to be more condensed eventfs: Warn if an eventfs_inode is freed without is_freed being set tracing/timerlat: Move hrtimer_init to timerlat_fd open() eventfs: Get rid of dentry pointers without refcounts eventfs: Clean up dentry ops and add revalidate function eventfs: Remove unused d_parent pointer field tracefs: dentry lookup crapectomy tracefs: Avoid using the ei->dentry pointer unnecessarily eventfs: Initialize the tracefs inode properly tracefs: Zero out the tracefs_inode when allocating it ring-buffer: Clean ring_buffer_poll_wait() error return
2024-02-02  bpf: Handle scalar spill vs all MISC in stacksafe()  [Eduard Zingerman]
When check_stack_read_fixed_off() reads value from an spi all stack slots of which are set to STACK_{MISC,INVALID}, the destination register is set to unbound SCALAR_VALUE. Exploit this fact by allowing stacksafe() to use a fake unbound scalar register to compare 'mmmm mmmm' stack value in old state vs spilled 64-bit scalar in current state and vice versa. Veristat results after this patch show some gains: ./veristat -C -e file,prog,states -f 'states_pct>10' not-opt after File Program States (DIFF) ----------------------- --------------------- --------------- bpf_overlay.o tail_rev_nodeport_lb4 -45 (-15.85%) bpf_xdp.o tail_lb_ipv4 -541 (-19.57%) pyperf100.bpf.o on_event -680 (-10.42%) pyperf180.bpf.o on_event -2164 (-19.62%) pyperf600.bpf.o on_event -9799 (-24.84%) strobemeta.bpf.o on_event -9157 (-65.28%) xdp_synproxy_kern.bpf.o syncookie_tc -54 (-19.29%) xdp_synproxy_kern.bpf.o syncookie_xdp -74 (-24.50%) Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20240127175237.526726-6-maxtram95@gmail.com
2024-02-02  bpf: Preserve boundaries and track scalars on narrowing fill  [Maxim Mikityanskiy]
When the width of a fill is smaller than the width of the preceding spill, the information about scalar boundaries can still be preserved, as long as it's coerced to the right width (done by coerce_reg_to_size). Even further, if the actual value fits into the fill width, the ID can be preserved as well for further tracking of equal scalars. Implement the above improvements, which makes narrowing fills behave the same as narrowing spills and MOVs between registers. Two tests are adjusted to accommodate for endianness differences and to take into account that it's now allowed to do a narrowing fill from the least significant bits. reg_bounds_sync is added to coerce_reg_to_size to correctly adjust umin/umax boundaries after the var_off truncation, for example, a 64-bit value 0xXXXXXXXX00000000, when read as a 32-bit, gets umin = 0, umax = 0xFFFFFFFF, var_off = (0x0; 0xffffffff00000000), which needs to be synced down to umax = 0, otherwise reg_bounds_sanity_check doesn't pass. Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20240127175237.526726-4-maxtram95@gmail.com
2024-02-02  bpf: Track spilled unbounded scalars  [Maxim Mikityanskiy]
Support the pattern where an unbounded scalar is spilled to the stack, then boundary checks are performed on the src register, after which the stack frame slot is refilled into a register. Before this commit, the verifier didn't treat the src register and the stack slot as related if the src register was an unbounded scalar. The register state wasn't copied, the id wasn't preserved, and the stack slot was marked as STACK_MISC. Subsequent boundary checks on the src register wouldn't result in updating the boundaries of the spilled variable on the stack. After this commit, the verifier will preserve the bond between src and dst even if src is unbounded, which permits to do boundary checks on src and refill dst later, still remembering its boundaries. Such a pattern is sometimes generated by clang when compiling complex long functions. One test is adjusted to reflect that now unbounded scalars are tracked. Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/20240127175237.526726-2-maxtram95@gmail.com
2024-02-02  modules: Remove #ifdef CONFIG_STRICT_MODULE_RWX around rodata_enabled  [Christophe Leroy]
Now that rodata_enabled is declared at all times, the #ifdef CONFIG_STRICT_MODULE_RWX can be removed. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2024-02-02  pid: kill the obsolete PIDTYPE_PID code in transfer_pid()  [Oleg Nesterov]
transfer_pid() must never be called with pid == PIDTYPE_PID; new_leader->thread_pid should be changed by exchange_tids(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20240202131255.GA26025@redhat.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-02  pidfd_poll: report POLLHUP when pid_task() == NULL  [Oleg Nesterov]
Add another wake_up_all(wait_pidfd) into __change_pid() and change pidfd_poll() to include EPOLLHUP if task == NULL. This allows waiting until the target process/thread is reaped. TODO: change do_notify_pidfd() to use the keyed wakeups. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20240202131226.GA26018@redhat.com Signed-off-by: Christian Brauner <brauner@kernel.org>
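A hedged userspace sketch of what the new semantics allow (error handling omitted; POLLHUP is always reported regardless of the requested events, so requesting nothing makes poll() return only once the task is reaped):

    #include <poll.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Block until the target process has actually been reaped, not just exited. */
    static int wait_until_reaped(pid_t pid)
    {
    	int pidfd = syscall(SYS_pidfd_open, pid, 0);
    	struct pollfd pfd = { .fd = pidfd, .events = 0 };

    	poll(&pfd, 1, -1);		/* returns once POLLHUP is raised */
    	close(pidfd);
    	return (pfd.revents & POLLHUP) ? 0 : -1;
    }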
2024-02-02  pidfd: implement PIDFD_THREAD flag for pidfd_open()  [Oleg Nesterov]
With this flag: - pidfd_open() doesn't require that the target task must be a thread-group leader - pidfd_poll() succeeds when the task exits and becomes a zombie (iow, passes exit_notify()), even if it is a leader and thread-group is not empty. This means that the behaviour of pidfd_poll(PIDFD_THREAD, pid-of-group-leader) is not well defined if it races with exec() from its sub-thread; pidfd_poll() can succeed or not depending on whether pidfd_task_exited() is called before or after exchange_tids(). Perhaps we can improve this behaviour later, pidfd_poll() can probably take sig->group_exec_task into account. But this doesn't really differ from the case when the leader exits before other threads (so pidfd_poll() succeeds) and then another thread execs and pidfd_poll() will block again. thread_group_exited() is no longer used, perhaps it can die. Co-developed-by: Tycho Andersen <tycho@tycho.pizza> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20240131132602.GA23641@redhat.com Tested-by: Tycho Andersen <tandersen@netflix.com> Reviewed-by: Tycho Andersen <tandersen@netflix.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
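Usage then looks roughly like this (a sketch assuming PIDFD_THREAD is exported through the uapi pidfd headers and tid holds a thread id):

    /* Get a pidfd referring to an individual thread rather than a thread-group
     * leader; poll() on it fires when that thread exits. */
    int tfd = syscall(SYS_pidfd_open, tid, PIDFD_THREAD);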
2024-02-02  pidfd: don't do_notify_pidfd() if !thread_group_empty()  [Oleg Nesterov]
do_notify_pidfd() makes no sense until the whole thread group exits, change do_notify_parent() to check thread_group_empty(). This avoids the unnecessary do_notify_pidfd() when tsk is not a leader, or it exits before other threads, or it has a ptraced EXIT_ZOMBIE sub-thread. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20240127132407.GA29136@redhat.com Reviewed-by: Tycho Andersen <tandersen@netflix.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-02  pidfd: cleanup the usage of __pidfd_prepare's flags  [Oleg Nesterov]
- make pidfd_create() static. - Don't pass O_RDWR | O_CLOEXEC to __pidfd_prepare() in copy_process(), __pidfd_prepare() adds these flags unconditionally. - Kill the flags check in __pidfd_prepare(). sys_pidfd_open() checks the flags itself, all other users of pidfd_prepare() pass flags = 0. If we need a sanity check for those other in kernel users then WARN_ON_ONCE(flags & ~PIDFD_NONBLOCK) makes more sense. - Don't pass O_RDWR to get_unused_fd_flags(), it ignores everything except O_CLOEXEC. - Don't pass O_CLOEXEC to anon_inode_getfile(), it ignores everything except O_ACCMODE | O_NONBLOCK. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20240125161734.GA778@redhat.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-02  fork: Using clone_flags for legacy clone check  [Wang Jinchao]
In the current implementation of clone(), there is a line that initializes `u64 clone_flags = args->flags` at the top. This means that there is no longer a need to use args->flags for the legacy clone check. Signed-off-by: Wang Jinchao <wangjinchao@xfusion.com> Link: https://lore.kernel.org/r/202401311054+0800-wangjinchao@xfusion.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-01  Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  [Jakub Kicinski]
Cross-merge networking fixes after downstream PR. No conflicts or adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-01  cap_syslog: remove CAP_SYS_ADMIN when dmesg_restrict  [Jingzi Meng]
CAP_SYSLOG was separated from CAP_SYS_ADMIN and introduced in Linux 2.6.37 (2010-11). For a long time, certain syslog actions required CAP_SYS_ADMIN or CAP_SYSLOG. Maybe it’s time to officially remove CAP_SYS_ADMIN for more fine-grained control. CAP_SYS_ADMIN was once removed but added back for backwards compatibility reasons. In commit 38ef4c2e437d ("syslog: check cap_syslog when dmesg_restrict") (2010-12), CAP_SYS_ADMIN was no longer needed. And in commit ee24aebffb75 ("cap_syslog: accept CAP_SYS_ADMIN for now") (2011-02), it was accepted again. Since then, CAP_SYS_ADMIN has been preserved. Now that almost 13 years have passed, the legacy application may have had enough time to be updated. Signed-off-by: Jingzi Meng <mengjingzi@iie.ac.cn> Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20240105062007.26965-1-mengjingzi@iie.ac.cn Signed-off-by: Kees Cook <keescook@chromium.org>
2024-02-01  bpf: Minor clean-up to sleepable_lsm_hooks BTF set  [Matt Bobrowski]
There's already one main CONFIG_SECURITY_NETWORK ifdef block within the sleepable_lsm_hooks BTF set. Consolidate this duplicated ifdef block as there's no need for it and all things guarded by it should remain in one place in this specific context. Signed-off-by: Matt Bobrowski <mattbobrowski@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/Zbt1smz43GDMbVU3@google.com
2024-02-01  tracing/timerlat: Move hrtimer_init to timerlat_fd open()  [Daniel Bristot de Oliveira]
Currently, the timerlat's hrtimer is initialized at the first read of timerlat_fd, and destroyed at close(). It works, but it causes an error if the user program open() and close() the file without reading. Here's an example: # echo NO_OSNOISE_WORKLOAD > /sys/kernel/debug/tracing/osnoise/options # echo timerlat > /sys/kernel/debug/tracing/current_tracer # cat <<EOF > ./timerlat_load.py # !/usr/bin/env python3 timerlat_fd = open("/sys/kernel/tracing/osnoise/per_cpu/cpu0/timerlat_fd", 'r') timerlat_fd.close(); EOF # ./taskset -c 0 ./timerlat_load.py <BOOM> BUG: kernel NULL pointer dereference, address: 0000000000000010 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT SMP NOPTI CPU: 1 PID: 2673 Comm: python3 Not tainted 6.6.13-200.fc39.x86_64 #1 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-1.fc39 04/01/2014 RIP: 0010:hrtimer_active+0xd/0x50 Code: 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 48 8b 57 30 <8b> 42 10 a8 01 74 09 f3 90 8b 42 10 a8 01 75 f7 80 7f 38 00 75 1d RSP: 0018:ffffb031009b7e10 EFLAGS: 00010286 RAX: 000000000002db00 RBX: ffff9118f786db08 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffff9117a0e64400 RDI: ffff9118f786db08 RBP: ffff9118f786db80 R08: ffff9117a0ddd420 R09: ffff9117804d4f70 R10: 0000000000000000 R11: 0000000000000000 R12: ffff9118f786db08 R13: ffff91178fdd5e20 R14: ffff9117840978c0 R15: 0000000000000000 FS: 00007f2ffbab1740(0000) GS:ffff9118f7840000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000010 CR3: 00000001b402e000 CR4: 0000000000750ee0 PKRU: 55555554 Call Trace: <TASK> ? __die+0x23/0x70 ? page_fault_oops+0x171/0x4e0 ? srso_alias_return_thunk+0x5/0x7f ? avc_has_extended_perms+0x237/0x520 ? exc_page_fault+0x7f/0x180 ? asm_exc_page_fault+0x26/0x30 ? hrtimer_active+0xd/0x50 hrtimer_cancel+0x15/0x40 timerlat_fd_release+0x48/0xe0 __fput+0xf5/0x290 __x64_sys_close+0x3d/0x80 do_syscall_64+0x60/0x90 ? srso_alias_return_thunk+0x5/0x7f ? __x64_sys_ioctl+0x72/0xd0 ? srso_alias_return_thunk+0x5/0x7f ? syscall_exit_to_user_mode+0x2b/0x40 ? srso_alias_return_thunk+0x5/0x7f ? do_syscall_64+0x6c/0x90 ? srso_alias_return_thunk+0x5/0x7f ? exit_to_user_mode_prepare+0x142/0x1f0 ? srso_alias_return_thunk+0x5/0x7f ? syscall_exit_to_user_mode+0x2b/0x40 ? srso_alias_return_thunk+0x5/0x7f ? do_syscall_64+0x6c/0x90 entry_SYSCALL_64_after_hwframe+0x6e/0xd8 RIP: 0033:0x7f2ffb321594 Code: 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 80 3d d5 cd 0d 00 00 74 13 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 3c c3 0f 1f 00 55 48 89 e5 48 83 ec 10 89 7d RSP: 002b:00007ffe8d8eef18 EFLAGS: 00000202 ORIG_RAX: 0000000000000003 RAX: ffffffffffffffda RBX: 00007f2ffba4e668 RCX: 00007f2ffb321594 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000003 RBP: 00007ffe8d8eef40 R08: 0000000000000000 R09: 0000000000000000 R10: 55c926e3167eae79 R11: 0000000000000202 R12: 0000000000000003 R13: 00007ffe8d8ef030 R14: 0000000000000000 R15: 00007f2ffba4e668 </TASK> CR2: 0000000000000010 ---[ end trace 0000000000000000 ]--- Move hrtimer_init to timerlat_fd open() to avoid this problem. 
Link: https://lore.kernel.org/linux-trace-kernel/7324dd3fc0035658c99b825204a66049389c56e3.1706798888.git.bristot@kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: stable@vger.kernel.org Fixes: e88ed227f639 ("tracing/timerlat: Add user-space interface") Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-01-31  bpf: treewide: Annotate BPF kfuncs in BTF  [Daniel Xu]
This commit marks kfuncs as such inside the .BTF_ids section. The upshot of these annotations is that we'll be able to automatically generate kfunc prototypes for downstream users. The process is as follows: 1. In source, use BTF_KFUNCS_START/END macro pair to mark kfuncs 2. During build, pahole injects into BTF a "bpf_kfunc" BTF_DECL_TAG for each function inside BTF_KFUNCS sets 3. At runtime, vmlinux or module BTF is made available in sysfs 4. At runtime, bpftool (or similar) can look at provided BTF and generate appropriate prototypes for functions with "bpf_kfunc" tag To ensure future kfunc are similarly tagged, we now also return error inside kfunc registration for untagged kfuncs. For vmlinux kfuncs, we also WARN(), as initcall machinery does not handle errors. Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Acked-by: Benjamin Tissoires <bentiss@kernel.org> Link: https://lore.kernel.org/r/e55150ceecbf0a5d961e608941165c0bee7bc943.1706491398.git.dxu@dxuuu.xyz Signed-off-by: Alexei Starovoitov <ast@kernel.org>
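An illustrative registration site using the new annotation style (the kfunc and set names below are made up):

    BTF_KFUNCS_START(example_kfunc_ids)
    BTF_ID_FLAGS(func, bpf_example_kfunc, KF_TRUSTED_ARGS)
    BTF_KFUNCS_END(example_kfunc_ids)

    static const struct btf_kfunc_id_set example_kfunc_set = {
    	.owner	= THIS_MODULE,
    	.set	= &example_kfunc_ids,
    };

    /* register_btf_kfunc_id_set() now rejects sets whose functions were not
     * marked via BTF_KFUNCS_START/END, and WARN()s for vmlinux kfuncs. */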
2024-01-31  ring-buffer: Clean ring_buffer_poll_wait() error return  [Vincent Donnefort]
The return type for ring_buffer_poll_wait() is __poll_t. This is behind the scenes an unsigned where we can set event bits. In case of a non-allocated CPU, we instead return -EINVAL (0xffffffea). Lucky us, this ends up setting a few error bits (EPOLLERR | EPOLLHUP | EPOLLNVAL), so user-space at least is aware something went wrong. Nonetheless, this is incorrect code. Replace that -EINVAL with a proper EPOLLERR to clean that output. As this doesn't change the behaviour, there's no need to treat this change as a bug fix. Link: https://lore.kernel.org/linux-trace-kernel/20240131140955.3322792-1-vdonnefort@google.com Cc: stable@vger.kernel.org Fixes: 6721cb6002262 ("ring-buffer: Do not poll non allocated cpu buffers") Signed-off-by: Vincent Donnefort <vdonnefort@google.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-01-30  workqueue: Avoid premature init of wq->node_nr_active[].max  [Tejun Heo]
System workqueues are allocated early during boot from workqueue_init_early(). While allocating unbound workqueues, wq_update_node_max_active() is invoked from apply_workqueue_attrs() and accesses NUMA topology to initialize wq->node_nr_active[].max. However, topology information may not be set up at this point. wq_update_node_max_active() is explicitly invoked from workqueue_init_topology() later when topology information is known to be available. This doesn't seem to crash anything but it's doing useless work with dubious data. Let's skip the premature and duplicate node_max_active updates by initializing the field to WQ_DFL_MIN_ACTIVE on allocation and making wq_update_node_max_active() noop until workqueue_init_topology(). Signed-off-by: Tejun Heo <tj@kernel.org> --- kernel/workqueue.c | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 9221a4c57ae1..a65081ec6780 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -386,6 +386,8 @@ static const char *wq_affn_names[WQ_AFFN_NR_TYPES] = { [WQ_AFFN_SYSTEM] = "system", }; +static bool wq_topo_initialized = false; + /* * Per-cpu work items which run for longer than the following threshold are * automatically considered CPU intensive and excluded from concurrency @@ -1510,6 +1512,9 @@ static void wq_update_node_max_active(struct workqueue_struct *wq, int off_cpu) lockdep_assert_held(&wq->mutex); + if (!wq_topo_initialized) + return; + if (!cpumask_test_cpu(off_cpu, effective)) off_cpu = -1; @@ -4356,6 +4361,7 @@ static void free_node_nr_active(struct wq_node_nr_active **nna_ar) static void init_node_nr_active(struct wq_node_nr_active *nna) { + nna->max = WQ_DFL_MIN_ACTIVE; atomic_set(&nna->nr, 0); raw_spin_lock_init(&nna->lock); INIT_LIST_HEAD(&nna->pending_pwqs); @@ -7400,6 +7406,8 @@ void __init workqueue_init_topology(void) init_pod_type(&wq_pod_types[WQ_AFFN_CACHE], cpus_share_cache); init_pod_type(&wq_pod_types[WQ_AFFN_NUMA], cpus_share_numa); + wq_topo_initialized = true; + mutex_lock(&wq_pool_mutex); /*
2024-01-30  workqueue: Don't call cpumask_test_cpu() with -1 CPU in wq_update_node_max_active()  [Tejun Heo]
For wq_update_node_max_active(), @off_cpu of -1 indicates that no CPU is going down. The function was incorrectly calling cpumask_test_cpu() with -1 CPU leading to oopses like the following on some archs: Unable to handle kernel paging request at virtual address ffff0002100296e0 .. pc : wq_update_node_max_active+0x50/0x1fc lr : wq_update_node_max_active+0x1f0/0x1fc ... Call trace: wq_update_node_max_active+0x50/0x1fc apply_wqattrs_commit+0xf0/0x114 apply_workqueue_attrs_locked+0x58/0xa0 alloc_workqueue+0x5ac/0x774 workqueue_init_early+0x460/0x540 start_kernel+0x258/0x684 __primary_switched+0xb8/0xc0 Code: 9100a273 35000d01 53067f00 d0016dc1 (f8607a60) ---[ end trace 0000000000000000 ]--- Kernel panic - not syncing: Attempted to kill the idle task! ---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]--- Fix it. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Marek Szyprowski <m.szyprowski@samsung.com> Reported-by: Nathan Chancellor <nathan@kernel.org> Tested-by: Nathan Chancellor <nathan@kernel.org> Link: http://lkml.kernel.org/r/91eacde0-df99-4d5c-a980-91046f66e612@samsung.com Fixes: 5797b1c18919 ("workqueue: Implement system-wide nr_active enforcement for unbound workqueues")
2024-01-30  bpf: add arg:nullable tag to be combined with trusted pointers  [Andrii Nakryiko]
Add ability to mark arg:trusted arguments with optional arg:nullable tag to mark it as PTR_TO_BTF_ID_OR_NULL variant, which will allow callers to pass NULL, and subsequently will force global subprog's code to do NULL check. This allows to have "optional" PTR_TO_BTF_ID values passed into global subprogs. For now arg:nullable cannot be combined with anything else. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240130000648.2144827-3-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-30  bpf: add __arg_trusted global func arg tag  [Andrii Nakryiko]
Add support for passing PTR_TO_BTF_ID registers to global subprogs. Currently only PTR_TRUSTED flavor of PTR_TO_BTF_ID is supported. Non-NULL semantics is assumed, so caller will be forced to prove PTR_TO_BTF_ID can't be NULL. Note, we disallow global subprogs to destroy passed in PTR_TO_BTF_ID arguments, even the trusted one. We achieve that by not setting ref_obj_id when validating subprog code. This basically enforces (in Rust terms) borrowing semantics vs move semantics. Borrowing semantics seems to be a better fit for isolated global subprog validation approach. Implementation-wise, we utilize existing logic for matching user-provided BTF type to kernel-side BTF type, used by BPF CO-RE logic and following same matching rules. We enforce a unique match for types. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240130000648.2144827-2-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-29  bpf,token: Use BIT_ULL() to convert the bit mask  [Haiyue Wang]
Replace the '(1ULL << *)' with the macro BIT_ULL(nr). Signed-off-by: Haiyue Wang <haiyue.wang@intel.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20240127134901.3698613-1-haiyue.wang@intel.com
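For reference, BIT_ULL() comes from <linux/bits.h> and expands to the same shift, so both expressions below are equivalent:

    #include <linux/bits.h>

    u64 legacy = 1ULL << 42;
    u64 macro  = BIT_ULL(42);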
2024-01-29  Merge tag 'trace-v6.8-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace  [Linus Torvalds]
Pull tracing fixes from Steven Rostedt: "Two small fixes for tracefs and eventfs: - Fix register_snapshot_trigger() on allocation error If the snapshot fails to allocate, the register_snapshot_trigger() can still return success. If the call to tracing_alloc_snapshot_instance() returned anything but 0, it returned 0, but it should have been returning the error code from that allocation function. - Remove leftover code from tracefs doing a dentry walk on remount. The update_gid() function was called by the tracefs code on remount to update the gid of eventfs, but that is no longer the case, but that code wasn't deleted. Nothing calls it. Remove it" * tag 'trace-v6.8-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracefs: remove stale 'update_gid' code tracing/trigger: Fix to return error if failed to alloc snapshot
2024-01-29  workqueue: Avoid using isolated cpus' timers on queue_delayed_work  [Leonardo Bras]
When __queue_delayed_work() is called, it chooses a cpu for handling the timer interrupt. As of today, it will pick either the cpu passed as parameter or the last cpu used for this. This is not good if a system does use CPU isolation, because it can take away some valuable cpu time to: 1 - deal with the timer interrupt, 2 - schedule-out the desired task, 3 - queue work on a random workqueue, and 4 - schedule the desired task back to the cpu. So to fix this, during __queue_delayed_work(), if cpu isolation is in place, pick a random non-isolated cpu to handle the timer interrupt. As an optimization, if the current cpu is not isolated, use it instead of looking for another candidate. Signed-off-by: Leonardo Bras <leobras@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
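The selection logic described above looks roughly like this (a sketch against the housekeeping API; cpu, timer and WORK_CPU_UNBOUND are assumed to be in __queue_delayed_work()'s scope, and this is not the verbatim patch):

    if (housekeeping_enabled(HK_TYPE_TIMER)) {
    	cpu = smp_processor_id();
    	if (!housekeeping_test_cpu(cpu, HK_TYPE_TIMER))
    		cpu = housekeeping_any_cpu(HK_TYPE_TIMER);
    	add_timer_on(timer, cpu);
    } else {
    	if (cpu == WORK_CPU_UNBOUND)
    		add_timer(timer);
    	else
    		add_timer_on(timer, cpu);
    }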
2024-01-29  Merge tag 'mm-hotfixes-stable-2024-01-28-23-21' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm  [Linus Torvalds]
Pull misc fixes from Andrew Morton: "22 hotfixes. 11 are cc:stable and the remainder address post-6.7 issues or aren't considered appropriate for backporting" * tag 'mm-hotfixes-stable-2024-01-28-23-21' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (22 commits) mm: thp_get_unmapped_area must honour topdown preference mm: huge_memory: don't force huge page alignment on 32 bit userfaultfd: fix mmap_changing checking in mfill_atomic_hugetlb selftests/mm: ksm_tests should only MADV_HUGEPAGE valid memory scs: add CONFIG_MMU dependency for vfree_atomic() mm/memory: fix folio_set_dirty() vs. folio_mark_dirty() in zap_pte_range() mm/huge_memory: fix folio_set_dirty() vs. folio_mark_dirty() selftests/mm: Update va_high_addr_switch.sh to check CPU for la57 flag selftests: mm: fix map_hugetlb failure on 64K page size systems MAINTAINERS: supplement of zswap maintainers update stackdepot: make fast paths lock-less again stackdepot: add stats counters exported via debugfs mm, kmsan: fix infinite recursion due to RCU critical section mm/writeback: fix possible divide-by-zero in wb_dirty_limits(), again selftests/mm: switch to bash from sh MAINTAINERS: add man-pages git trees mm: memcontrol: don't throttle dying tasks on memory.high mm: mmap: map MAP_STACK to VM_NOHUGEPAGE uprobes: use pagesize-aligned virtual address when replacing pages selftests/mm: mremap_test: fix build warning ...
2024-01-29  perf/bpf: Fix duplicate type check  [Florian Lehner]
Remove the duplicate check on type and unify result. Signed-off-by: Florian Lehner <dev@der-flo.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/bpf/20240120150920.3370-1-dev@der-flo.net
2024-01-29  bpf: move arg:ctx type enforcement check inside the main logic loop  [Andrii Nakryiko]
Now that bpf and bpf-next trees converged and we don't run the risk of merge conflicts, move btf_validate_prog_ctx_type() into its most logical place inside the main logic loop. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240125205510.3642094-4-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-29  module: Change module_enable_{nx/x/ro}() to more explicit names  [Christophe Leroy]
It's a bit puzzling to see a call to module_enable_nx() followed by a call to module_enable_x(). This is because one applies on text while the other applies on data. Change name to make that more clear. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2024-01-29  module: Use set_memory_rox()  [Christophe Leroy]
A couple of architectures seem concerned about calling set_memory_ro() and set_memory_x() too frequently and have implemented a version of set_memory_rox(), see commit 60463628c9e0 ("x86/mm: Implement native set_memory_rox()") and commit 22e99fa56443 ("s390/mm: implement set_memory_rox()") Use set_memory_rox() in modules when STRICT_MODULES_RWX is set. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2024-01-29  workqueue: Implement system-wide nr_active enforcement for unbound workqueues  [Tejun Heo]
A pool_workqueue (pwq) represents the connection between a workqueue and a worker_pool. One of the roles that a pwq plays is enforcement of the max_active concurrency limit. Before 636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu pool_workqueues"), there was one pwq per each CPU for per-cpu workqueues and per each NUMA node for unbound workqueues, which was a natural result of per-cpu workqueues being served by per-cpu pools and unbound by per-NUMA pools. In terms of max_active enforcement, this was, while not perfect, workable. For per-cpu workqueues, it was fine. For unbound, it wasn't great in that NUMA machines would get max_active that's multiplied by the number of nodes but didn't cause huge problems because NUMA machines are relatively rare and the node count is usually pretty low. However, cache layouts are more complex now and sharing a worker pool across a whole node didn't really work well for unbound workqueues. Thus, a series of commits culminating on 8639ecebc9b1 ("workqueue: Make unbound workqueues to use per-cpu pool_workqueues") implemented more flexible affinity mechanism for unbound workqueues which enables using e.g. last-level-cache aligned pools. In the process, 636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu pool_workqueues") made unbound workqueues use per-cpu pwqs like per-cpu workqueues. While the change was necessary to enable more flexible affinity scopes, this came with the side effect of blowing up the effective max_active for unbound workqueues. Before, the effective max_active for unbound workqueues was multiplied by the number of nodes. After, by the number of CPUs. 636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu pool_workqueues") claims that this should generally be okay. It is okay for users which self-regulates concurrency level which are the vast majority; however, there are enough use cases which actually depend on max_active to prevent the level of concurrency from going bonkers including several IO handling workqueues that can issue a work item for each in-flight IO. With targeted benchmarks, the misbehavior can easily be exposed as reported in http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3. Unfortunately, there is no way to express what these use cases need using per-cpu max_active. A CPU may issue most of in-flight IOs, so we don't want to set max_active too low but as soon as we increase max_active a bit, we can end up with unreasonable number of in-flight work items when many CPUs issue IOs at the same time. ie. The acceptable lowest max_active is higher than the acceptable highest max_active. Ideally, max_active for an unbound workqueue should be system-wide so that the users can regulate the total level of concurrency regardless of node and cache layout. The reasons workqueue hasn't implemented that yet are: - One max_active enforcement decouples from pool boundaires, chaining execution after a work item finishes requires inter-pool operations which would require lock dancing, which is nasty. - Sharing a single nr_active count across the whole system can be pretty expensive on NUMA machines. - Per-pwq enforcement had been more or less okay while we were using per-node pools. It looks like we no longer can avoid decoupling max_active enforcement from pool boundaries. 
This patch implements system-wide nr_active mechanism with the following design characteristics: - To avoid sharing a single counter across multiple nodes, the configured max_active is split across nodes according to the proportion of each workqueue's online effective CPUs per node. e.g. A node with twice more online effective CPUs will get twice higher portion of max_active. - Workqueue used to be able to process a chain of interdependent work items which is as long as max_active. We can't do this anymore as max_active is distributed across the nodes. Instead, a new parameter min_active is introduced which determines the minimum level of concurrency within a node regardless of how max_active distribution comes out to be. It is set to the smaller of max_active and WQ_DFL_MIN_ACTIVE which is 8. This can lead to higher effective max_weight than configured and also deadlocks if a workqueue was depending on being able to handle chains of interdependent work items that are longer than 8. I believe these should be fine given that the number of CPUs in each NUMA node is usually higher than 8 and work item chain longer than 8 is pretty unlikely. However, if these assumptions turn out to be wrong, we'll need to add an interface to adjust min_active. - Each unbound wq has an array of struct wq_node_nr_active which tracks per-node nr_active. When its pwq wants to run a work item, it has to obtain the matching node's nr_active. If over the node's max_active, the pwq is queued on wq_node_nr_active->pending_pwqs. As work items finish, the completion path round-robins the pending pwqs activating the first inactive work item of each, which involves some pool lock dancing and kicking other pools. It's not the simplest code but doesn't look too bad. v4: - wq_adjust_max_active() updated to invoke wq_update_node_max_active(). - wq_adjust_max_active() is now protected by wq->mutex instead of wq_pool_mutex. v3: - wq_node_max_active() used to calculate per-node max_active on the fly based on system-wide CPU online states. Lai pointed out that this can lead to skewed distributions for workqueues with restricted cpumasks. Update the max_active distribution to use per-workqueue effective online CPU counts instead of system-wide and cache the calculation results in node_nr_active->max. v2: - wq->min/max_active now uses WRITE/READ_ONCE() as suggested by Lai. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Naohiro Aota <Naohiro.Aota@wdc.com> Link: http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3 Fixes: 636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu pool_workqueues") Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
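To make the distribution scheme described in this entry concrete with illustrative numbers: an unbound workqueue with max_active = 16 on a two-node machine whose effective online CPUs split 12/4 would get per-node maxima of 16 * 12/16 = 12 and 16 * 4/16 = 4; the second node is then raised to min_active = min(16, WQ_DFL_MIN_ACTIVE = 8) = 8, for an effective total of 20, which is the "higher effective maximum than configured" case called out above.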
2024-01-29  workqueue: Introduce struct wq_node_nr_active  [Tejun Heo]
Currently, for both percpu and unbound workqueues, max_active applies per-cpu, which is a recent change for unbound workqueues. The change for unbound workqueues was a significant departure from the previous behavior of per-node application. It made some use cases create an undesirable number of concurrent work items and left no good way of fixing them. To address the problem, workqueue is implementing a NUMA node segmented global nr_active mechanism, which will be explained further in the next patch. As a preparation, this patch introduces struct wq_node_nr_active. It's a data structure allocated for each workqueue and NUMA node pair and currently only tracks the workqueue's number of active work items on the node. This is split out from the next patch to make it easier to understand and review. Note that there is an extra wq_node_nr_active allocated for the invalid node nr_node_ids which is used to track nr_active for pools which don't have a NUMA node associated, such as the default fallback system-wide pool. This doesn't cause any behavior changes visible to userland yet. The next patch will expand to implement the control mechanism on top. v4: - Fixed out-of-bound access when freeing per-cpu workqueues. v3: - Use flexible array for wq->node_nr_active as suggested by Lai. v2: - wq->max_active now uses WRITE/READ_ONCE() as suggested by Lai. - Lai pointed out that pwq_tryinc_nr_active() incorrectly dropped pwq->max_active check. Restored. As the next patch replaces the max_active enforcement mechanism, this doesn't change the end result. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
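The rough shape of the new structure, inferred from the init code quoted in the 2024-01-30 "Avoid premature init" entry earlier in this log (field types and ordering are assumptions):

    struct wq_node_nr_active {
    	int			max;		/* per-node max_active */
    	atomic_t		nr;		/* current number of active work items */
    	raw_spinlock_t		lock;		/* protects pending_pwqs */
    	struct list_head	pending_pwqs;	/* pwqs waiting for an nr_active slot */
    };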
2024-01-29  workqueue: Move pwq_dec_nr_in_flight() to the end of work item handling  [Tejun Heo]
The planned shared nr_active handling for unbound workqueues will make pwq_dec_nr_active() sometimes drop the pool lock temporarily to acquire other pool locks, which is necessary as retirement of an nr_active count from one pool may need to kick off an inactive work item in another pool. This patch moves the pwq_dec_nr_in_flight() call in try_to_grab_pending() to the end of work item handling so that work item state changes stay atomic. process_one_work(), which is the other user of pwq_dec_nr_in_flight(), already calls it at the end of work item handling. Comments are added to both call sites and pwq_dec_nr_in_flight(). This shouldn't cause any behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-01-29  workqueue: RCU protect wq->dfl_pwq and implement accessors for it  [Tejun Heo]
wq->cpu_pwq is RCU protected but wq->dfl_pwq isn't. This is okay because currently wq->dfl_pwq is only accessed to install it into wq->cpu_pwq which doesn't require RCU access. However, we want to be able to access wq->dfl_pwq under RCU in the future to access its __pod_cpumask and the code can be made easier to read by making the two pwq fields behave in the same way. - Make wq->dfl_pwq RCU protected. - Add unbound_pwq_slot() and unbound_pwq() which can access both ->dfl_pwq and ->cpu_pwq. The former returns the double pointer that can be used to access and update the pwqs. The latter performs a locking check and dereferences the double pointer. - pwq accesses and updates are converted to use unbound_pwq[_slot](). Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
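One plausible shape for such accessors (illustrative and simplified, not necessarily the exact upstream code):

    static struct pool_workqueue __rcu **
    unbound_pwq_slot(struct workqueue_struct *wq, int cpu)
    {
    	if (cpu >= 0)
    		return per_cpu_ptr(wq->cpu_pwq, cpu);
    	else
    		return &wq->dfl_pwq;
    }

    static struct pool_workqueue *unbound_pwq(struct workqueue_struct *wq, int cpu)
    {
    	/* caller must hold wq_pool_mutex, wq->mutex or the RCU read lock */
    	return rcu_dereference_check(*unbound_pwq_slot(wq, cpu),
    				     lockdep_is_held(&wq_pool_mutex) ||
    				     lockdep_is_held(&wq->mutex));
    }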
2024-01-29  workqueue: Make wq_adjust_max_active() round-robin pwqs while activating  [Tejun Heo]
wq_adjust_max_active() needs to activate work items after max_active is increased. Previously, it did that by visiting each pwq once activating all that could be activated. While this makes sense with per-pwq nr_active, nr_active will be shared across multiple pwqs for unbound wqs. Then, we'd want to round-robin through pwqs to be fairer. In preparation, this patch makes wq_adjust_max_active() round-robin pwqs while activating. While the activation ordering changes, this shouldn't cause user-noticeable behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-01-29  workqueue: Move nr_active handling into helpers  [Tejun Heo]
__queue_work(), pwq_dec_nr_in_flight() and wq_adjust_max_active() were open-coding nr_active handling, which is fine given that the operations are trivial. However, the planned unbound nr_active update will make them more complicated, so let's move them into helpers. - pwq_tryinc_nr_active() is added. It increments nr_active if under max_active limit and return a boolean indicating whether inc was successful. Note that the function is structured to accommodate future changes. __queue_work() is updated to use the new helper. - pwq_activate_first_inactive() is updated to use pwq_tryinc_nr_active() and thus no longer assumes that nr_active is under max_active and returns a boolean to indicate whether a work item has been activated. - wq_adjust_max_active() no longer tests directly whether a work item can be activated. Instead, it's updated to use the return value of pwq_activate_first_inactive() to tell whether a work item has been activated. - nr_active decrement and activating the first inactive work item is factored into pwq_dec_nr_active(). v3: - WARN_ON_ONCE(!WORK_STRUCT_INACTIVE) added to __pwq_activate_work() as now we're calling the function unconditionally from pwq_activate_first_inactive(). v2: - wq->max_active now uses WRITE/READ_ONCE() as suggested by Lai. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-01-29  workqueue: Replace pwq_activate_inactive_work() with [__]pwq_activate_work()  [Tejun Heo]
To prepare for unbound nr_active handling improvements, move work activation part of pwq_activate_inactive_work() into __pwq_activate_work() and add pwq_activate_work() which tests WORK_STRUCT_INACTIVE and updates nr_active. pwq_activate_first_inactive() and try_to_grab_pending() are updated to use pwq_activate_work(). The latter conversion is functionally identical. For the former, this conversion adds an unnecessary WORK_STRUCT_INACTIVE testing. This is temporary and will be removed by the next patch. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-01-29  workqueue: Factor out pwq_is_empty()  [Tejun Heo]
"!pwq->nr_active && list_empty(&pwq->inactive_works)" test is repeated multiple times. Let's factor it out into pwq_is_empty(). Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-01-29  workqueue: Move pwq->max_active to wq->max_active  [Tejun Heo]
max_active is a workqueue-wide setting and the configured value is stored in wq->saved_max_active; however, the effective value was stored in pwq->max_active. While this is harmless, it makes the max_active update process more complicated and gets in the way of the planned max_active semantic updates for unbound workqueues. This patch moves pwq->max_active to wq->max_active. This simplifies the code and makes freezing and noop max_active updates cheaper too. No user-visible behavior change is intended. As wq->max_active is updated while holding wq mutex but read without any locking, it now uses WRITE/READ_ONCE(). A new locking rule, WO, is added for it. v2: wq->max_active now uses WRITE/READ_ONCE() as suggested by Lai. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
2024-01-29  genirq/irq_sim: Shrink code by using <linux/cleanup.h> helpers  [Bartosz Golaszewski]
Use the new __free() mechanism to remove all gotos and simplify the error paths. Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Nathan Chancellor <nathan@kernel.org> Link: https://lore.kernel.org/r/20240122124243.44002-5-brgl@bgdev.pl
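For context, the <linux/cleanup.h> pattern being adopted looks like this (a generic sketch, not the irq_sim diff itself; struct foo and foo_setup() are placeholders):

    #include <linux/cleanup.h>
    #include <linux/slab.h>

    static int example_alloc(struct foo **out)
    {
    	struct foo *foo __free(kfree) = kzalloc(sizeof(*foo), GFP_KERNEL);

    	if (!foo)
    		return -ENOMEM;
    	if (foo_setup(foo) < 0)
    		return -EINVAL;		/* foo is kfree()d automatically here */

    	*out = no_free_ptr(foo);	/* success: transfer ownership, skip the auto-free */
    	return 0;
    }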
2024-01-28  Merge tag 'locking_urgent_for_v6.8_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  [Linus Torvalds]
Pull locking fix from Borislav Petkov: - Prevent an inconsistent futex operation leading to stale state exposure * tag 'locking_urgent_for_v6.8_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: futex: Prevent the reuse of stale pi_state
2024-01-28  Merge tag 'irq_urgent_for_v6.8_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  [Linus Torvalds]
Pull irq fix from Borislav Petkov: - Initialize the resend node of each IRQ descriptor, not only the first one * tag 'irq_urgent_for_v6.8_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: genirq: Initialize resend_node hlist for all interrupt descriptors
2024-01-28  Merge tag 'timers_urgent_for_v6.8_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  [Linus Torvalds]
Pull timer fixes from Borislav Petkov: - Preserve the number of idle calls and sleep entries across CPU hotplug events in order to be able to compute correct averages - Limit the duration of the clocksource watchdog checking interval as too long intervals lead to wrongly marking the TSC as unstable * tag 'timers_urgent_for_v6.8_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: tick/sched: Preserve number of idle sleeps across CPU hotplug events clocksource: Skip watchdog check for large watchdog intervals
2024-01-26  Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next  [Jakub Kicinski]
Daniel Borkmann says: ==================== pull-request: bpf-next 2024-01-26 We've added 107 non-merge commits during the last 4 day(s) which contain a total of 101 files changed, 6009 insertions(+), 1260 deletions(-). The main changes are: 1) Add BPF token support to delegate a subset of BPF subsystem functionality from privileged system-wide daemons such as systemd through special mount options for userns-bound BPF fs to a trusted & unprivileged application. With addressed changes from Christian and Linus' reviews, from Andrii Nakryiko. 2) Support registration of struct_ops types from modules which helps projects like fuse-bpf that seeks to implement a new struct_ops type, from Kui-Feng Lee. 3) Add support for retrieval of cookies for perf/kprobe multi links, from Jiri Olsa. 4) Bigger batch of prep-work for the BPF verifier to eventually support preserving boundaries and tracking scalars on narrowing fills, from Maxim Mikityanskiy. 5) Extend the tc BPF flavor to support arbitrary TCP SYN cookies to help with the scenario of SYN floods, from Kuniyuki Iwashima. 6) Add code generation to inline the bpf_kptr_xchg() helper which improves performance when stashing/popping the allocated BPF objects, from Hou Tao. 7) Extend BPF verifier to track aligned ST stores as imprecise spilled registers, from Yonghong Song. 8) Several fixes to BPF selftests around inline asm constraints and unsupported VLA code generation, from Jose E. Marchesi. 9) Various updates to the BPF IETF instruction set draft document such as the introduction of conformance groups for instructions, from Dave Thaler. 10) Fix BPF verifier to make infinite loop detection in is_state_visited() exact to catch some too lax spill/fill corner cases, from Eduard Zingerman. 11) Refactor the BPF verifier pointer ALU check to allow ALU explicitly instead of implicitly for various register types, from Hao Sun. 12) Fix the flaky tc_redirect_dtime BPF selftest due to slowness in neighbor advertisement at setup time, from Martin KaFai Lau. 13) Change BPF selftests to skip callback tests for the case when the JIT is disabled, from Tiezhu Yang. 14) Add a small extension to libbpf which allows to auto create a map-in-map's inner map, from Andrey Grafin. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (107 commits) selftests/bpf: Add missing line break in test_verifier bpf, docs: Clarify definitions of various instructions bpf: Fix error checks against bpf_get_btf_vmlinux().
bpf: One more maintainer for libbpf and BPF selftests selftests/bpf: Incorporate LSM policy to token-based tests selftests/bpf: Add tests for LIBBPF_BPF_TOKEN_PATH envvar libbpf: Support BPF token path setting through LIBBPF_BPF_TOKEN_PATH envvar selftests/bpf: Add tests for BPF object load with implicit token selftests/bpf: Add BPF object loading tests with explicit token passing libbpf: Wire up BPF token support at BPF object level libbpf: Wire up token_fd into feature probing logic libbpf: Move feature detection code into its own file libbpf: Further decouple feature checking logic from bpf_object libbpf: Split feature detectors definitions from cached results selftests/bpf: Utilize string values for delegate_xxx mount options bpf: Support symbolic BPF FS delegation mount options bpf: Fail BPF_TOKEN_CREATE if no delegation option was set on BPF FS bpf,selinux: Allocate bpf_security_struct per BPF token selftests/bpf: Add BPF token-enabled tests libbpf: Add BPF token support to bpf_prog_load() API ... ==================== Link: https://lore.kernel.org/r/20240126215710.19855-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-26  workqueue: Break up enum definitions and give names to the types  [Tejun Heo]
workqueue is collecting different sorts of enums into a single unnamed enum type which can increase confusion around enum width. Also, unnamed enums can't be accessed from BPF. Let's break up enum definitions according to their purposes and give them type names. Signed-off-by: Tejun Heo <tj@kernel.org>
2024-01-26  workqueue: Drop unnecessary kick_pool() in create_worker()  [Tejun Heo]
After creating a new worker, create_worker() is calling kick_pool() to wake up the new worker task. However, as kick_pool() doesn't do anything if there is no work pending, it also calls wake_up_process() explicitly. There's no reason to call kick_pool() at all. wake_up_process() is enough by itself. Drop the unnecessary kick_pool() call. Signed-off-by: Tejun Heo <tj@kernel.org>
2024-01-26  tracing/trigger: Fix to return error if failed to alloc snapshot  [Masami Hiramatsu (Google)]
Fix register_snapshot_trigger() to return an error code if it failed to allocate a snapshot instead of 0 (success). Without this, it will register the snapshot trigger without an error. Link: https://lore.kernel.org/linux-trace-kernel/170622977792.270660.2789298642759362200.stgit@devnote2 Fixes: 0bbe7f719985 ("tracing: Fix the race between registering 'snapshot' event trigger and triggering 'snapshot' operation") Cc: stable@vger.kernel.org Cc: Vincent Donnefort <vdonnefort@google.com> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
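The shape of the fix is roughly the following (an illustrative sketch, not the verbatim diff; the function signature follows the existing trigger-ops convention):

    static int register_snapshot_trigger(char *glob,
    				     struct event_trigger_data *data,
    				     struct trace_event_file *file)
    {
    	int ret = tracing_alloc_snapshot_instance(file->tr);

    	if (ret < 0)
    		return ret;	/* previously: fell through and reported success */

    	return register_trigger(glob, data, file);
    }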