path: root/kernel
2025-04-11  audit: record AUDIT_ANOM_* events regardless of presence of rules  (Richard Guy Briggs)
When no audit rules are in place, AUDIT_ANOM_{LINK,CREAT} events reported in audit_log_path_denied() are unconditionally dropped due to an explicit check for the existence of any audit rules. Given this is a report of a security violation, allow it to be recorded regardless of the existence of any audit rules.

To test:

  mkdir -p /root/tmp
  chmod 1777 /root/tmp
  touch /root/tmp/test.txt
  useradd test
  chown test /root/tmp/test.txt
  {echo C0644 12 test.txt; printf 'hello\ntest1\n'; printf \\000;} | \
    scp -t /root/tmp

Check with:

  ausearch -m ANOM_CREAT -ts recent

Link: https://issues.redhat.com/browse/RHEL-9065 Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
2025-04-11  audit: mark audit_log_vformat() with __printf() attribute  (Andy Shevchenko)
audit_log_vformat() is using printf() type of format, and GCC compiler (Debian 14.2.0-17) is not happy about this:

  kernel/audit.c:1978:9: error: function ‘audit_log_vformat’ might be a candidate for ‘gnu_printf’ format attribute
  kernel/audit.c:1987:17: error: function ‘audit_log_vformat’ might be a candidate for ‘gnu_printf’ format attribute

Fix the compilation errors (`make W=1` when CONFIG_WERROR=y, which is default) by adding __printf() attribute. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> [PM: commit description line wrap fixes] Signed-off-by: Paul Moore <paul@paul-moore.com>
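For reference, the annotation boils down to telling GCC where the format string and the va_list live; a minimal sketch (vprintf-style helpers use 0 for the variadic position):

  /* __printf(2, 0): the format string is argument 2, and the values
   * arrive as a va_list rather than as variadic arguments.
   */
  static __printf(2, 0)
  void audit_log_vformat(struct audit_buffer *ab, const char *fmt,
  		       va_list args);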
2025-04-11  bpf: Convert ringbuf map to rqspinlock  (Kumar Kartikeya Dwivedi)
Convert the raw spinlock used by BPF ringbuf to rqspinlock. Currently, we have an open syzbot report of a potential deadlock. In addition, the ringbuf can fail to reserve spuriously under contention from NMI context. It is potentially attractive to enable unconstrained usage (incl. NMIs) while ensuring no deadlocks manifest at runtime, so perform the conversion to rqspinlock to achieve this.

This change was benchmarked for BPF ringbuf's multi-producer contention case on an Intel Sapphire Rapids server, with hyperthreading disabled and the performance governor turned on. 5 warm-up runs were done for each case before obtaining the results.

Before (raw_spinlock_t):

  Ringbuf, multi-producer contention
  ==================================
  rb-libbpf nr_prod 1   11.440 ± 0.019M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 2    2.706 ± 0.010M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 3    3.130 ± 0.004M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 4    2.472 ± 0.003M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 8    2.352 ± 0.001M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 12   2.813 ± 0.001M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 16   1.988 ± 0.001M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 20   2.245 ± 0.001M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 24   2.148 ± 0.001M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 28   2.190 ± 0.001M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 32   2.490 ± 0.001M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 36   2.180 ± 0.001M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 40   2.201 ± 0.001M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 44   2.226 ± 0.001M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 48   2.164 ± 0.001M/s (drops 0.000 ± 0.000M/s)
  rb-libbpf nr_prod 52   1.874 ± 0.001M/s (drops 0.000 ± 0.000M/s)

After (rqspinlock_t):

  Ringbuf, multi-producer contention
  ==================================
  rb-libbpf nr_prod 1   11.078 ± 0.019M/s (drops 0.000 ± 0.000M/s) (-3.16%)
  rb-libbpf nr_prod 2    2.801 ± 0.014M/s (drops 0.000 ± 0.000M/s) (3.51%)
  rb-libbpf nr_prod 3    3.454 ± 0.005M/s (drops 0.000 ± 0.000M/s) (10.35%)
  rb-libbpf nr_prod 4    2.567 ± 0.002M/s (drops 0.000 ± 0.000M/s) (3.84%)
  rb-libbpf nr_prod 8    2.468 ± 0.001M/s (drops 0.000 ± 0.000M/s) (4.93%)
  rb-libbpf nr_prod 12   2.510 ± 0.001M/s (drops 0.000 ± 0.000M/s) (-10.77%)
  rb-libbpf nr_prod 16   2.075 ± 0.001M/s (drops 0.000 ± 0.000M/s) (4.38%)
  rb-libbpf nr_prod 20   2.640 ± 0.001M/s (drops 0.000 ± 0.000M/s) (17.59%)
  rb-libbpf nr_prod 24   2.092 ± 0.001M/s (drops 0.000 ± 0.000M/s) (-2.61%)
  rb-libbpf nr_prod 28   2.426 ± 0.005M/s (drops 0.000 ± 0.000M/s) (10.78%)
  rb-libbpf nr_prod 32   2.331 ± 0.004M/s (drops 0.000 ± 0.000M/s) (-6.39%)
  rb-libbpf nr_prod 36   2.306 ± 0.003M/s (drops 0.000 ± 0.000M/s) (5.78%)
  rb-libbpf nr_prod 40   2.178 ± 0.002M/s (drops 0.000 ± 0.000M/s) (-1.04%)
  rb-libbpf nr_prod 44   2.293 ± 0.001M/s (drops 0.000 ± 0.000M/s) (3.01%)
  rb-libbpf nr_prod 48   2.022 ± 0.001M/s (drops 0.000 ± 0.000M/s) (-6.56%)
  rb-libbpf nr_prod 52   1.809 ± 0.001M/s (drops 0.000 ± 0.000M/s) (-3.47%)

There's a fair amount of noise in the benchmark, with numbers on reruns going up and down by 10%, so all changes are within the range of this disturbance, and we see no major regressions. Reported-by: syzbot+850aaf14624dc0c6d366@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/0000000000004aa700061379547e@google.com Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250411101759.4061366-1-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
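As a rough sketch of the conversion pattern (not the verbatim diff; the surrounding ringbuf code is elided), the key difference from a raw spinlock is that the rqspinlock acquire can fail, so the reserve path must bail out instead of spinning forever:

  rqspinlock_t lock;		/* was: raw_spinlock_t lock; */
  unsigned long flags;

  if (raw_res_spin_lock_irqsave(&lock, flags))
  	return NULL;		/* deadlock/timeout detected, treat as reserve failure */
  /* ... producer critical section: reserve space in the ring buffer ... */
  raw_res_spin_unlock_irqrestore(&lock, flags);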
2025-04-10  trace: tcp: Add tracepoint for tcp_sendmsg_locked()  (Breno Leitao)
Add a tracepoint to monitor TCP send operations, enabling detailed visibility into TCP message transmission. Create a new tracepoint within the tcp_sendmsg_locked function, capturing traditional fields along with size_goal, which indicates the optimal data size for a single TCP segment. Additionally, a reference to the struct sock sk is passed, allowing direct access for BPF programs. The implementation is largely based on David's patch[1] and suggestions. Link: https://lore.kernel.org/all/70168c8f-bf52-4279-b4c4-be64527aa1ac@kernel.org/ [1] Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250408-tcpsendmsg-v3-2-208b87064c28@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
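The call-site shape is roughly the following sketch; the sk and size_goal arguments are stated by the commit, while the remaining arguments are an assumption here and may differ from the actual tracepoint definition:

  /* inside tcp_sendmsg_locked(), once the send loop knows the skb and size_goal */
  trace_tcp_sendmsg_locked(sk, msg, skb, size_goal);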
2025-04-10  Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  (Jakub Kicinski)
Cross-merge networking fixes after downstream PR (net-6.15-rc2).

Conflict:
  Documentation/networking/netdevices.rst
  net/core/lock_debug.c
    04efcee6ef8d ("net: hold instance lock during NETDEV_CHANGE")
    03df156dd3a6 ("xdp: double protect netdev->xdp_flags with netdev->lock")

No adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-10  Merge tag 'timers-urgent-2025-04-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull misc timer fixes from Ingo Molnar:
 - Fix missing ACCESS_PRIVATE() that triggered a Sparse warning
 - Fix lockdep false positive in tick_freeze() on CONFIG_PREEMPT_RT=y
 - Avoid <vdso/unaligned.h> macro's variable shadowing to address build warning that triggers under W=2 builds
* tag 'timers-urgent-2025-04-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  vdso: Address variable shadowing in macros
  timekeeping: Add a lockdep override in tick_freeze()
  hrtimer: Add missing ACCESS_PRIVATE() for hrtimer::function
2025-04-10  Merge tag 'perf-urgent-2025-04-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull misc perf events fixes from Ingo Molnar:
 - Fix __free_event() corner case splat
 - Fix false-positive uprobes related lockdep splat on CONFIG_PREEMPT_RT=y kernels
 - Fix a complicated perf sigtrap race that may result in hangs
* tag 'perf-urgent-2025-04-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf: Fix hang while freeing sigtrap event
  uprobes: Avoid false-positive lockdep splat on CONFIG_PREEMPT_RT=y in the ri_timer() uprobe timer callback, use raw_write_seqcount_*()
  perf/core: Fix WARN_ON(!ctx) in __free_event() for partial init
2025-04-10  bpf: Convert queue_stack map to rqspinlock  (Kumar Kartikeya Dwivedi)
Replace all usage of raw_spinlock_t in queue_stack_maps.c with rqspinlock. This is a map type with a set of open syzbot reports reproducing possible deadlocks. Prior attempt to fix the issues was at [0], but was dropped in favor of this approach. Make sure we return the -EBUSY error in case of possible deadlocks or timeouts, just to make sure user space or BPF programs relying on the error code to detect problems do not break. With these changes, the map should be safe to access in any context, including NMIs. [0]: https://lore.kernel.org/all/20240429165658.1305969-1-sidchintamaneni@gmail.com Reported-by: syzbot+8bdfc2c53fb2b63e1871@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/0000000000004c3fc90615f37756@google.com Reported-by: syzbot+252bc5c744d0bba917e1@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/000000000000c80abd0616517df9@google.com Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250410153142.2064340-1-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-04-10  bpf: Use architecture provided res_smp_cond_load_acquire  (Kumar Kartikeya Dwivedi)
In v2 of rqspinlock [0], we fixed potential problems with WFE usage in arm64 to fall back to a version copied from Ankur's series [1]. This logic was moved into arch-specific headers in v3 [2]. However, we missed using the arch-provided res_smp_cond_load_acquire in commit ebababcd0372 ("rqspinlock: Hardcode cond_acquire loops for arm64") due to a rebasing mistake between v2 and v3 of the rqspinlock series. Fix the typo to fall back to the arm64 definition as we did in v2. [0]: https://lore.kernel.org/bpf/20250206105435.2159977-18-memxor@gmail.com [1]: https://lore.kernel.org/lkml/20250203214911.898276-1-ankur.a.arora@oracle.com [2]: https://lore.kernel.org/bpf/20250303152305.3195648-9-memxor@gmail.com Fixes: ebababcd0372 ("rqspinlock: Hardcode cond_acquire loops for arm64") Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250410145512.1876745-1-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-04-09  bpf: Don't allocate per-cpu extra_elems for fd htab  (Hou Tao)
The update of element in fd htab is in-place now, therefore, there is no need to allocate per-cpu extra_elems, just remove it. Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20250401062250.543403-6-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-04-09  bpf: Add is_fd_htab() helper  (Hou Tao)
Add is_fd_htab() helper to check whether the map is htab of maps. Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20250401062250.543403-5-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
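The helper is presumably just a map-type test; a sketch (exact signature assumed):

  static bool is_fd_htab(const struct bpf_map *map)
  {
  	return map->map_type == BPF_MAP_TYPE_HASH_OF_MAPS;
  }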
2025-04-09  bpf: Support atomic update for htab of maps  (Hou Tao)
As reported by Cody Haas [1], when there are concurrent map lookup and map update operations on an existing element for an htab of maps, the map lookup procedure may return -ENOENT unexpectedly.

The root cause is twofold:

1) The update of an existing element involves two separate list operations.
   In htab_map_update_elem(), it first inserts the new element at the head of the list, then it deletes the old element. Therefore, it is possible a lookup operation has already iterated to the middle of the list when a concurrent update operation begins, and the lookup operation will fail to find the target element.

2) The immediate reuse of the htab element.
   This is more subtle. Even though the lookup operation finds the old element, it is possible that the target element has been removed by a concurrent update operation, and the element has been reused immediately by another update operation which runs on the same CPU as the previous update operation, and the element is inserted into the same bucket list. After these steps, when the lookup operation tries to compare the key in the old element with the expected key, the match will fail because the key in the old element has been overwritten by the other update operation.

The two-step update process is relatively straightforward to address. The more challenging aspect is the immediate reuse. As Alexei pointed out:

  So since 2022 both prealloc and no_prealloc reuse elements. We can consider a new flag for the hash map like F_REUSE_AFTER_RCU_GP that will use _rcu() flavor of freeing into bpf_ma, but it has to have a strong reason.

Given that htab of maps doesn't support special fields in the value and directly stores the inner map pointer in htab_element, just do an in-place update for htab of maps instead of attempting to address the immediate reuse issue.

[1]: https://lore.kernel.org/xdp-newbies/CAH7f-ULFTwKdoH_t2SFc5rWCVYLEg-14d1fBYWH2eekudsnTRg@mail.gmail.com/ Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20250401062250.543403-4-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
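Conceptually, the in-place path looks something like the sketch below (helper names follow this series, control flow approximate): since the value of an fd htab is just the inner map pointer, an existing element can be updated atomically without touching the bucket list:

  if (l_old && is_fd_htab(&htab->map)) {
  	/* swap the inner map pointer in place; no unlink/relink, so a
  	 * concurrent lookup never loses sight of the element
  	 */
  	void **pptr = htab_elem_value(l_old, key_size);

  	old_map = xchg(pptr, new_map);
  	/* the reference held by old_map is dropped by the caller */
  }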
2025-04-09  bpf: Rename __htab_percpu_map_update_elem to htab_map_update_elem_in_place  (Hou Tao)
Rename __htab_percpu_map_update_elem to htab_map_update_elem_in_place, and add a new percpu argument for the helper to support in-place update for both per-cpu htab and htab of maps. Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20250401062250.543403-3-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-04-09  bpf: Factor out htab_elem_value helper()  (Hou Tao)
All hash maps store map key and map value together. The relative offset of the map value compared to the map key is round_up(key_size, 8). Therefore, factor out a common helper htab_elem_value() to calculate the address of the map value instead of duplicating the logic. Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20250401062250.543403-2-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
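The resulting helper is essentially a one-liner; a sketch based on the layout described above (exact signature assumed):

  static void *htab_elem_value(struct htab_elem *l, u32 key_size)
  {
  	/* the value is stored right after the key, rounded up to 8 bytes */
  	return l->key + round_up(key_size, 8);
  }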
2025-04-09  configs/debug: run and debug PREEMPT  (Stanislav Fomichev)
Recent change [0] resulted in a "BUG: using __this_cpu_read() in preemptible" splat [1]. PREEMPT kernels have additional requirements on what can and cannot run with/without preemption enabled. Expose those constraints in the debug kernels. 0: https://lore.kernel.org/netdev/20250314120048.12569-2-justin.iurman@uliege.be/ 1: https://lore.kernel.org/netdev/20250402094458.006ba2a7@kernel.org/T/#mbf72641e9d7d274daee9003ef5edf6833201f1bc Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Simon Horman <horms@kernel.org> Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250402172305.1775226-1-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-09  bpf: Check link_create.flags parameter for multi_uprobe  (Tao Chen)
The link_create.flags are currently not used for multi-uprobes, so return -EINVAL if it is set, same as for other attach APIs. We allow target_fd to have an arbitrary value for multi-uprobe, though, as there are existing users (libbpf) relying on this. Fixes: 89ae89f53d20 ("bpf: Add multi uprobe link") Signed-off-by: Tao Chen <chen.dylane@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250407035752.1108927-2-chen.dylane@linux.dev
2025-04-09  bpf: Check link_create.flags parameter for multi_kprobe  (Tao Chen)
The link_create.flags are currently not used for multi-kprobes, so return -EINVAL if it is set, same as for other attach APIs. We allow target_fd, on the other hand, to have an arbitrary value for multi-kprobe, as there are existing users (libbpf) relying on this. Fixes: 0dcac2725406 ("bpf: Add multi kprobe link") Signed-off-by: Tao Chen <chen.dylane@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250407035752.1108927-1-chen.dylane@linux.dev
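For both the multi-kprobe and multi-uprobe attach paths above, the added check amounts to something like the following (placement within the attach functions approximate):

  if (attr->link_create.flags)
  	return -EINVAL;	/* flags are reserved; target_fd intentionally stays unchecked */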
2025-04-09  timekeeping: Add a lockdep override in tick_freeze()  (Sebastian Andrzej Siewior)
tick_freeze() acquires a raw spinlock (tick_freeze_lock). Later in the callchain (timekeeping_suspend() -> mc146818_avoid_UIP()) the RTC driver acquires a spinlock which becomes a sleeping lock on PREEMPT_RT. Lockdep complains about this lock nesting. Add a lockdep override for this special case and a comment explaining why it is okay. Reported-by: Borislav Petkov <bp@alien8.de> Reported-by: Chris Bainbridge <chris.bainbridge@gmail.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/20250404133429.pnAzf-eF@linutronix.de Closes: https://lore.kernel.org/all/20250330113202.GAZ-krsjAnurOlTcp-@fat_crate.local/ Closes: https://lore.kernel.org/all/CAP-bSRZ0CWyZZsMtx046YV8L28LhY0fson2g4EqcwRAVN1Jk+Q@mail.gmail.com/
2025-04-09  posix-timers: Initialize cache early and move pointer into __timer_data  (Eric Dumazet)
Move posix_timers_cache initialization to posixtimer_init(). At that point the memory subsystem is already up and running. Also move the cache pointer to the __timer_data variable to avoid potential false sharing, since it never was marked as __ro_after_init. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250402133114.253901-1-edumazet@google.com
2025-04-09  sched_ext: Make scx_has_op a bitmap  (Tejun Heo)
scx_has_op is used to encode which ops are implemented by the BPF scheduler into an array of static_keys. While this saves a bit of branching overhead, that is unlikely to be noticeable compared to the overall cost. As the global static_keys can't work with the planned hierarchical multiple scheduler support, replace the static_key array with a bitmap. In repeated hackbench runs before and after static_keys removal on an AMD Ryzen 3900X, I couldn't tell any measurable performance difference. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Changwoo Min <changwoo@igalia.com> Acked-by: Andrea Righi <arighi@nvidia.com>
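In rough terms the change swaps a static_key array for a bitmap; a sketch with approximate identifiers (do_dispatch() is purely illustrative):

  /* before: static struct static_key_false scx_has_op[SCX_OPI_END]; */
  static DECLARE_BITMAP(scx_has_op, SCX_OPI_END);

  /* enable path: record which ops the BPF scheduler implements */
  if (ops->dispatch)
  	set_bit(SCX_OPI_DISPATCH, scx_has_op);

  /* call site: a plain bit test instead of static_branch_likely() */
  if (test_bit(SCX_OPI_DISPATCH, scx_has_op))
  	do_dispatch();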
2025-04-09  sched_ext: Remove scx_ops_allow_queued_wakeup static_key  (Tejun Heo)
scx_ops_allow_queued_wakeup is used to encode SCX_OPS_ALLOW_QUEUED_WAKEUP into a static_key. The test is gated behind scx_enabled(), and, even when sched_ext is enabled, it is unlikely that the static_key makes any meaningful difference. It is made to use a static_key mostly because there was no reason not to. However, global static_keys can't work with the planned hierarchical multiple scheduler support. Remove the static_key and instead test SCX_OPS_ALLOW_QUEUED_WAKEUP directly. In repeated hackbench runs before and after static_keys removal on an AMD Ryzen 3900X, I couldn't tell any measurable performance difference. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Changwoo Min <changwoo@igalia.com> Acked-by: Andrea Righi <arighi@nvidia.com>
2025-04-09  sched_ext: Remove scx_ops_cpu_preempt static_key  (Tejun Heo)
scx_ops_cpu_preempt is used to encode whether ops.cpu_acquire/release() are implemented into a static_key. These tests aren't hot enough for static_key usage to make any meaningful difference and are made to use a static_key mostly because there was no reason not to. However, global static_keys can't work with the planned hierarchical multiple scheduler support. Remove the static_key and instead use an internal ops flag SCX_OPS_HAS_CPU_PREEMPT to record and test whether ops.cpu_acquire/release() are implemented. In repeated hackbench runs before and after static_keys removal on an AMD Ryzen 3900X, I couldn't tell any measurable performance difference. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Changwoo Min <changwoo@igalia.com> Acked-by: Andrea Righi <arighi@nvidia.com>
2025-04-09  sched_ext: Remove scx_ops_enq_* static_keys  (Tejun Heo)
scx_ops_enq_last/exiting/migration_disabled are used to encode the corresponding SCX_OPS_ flags into static_keys. These flags aren't hot enough for static_key usage to make any meaningful difference and are made static_keys mostly because there was no reason not to. However, global static_keys can't work with the planned hierarchical multiple scheduler support. Remove the static_keys and test the ops flags directly. In repeated hackbench runs before and after static_keys removal on an AMD Ryzen 3900X, I couldn't tell any measurable performance difference. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Changwoo Min <changwoo@igalia.com> Acked-by: Andrea Righi <arighi@nvidia.com>
2025-04-09  sched_ext: Indentation updates  (Tejun Heo)
Purely cosmetic. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Changwoo Min <changwoo@igalia.com> Acked-by: Andrea Righi <arighi@nvidia.com>
2025-04-09  hrtimer: Add missing ACCESS_PRIVATE() for hrtimer::function  (Nam Cao)
The "function" field of struct hrtimer has been changed to private, but two instances have not been converted to use ACCESS_PRIVATE(). Convert them to use ACCESS_PRIVATE(). Fixes: 04257da0c99c ("hrtimers: Make callback function pointer private") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250408103854.1851093-1-namcao@linutronix.de Closes: https://lore.kernel.org/oe-kbuild-all/202504071931.vOVl13tt-lkp@intel.com/ Closes: https://lore.kernel.org/oe-kbuild-all/202504072155.5UAZjYGU-lkp@intel.com/
2025-04-09  genirq/msi: Rename msi_[un]lock_descs()  (Thomas Gleixner)
Now that all abuse is gone and the legit users are converted to guard(msi_descs_lock), rename the lock functions and document them as internal. No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huwei.com> Link: https://lore.kernel.org/all/20250319105506.864699741@linutronix.de
2025-04-09  genirq/msi: Use lock guards for MSI descriptor locking  (Thomas Gleixner)
Provide a lock guard for MSI descriptor locking and update the core code accordingly. No functional change intended. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Link: https://lore.kernel.org/all/20250319105506.144672678@linutronix.de
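A sketch of the guard-based usage callers are converted to (guard name as per the commit, surrounding function purely illustrative):

  static int msi_example_walk(struct device *dev)
  {
  	guard(msi_descs_lock)(dev);	/* released automatically at function exit */
  	/* ... operate on dev's MSI descriptors ... */
  	return 0;
  }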
2025-04-09  PM: hibernate: Remove size arguments when calling strscpy()  (Thorsten Blum)
The size parameter is optional and strscpy() automatically determines the length of the destination buffer using sizeof() if the argument is omitted. This makes the explicit sizeof() calls unnecessary. Remove them to shorten and simplify the code. Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Link: https://patch.msgid.link/20250318080755.61126-2-thorsten.blum@linux.dev Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
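The simplification looks like this (illustrative buffer names, not the exact hibernate.c hunk):

  char buf[64];

  strscpy(buf, src, sizeof(buf));	/* before */
  strscpy(buf, src);			/* after: size inferred from the array */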
2025-04-09  tracing: Do not add length to print format in synthetic events  (Steven Rostedt)
The following causes a vsnprintf fault:

  # echo 's:wake_lat char[] wakee; u64 delta;' >> /sys/kernel/tracing/dynamic_events
  # echo 'hist:keys=pid:ts=common_timestamp.usecs if !(common_flags & 0x18)' > /sys/kernel/tracing/events/sched/sched_waking/trigger
  # echo 'hist:keys=next_pid:delta=common_timestamp.usecs-$ts:onmatch(sched.sched_waking).trace(wake_lat,next_comm,$delta)' > /sys/kernel/tracing/events/sched/sched_switch/trigger

Because the synthetic event's "wakee" field is created as a dynamic string (even though the string copied is not). The print format to print the dynamic string changed from "%*s" to "%s" because another location (__set_synth_event_print_fmt()) exported this to user space, and user space did not need that. But it is still used in print_synth_event(), and the output looks like:

  <idle>-0 [001] d..5. 193.428167: wake_lat: wakee=(efault)sshd-sessiondelta=155
  sshd-session-879 [001] d..5. 193.811080: wake_lat: wakee=(efault)kworker/u34:5delta=58
  <idle>-0 [002] d..5. 193.811198: wake_lat: wakee=(efault)bashdelta=91
  bash-880 [002] d..5. 193.811371: wake_lat: wakee=(efault)kworker/u35:2delta=21
  <idle>-0 [001] d..5. 193.811516: wake_lat: wakee=(efault)sshd-sessiondelta=129
  sshd-session-879 [001] d..5. 193.967576: wake_lat: wakee=(efault)kworker/u34:5delta=50

The length isn't needed as the string is always nul terminated. Just print the string and do not add the length (which was hard coded to the max string length anyway). Cc: stable@vger.kernel.org Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Tom Zanussi <zanussi@kernel.org> Cc: Douglas Raillard <douglas.raillard@arm.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Link: https://lore.kernel.org/20250407154139.69955768@gandalf.local.home Fixes: 4d38328eb442d ("tracing: Fix synth event printk format for str fields") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-04-09  dma/contiguous: avoid warning about unused size_bytes  (Arnd Bergmann)
When building with W=1, this variable is unused for configs with CONFIG_CMA_SIZE_SEL_PERCENTAGE=y: kernel/dma/contiguous.c:67:26: error: 'size_bytes' defined but not used [-Werror=unused-const-variable=] Change this to a macro to avoid the warning. Fixes: c64be2bb1c6e ("drivers: add Contiguous Memory Allocator") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/20250409151557.3890443-1-arnd@kernel.org
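The shape of the fix, roughly (the original initializer is assumed here, not quoted from the patch):

  /* before: an unreferenced const in some configs triggers
   * -Wunused-const-variable
   */
  static const phys_addr_t size_bytes __initconst =
  	(phys_addr_t)CMA_SIZE_MBYTES * SZ_1M;

  /* after: a macro cannot be "defined but not used" */
  #define size_bytes ((phys_addr_t)CMA_SIZE_MBYTES * SZ_1M)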
2025-04-09  sparc: mv sparc sysctls into their own file under arch/sparc/kernel  (Joel Granados)
Move sparc sysctls (reboot-cmd, stop-a, scons-poweroff and tsb-ratio) into a new file (arch/sparc/kernel/setup.c). This file will be included for both 32- and 64-bit sparc. Leave "tsb-ratio" under the SPARC64 ifdef as it was in kernel/sysctl.c. The sysctl table registration is done with arch_initcall, placing it after its original place in proc_root_init. This is part of a greater effort to move ctl tables into their respective subsystems, which will reduce the merge conflicts in kernel/sysctl.c. Signed-off-by: Joel Granados <joel.granados@kernel.org>
2025-04-09  stack_tracer: move sysctl registration to kernel/trace/trace_stack.c  (Joel Granados)
Move stack_tracer_enabled into trace_stack_sysctl_table. This is part of a greater effort to move ctl tables into their respective subsystems which will reduce the merge conflicts in kernel/sysctl.c. Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Joel Granados <joel.granados@kernel.org>
2025-04-09  tracing: Move trace sysctls into trace.c  (Joel Granados)
Move trace ctl tables into their own const array in kernel/trace/trace.c. The sysctl table registration is done with subsys_initcall, placing it after its original place in proc_root_init. This is part of a greater effort to move ctl tables into their respective subsystems, which will reduce the merge conflicts in kernel/sysctl.c. Signed-off-by: Joel Granados <joel.granados@kernel.org> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-04-09  signal: Move signal ctl tables into signal.c  (Joel Granados)
Move print-fatal-signals into its own const ctl table array in kernel/signal.c. This is part of a greater effort to move ctl tables into their respective subsystems which will reduce the merge conflicts in kernel/sysctl.c. Signed-off-by: Joel Granados <joel.granados@kernel.org>
2025-04-09  panic: Move panic ctl tables into panic.c  (Joel Granados)
Move panic, panic_on_oops, panic_print, panic_on_warn into kernel/panic.c. This is part of a greater effort to move ctl tables into their respective subsystems which will reduce the merge conflicts in kernel/sysctl.c. Signed-off-by: Joel Granados <joel.granados@kernel.org>
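The recurring pattern in this series, sketched here for the panic knobs (table entries abbreviated, function and table names approximate), is a const ctl_table owned by the subsystem plus its own initcall registration:

  static const struct ctl_table kern_panic_table[] = {
  	{
  		.procname	= "panic",
  		.data		= &panic_timeout,
  		.maxlen		= sizeof(int),
  		.mode		= 0644,
  		.proc_handler	= proc_dointvec,
  	},
  	/* panic_on_oops, panic_print, panic_on_warn entries elided */
  };

  static int __init kernel_panic_sysctls_init(void)
  {
  	register_sysctl_init("kernel", kern_panic_table);
  	return 0;
  }
  late_initcall(kernel_panic_sysctls_init);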
2025-04-08  genirq/generic-chip: Fix incorrect lock guard conversions  (Geert Uytterhoeven)
When booting BeagleBone Black:

  ------------[ cut here ]------------
  WARNING: CPU: 0 PID: 0 at kernel/locking/lockdep.c:4398 lockdep_hardirqs_on_prepare+0x23c/0x280
  DEBUG_LOCKS_WARN_ON(early_boot_irqs_disabled)
  CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.15.0-rc1-boneblack-00004-g195298c3b116 #209 NONE
  Hardware name: Generic AM33XX (Flattened Device Tree)
  Call trace:
   _raw_spin_unlock_irq from irq_map_generic_chip+0x144/0x190
   irq_map_generic_chip from irq_domain_associate_locked+0x68/0x164
   irq_domain_associate_locked from irq_create_fwspec_mapping+0x34c/0x43c
   irq_create_fwspec_mapping from irq_create_of_mapping+0x64/0x8c
   irq_create_of_mapping from irq_of_parse_and_map+0x54/0x7c
   irq_of_parse_and_map from dmtimer_clkevt_init_common+0x54/0x15c
   dmtimer_clkevt_init_common from dmtimer_systimer_init+0x41c/0x5b8
   dmtimer_systimer_init from timer_probe+0x68/0xf0
   timer_probe from start_kernel+0x4a4/0x6bc
   start_kernel from 0x0
  irq event stamp: 0
  hardirqs last enabled at (0): [<00000000>] 0x0
  hardirqs last disabled at (0): [<00000000>] 0x0
  softirqs last enabled at (0): [<00000000>] 0x0
  softirqs last disabled at (0): [<00000000>] 0x0
  ---[ end trace 0000000000000000 ]---

and:

  ------------[ cut here ]------------
  WARNING: CPU: 0 PID: 0 at init/main.c:1022 start_kernel+0x4e8/0x6bc
  Interrupts were enabled early
  CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Tainted: G W 6.15.0-rc1-boneblack-00004-g195298c3b116 #209 NONE
  Tainted: [W]=WARN
  Hardware name: Generic AM33XX (Flattened Device Tree)
  Call trace:
   unwind_backtrace from show_stack+0x10/0x14
   show_stack from dump_stack_lvl+0x6c/0x90
   dump_stack_lvl from __warn+0x70/0x1b0
   __warn from warn_slowpath_fmt+0x1d4/0x1ec
   warn_slowpath_fmt from start_kernel+0x4e8/0x6bc
   start_kernel from 0x0
  irq event stamp: 0
  hardirqs last enabled at (0): [<00000000>] 0x0
  hardirqs last disabled at (0): [<00000000>] 0x0
  softirqs last enabled at (0): [<00000000>] 0x0
  softirqs last disabled at (0): [<00000000>] 0x0
  ---[ end trace 0000000000000000 ]---

Fix this by correcting two misconversions of raw_spin_{,un}lock_irq{save,restore}() to lock guards. Fixes: 195298c3b11628a6 ("genirq/generic-chip: Convert core code to lock guards") Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/514f94c5891c61ac0a4a7fdad113e75db1eea367.1744135467.git.geert+renesas@glider.be
2025-04-08  Merge tag 'probes-fixes-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace  (Linus Torvalds)
Pull probes fixes from Masami Hiramatsu:
 - fprobe: remove fprobe_hlist_node when module unloading
   When a fprobe target module is removed, the fprobe_hlist_node should be removed from the fprobe's hash table to prevent accidental reuse if another module is loaded at the same address.
 - fprobe: lock module while registering fprobe
   The module containing the function to be probed is locked using a reference counter until the fprobe registration is complete, which prevents use after free.
 - fprobe-events: fix possible UAF on modules
   Basically the same as above, but in the fprobe-events layer we also need to get the module reference counter when we find the tracepoint in the module.
* tag 'probes-fixes-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: fprobe: Cleanup fprobe hash when module unloading
  tracing: fprobe events: Fix possible UAF on modules
  tracing: fprobe: Fix to lock module while registering fprobe
2025-04-08  Merge tag 'cgroup-for-6.15-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup  (Linus Torvalds)
Pull cgroup fixes from Tejun Heo:
 - A number of cpuset remote partition related fixes and cleanups along with selftest updates.
 - A change from this merge window caused cgroup_rstat_updated_list() to be called outside cgroup_rstat_lock, leading to list corruptions. Fix it by relocating the call inside the lock.
* tag 'cgroup-for-6.15-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup/cpuset: Fix race between newly created partition and dying one
  cgroup: rstat: call cgroup_rstat_updated_list with cgroup_rstat_lock
  selftest/cgroup: Add a remote partition transition test to test_cpuset_prs.sh
  selftest/cgroup: Clean up and restructure test_cpuset_prs.sh
  selftest/cgroup: Update test_cpuset_prs.sh to use | as effective CPUs and state separator
  cgroup/cpuset: Remove unneeded goto in sched_partition_write() and rename it
  cgroup/cpuset: Code cleanup and comment update
  cgroup/cpuset: Don't allow creation of local partition over a remote one
  cgroup/cpuset: Remove remote_partition_check() & make update_cpumasks_hier() handle remote partition
  cgroup/cpuset: Fix error handling in remote_partition_disable()
  cgroup/cpuset: Fix incorrect isolated_cpus update in update_parent_effective_cpumask()
2025-04-08  sched_ext: Merge branch 'for-6.15-fixes' into for-6.16  (Tejun Heo)
Pull for-6.15-fixes to receive: e776b26e3701 ("sched_ext: Remove cpu.weight / cpu.idle unimplemented warnings") which conflicts with: 1a7ff7216c8b ("sched_ext: Drop "ops" from scx_ops_enable_state and friends") The former removes code updated by the latter. Resolved by removing the updated section. Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-08  srcu: Use rcu_seq_done_exact() for polling API  (Joel Fernandes)
poll_state_synchronize_srcu() uses rcu_seq_done() unlike poll_state_synchronize_rcu(), which uses rcu_seq_done_exact(). The rcu_seq_done_exact() makes more sense for the polling API, as with this API there is a higher chance of a significant delay between the get_state..() and poll_state..() calls, since a cookie can be stored and reused at a later time. During such a delay, if the gp_seq counter progresses more than ULONG_MAX/2 distance, then poll_state..() may keep returning false for a long time. Fix by using the more accurate rcu_seq_done_exact() API, which is exactly what straight RCU's polling does. It may make sense, as future work, to add debug code here as well, where we compare a physical timestamp between the get_state..() and poll_state..() calls and yell if significant time has passed but the grace period has still not progressed. Reviewed-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev> Cc: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
2025-04-08  sched/isolation: Make use of more than one housekeeping cpu  (Phil Auld)
The existing code uses housekeeping_any_cpu() to select a cpu for a given housekeeping task. However, this often ends up calling cpumask_any_and(), which is defined as cpumask_first_and(), which has the effect of always using the first cpu among those available. The same applies when multiple NUMA nodes are involved. In that case the first cpu in the local node is chosen, which does provide a bit of spreading, but with multiple HK cpus per node the same issues arise. We have numerous cases where a single HK cpu just cannot keep up and the remote_tick warning fires. It also can lead to the other things (orchestration sw, HA keepalives etc) on the HK cpus getting starved, which leads to other issues. In these cases we recommend increasing the number of HK cpus. But... that only helps the userspace tasks somewhat. It does not help the actual housekeeping part. Spread the HK work out by having housekeeping_any_cpu() and sched_numa_find_closest() use cpumask_any_and_distribute() instead of cpumask_any_and(). Signed-off-by: Phil Auld <pauld@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Waiman Long <longman@redhat.com> Reviewed-by: Vishal Chourasia <vishalc@linux.ibm.com> Acked-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20250218184618.1331715-1-pauld@redhat.com
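The core of the change is a one-line swap at the CPU-selection sites; a sketch of the pattern (surrounding context omitted):

  /* before: always picks the first matching CPU */
  cpu = cpumask_any_and(housekeeping_cpumask(type), cpu_online_mask);

  /* after: rotates across the available housekeeping CPUs */
  cpu = cpumask_any_and_distribute(housekeeping_cpumask(type), cpu_online_mask);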
2025-04-08  sched/rt: Fix race in push_rt_task  (Harshit Agarwal)
Overview
========
When a CPU chooses to call push_rt_task and picks a task to push to another CPU's runqueue, it will call the find_lock_lowest_rq method, which takes a double lock on both CPUs' runqueues. If one of the locks isn't readily available, it may lead to dropping the current runqueue lock and reacquiring both locks at once. During this window it is possible that the task has already been migrated and is running on some other CPU. These cases are already handled. However, if the task has been migrated, has already been executed, and another CPU is now trying to wake it up (ttwu) such that it is queued again on the runqueue (on_rq is 1), and also if the task was run by the same CPU, then the current checks will pass even though the task was migrated out and is no longer in the pushable tasks list.

Crashes
=======
This bug resulted in quite a few flavors of crashes triggering kernel panics with various crash signatures such as assert failures, page faults, null pointer dereferences, and queue corruption errors, all coming from the scheduler itself. Some of the crashes:

-> kernel BUG at kernel/sched/rt.c:1616! BUG_ON(idx >= MAX_RT_PRIO)
   Call Trace:
   ? __die_body+0x1a/0x60 ? die+0x2a/0x50 ? do_trap+0x85/0x100
   ? pick_next_task_rt+0x6e/0x1d0 ? do_error_trap+0x64/0xa0
   ? pick_next_task_rt+0x6e/0x1d0 ? exc_invalid_op+0x4c/0x60
   ? pick_next_task_rt+0x6e/0x1d0 ? asm_exc_invalid_op+0x12/0x20
   ? pick_next_task_rt+0x6e/0x1d0
   __schedule+0x5cb/0x790 ? update_ts_time_stats+0x55/0x70
   schedule_idle+0x1e/0x40 do_idle+0x15e/0x200 cpu_startup_entry+0x19/0x20
   start_secondary+0x117/0x160 secondary_startup_64_no_verify+0xb0/0xbb

-> BUG: kernel NULL pointer dereference, address: 00000000000000c0
   Call Trace:
   ? __die_body+0x1a/0x60 ? no_context+0x183/0x350 ? __warn+0x8a/0xe0
   ? exc_page_fault+0x3d6/0x520 ? asm_exc_page_fault+0x1e/0x30
   ? pick_next_task_rt+0xb5/0x1d0 ? pick_next_task_rt+0x8c/0x1d0
   __schedule+0x583/0x7e0 ? update_ts_time_stats+0x55/0x70
   schedule_idle+0x1e/0x40 do_idle+0x15e/0x200 cpu_startup_entry+0x19/0x20
   start_secondary+0x117/0x160 secondary_startup_64_no_verify+0xb0/0xbb

-> BUG: unable to handle page fault for address: ffff9464daea5900
   kernel BUG at kernel/sched/rt.c:1861! BUG_ON(rq->cpu != task_cpu(p))

-> kernel BUG at kernel/sched/rt.c:1055! BUG_ON(!rq->nr_running)
   Call Trace:
   ? __die_body+0x1a/0x60 ? die+0x2a/0x50 ? do_trap+0x85/0x100
   ? dequeue_top_rt_rq+0xa2/0xb0 ? do_error_trap+0x64/0xa0
   ? dequeue_top_rt_rq+0xa2/0xb0 ? exc_invalid_op+0x4c/0x60
   ? dequeue_top_rt_rq+0xa2/0xb0 ? asm_exc_invalid_op+0x12/0x20
   ? dequeue_top_rt_rq+0xa2/0xb0
   dequeue_rt_entity+0x1f/0x70 dequeue_task_rt+0x2d/0x70 __schedule+0x1a8/0x7e0
   ? blk_finish_plug+0x25/0x40
   schedule+0x3c/0xb0 futex_wait_queue_me+0xb6/0x120 futex_wait+0xd9/0x240
   do_futex+0x344/0xa90
   ? get_mm_exe_file+0x30/0x60 ? audit_exe_compare+0x58/0x70
   ? audit_filter_rules.constprop.26+0x65e/0x1220
   __x64_sys_futex+0x148/0x1f0 do_syscall_64+0x30/0x80
   entry_SYSCALL_64_after_hwframe+0x62/0xc7

-> BUG: unable to handle page fault for address: ffff8cf3608bc2c0
   Call Trace:
   ? __die_body+0x1a/0x60 ? no_context+0x183/0x350
   ? spurious_kernel_fault+0x171/0x1c0 ? exc_page_fault+0x3b6/0x520
   ? plist_check_list+0x15/0x40 ? plist_check_list+0x2e/0x40
   ? asm_exc_page_fault+0x1e/0x30 ? _cond_resched+0x15/0x30
   ? futex_wait_queue_me+0xc8/0x120 ? futex_wait+0xd9/0x240
   ? try_to_wake_up+0x1b8/0x490 ? futex_wake+0x78/0x160 ? do_futex+0xcd/0xa90
   ? plist_check_list+0x15/0x40 ? plist_check_list+0x2e/0x40
   ? plist_del+0x6a/0xd0
   ? plist_check_list+0x15/0x40 ? plist_check_list+0x2e/0x40
   ? dequeue_pushable_task+0x20/0x70 ? __schedule+0x382/0x7e0
   ? asm_sysvec_reschedule_ipi+0xa/0x20 ? schedule+0x3c/0xb0
   ? exit_to_user_mode_prepare+0x9e/0x150 ? irqentry_exit_to_user_mode+0x5/0x30
   ? asm_sysvec_reschedule_ipi+0x12/0x20

Above are some of the common examples of the crashes that were observed due to this issue.

Details
=======
Let's look at the following scenario to understand this race.

1) CPU A enters push_rt_task
  a) CPU A has chosen next_task = task p.
  b) CPU A calls find_lock_lowest_rq(Task p, CPU Z’s rq).
  c) CPU A identifies CPU X as a destination CPU (X < Z).
  d) CPU A enters double_lock_balance(CPU Z’s rq, CPU X’s rq).
  e) Since X is lower than Z, CPU A unlocks CPU Z’s rq. Someone else has locked CPU X’s rq, and thus, CPU A must wait.

2) At CPU Z
  a) The previous task has completed execution and thus CPU Z enters schedule, and locks its own rq after CPU A releases it.
  b) CPU Z dequeues the previous task and begins executing task p.
  c) CPU Z unlocks its rq.
  d) Task p yields the CPU (ex. by doing IO or waiting to acquire a lock), which triggers the schedule function on CPU Z.
  e) CPU Z enters schedule again, locks its own rq, and dequeues task p.
  f) As part of dequeue, it sets p.on_rq = 0 and unlocks its rq.

3) At CPU B
  a) CPU B enters try_to_wake_up with input task p.
  b) Since CPU Z dequeued task p, p.on_rq = 0, and CPU B updates B.state = WAKING.
  c) CPU B via select_task_rq determines CPU Y as the target CPU.

4) The race
  a) CPU A acquires CPU X’s lock and relocks CPU Z.
  b) CPU A reads task p.cpu = Z and incorrectly concludes task p is still on CPU Z.
  c) CPU A failed to notice task p had been dequeued from CPU Z while CPU A was waiting for locks in double_lock_balance. If CPU A had known that task p had been dequeued, it would return NULL, forcing push_rt_task to give up task p's migration.
  d) CPU B updates task p.cpu = Y and calls ttwu_queue.
  e) CPU B locks Y's rq. CPU B enqueues task p onto Y and sets task p.on_rq = 1.
  f) CPU B unlocks CPU Y, triggering memory synchronization.
  g) CPU A reads task p.on_rq = 1, cementing its assumption that task p has not migrated.
  h) CPU A decides to migrate p to CPU X.

This leads to A dequeuing p from Y's queue and various crashes down the line.

Solution
========
The solution here is fairly simple. After obtaining the lock (at 4a), the check is enhanced to make sure that the task is still at the head of the pushable tasks list. If not, then it is anyway not suitable for being pushed out.

Testing
=======
The fix is tested on a cluster of 3 nodes, where the panics due to this are hit every couple of days. A fix similar to this was deployed on such a cluster and was stable for more than 30 days.

Co-developed-by: Jon Kohler <jon@nutanix.com> Signed-off-by: Jon Kohler <jon@nutanix.com> Co-developed-by: Gauri Patwardhan <gauri.patwardhan@nutanix.com> Signed-off-by: Gauri Patwardhan <gauri.patwardhan@nutanix.com> Co-developed-by: Rahul Chunduru <rahul.chunduru@nutanix.com> Signed-off-by: Rahul Chunduru <rahul.chunduru@nutanix.com> Signed-off-by: Harshit Agarwal <harshit@nutanix.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: "Steven Rostedt (Google)" <rostedt@goodmis.org> Reviewed-by: Phil Auld <pauld@redhat.com> Tested-by: Will Ton <william.ton@nutanix.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20250225180553.167995-1-harshit@nutanix.com
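The strengthened recheck after the double lock is retaken can be sketched as follows (condition shape approximate, not the verbatim hunk):

  if (unlikely(task != pick_next_pushable_task(rq))) {
  	/* the task was migrated or requeued elsewhere while we waited
  	 * for the locks; give up on pushing it
  	 */
  	double_unlock_balance(rq, lowest_rq);
  	lowest_rq = NULL;
  	break;
  }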
2025-04-08  sched: Add annotations to RT_GROUP_SCHED fields  (Michal Koutný)
Update comments to ease RT throttling understanding. Signed-off-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250310170442.504716-10-mkoutny@suse.com
2025-04-08  rcu: Comment on the extraneous delta test on rcu_seq_done_exact()  (Frederic Weisbecker)
The numbers used in rcu_seq_done_exact() lack some explanation behind their magic, especially after the commit:

  85aad7cc4178 ("rcu: Fix get_state_synchronize_rcu_full() GP-start detection")

which reported a subtle issue where a new GP sequence snapshot was taken on the root node state while a grace period had already been started and reflected on the global state sequence but not yet on the root node sequence, making a polling user wait on a wrong, already-started grace period that would ignore freshly online CPUs.

The fix involved taking the snapshot on the global state sequence and waiting on the root node sequence. And since a grace period is first started on the global state and only afterwards reflected on the root node, a snapshot taken on the global state sequence might be two full grace periods ahead of the root node, as in the following example:

  rnp->gp_seq = rcu_state.gp_seq = 0

  CPU 0                                    CPU 1
  -----                                    -----
  // rcu_state.gp_seq = 1
  rcu_seq_start(&rcu_state.gp_seq)
                                           // snap = 8
                                           snap = rcu_seq_snap(&rcu_state.gp_seq)
                                           // Two full GP differences
                                           rcu_seq_done_exact(&rnp->gp_seq, snap)
  // rnp->gp_seq = 1
  WRITE_ONCE(rnp->gp_seq, rcu_state.gp_seq);

Add a comment about those expectations and to clarify the magic within the relevant function.

Note that the issue arises mainly with the use of rcu_seq_done_exact(), which has a much tighter guardband (of 2 GPs) to ensure the false-negative window of the API during wraparound is limited to just 2 GPs. rcu_seq_done() does not have such strict requirements; however, its large false-negative window of ULONG_MAX/2 is not ideal for the polling API. However, this also means care is needed to ensure the guardband is as large as needed to avoid the example scenario described above, which a warning added in an earlier patch does.

[ Comment wordsmithing by Joel ] Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
2025-04-08  sched: Add RT_GROUP WARN checks for non-root task_groups  (Michal Koutný)
With CONFIG_RT_GROUP_SCHED but runtime disabling of RT_GROUPs we expect the existence of the root task_group only and all rt_sched_entity'ies should be queued on root's rt_rq. If we get a non-root RT_GROUP something went wrong. Signed-off-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250310170442.504716-9-mkoutny@suse.com
2025-04-08  rcu: Add warning to ensure rcu_seq_done_exact() is working  (Joel Fernandes)
The previous patch improved the rcu_seq_done_exact() function by adding a meaningful constant for the guardband. Ensure that this is working for the future by a quick check during rcu_gp_init(). Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
2025-04-08  sched: Do not construct nor expose RT_GROUP_SCHED structures if disabled  (Michal Koutný)
Thanks to kernel cmdline being available early, before any cgroup hierarchy exists, we can achieve the RT_GROUP_SCHED boottime disabling goal by simply skipping any creation (and destruction) of RT_GROUP data and its exposure via RT attributes. We can do this thanks to previously placed runtime guards that would redirect all operations to root_task_group's data when RT_GROUP_SCHED disabled. Signed-off-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250310170442.504716-8-mkoutny@suse.com
2025-04-08  rcu: Replace magic number with meaningful constant in rcu_seq_done_exact()  (Joel Fernandes)
The rcu_seq_done_exact() function checks if a grace period has completed by comparing sequence numbers. It includes a guard band to handle sequence number wraparound, which was previously expressed using the magic number calculation '3 * RCU_SEQ_STATE_MASK + 1'. This magic number is not immediately obvious in terms of what it represents.

Instead, the reason we need this tiny guardband is the lag between the setting of rcu_state.gp_seq_polled and the root rnp's gp_seq in rcu_gp_init(). This guardband needs to be at least 2 GPs worth of counts, to avoid recognizing the newly started GP as completed immediately, due to the following sequence, which arises from the delay between the update of rcu_state.gp_seq_polled and the root rnp's gp_seq:

  rnp->gp_seq = rcu_state.gp_seq = 0

  CPU 0                                    CPU 1
  -----                                    -----
  // rcu_state.gp_seq = 1
  rcu_seq_start(&rcu_state.gp_seq)
                                           // snap = 8
                                           snap = rcu_seq_snap(&rcu_state.gp_seq)
                                           // Two full GP differences
                                           rcu_seq_done_exact(&rnp->gp_seq, snap)
  // rnp->gp_seq = 1
  WRITE_ONCE(rnp->gp_seq, rcu_state.gp_seq);

This can happen because get_state_synchronize_rcu_full() samples rcu_state.gp_seq_polled, while poll_state_synchronize_rcu_full() samples the root rnp's gp_seq. The delay between the updates of the two counters occurs in rcu_gp_init(), during which the counters briefly go out of sync.

Make the guardband explicitly 2 GPs. This improves code readability and maintainability by making the intent clearer as well. Suggested-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
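A sketch of what "two full grace periods" means numerically (the constant name is hypothetical and the actual hunk differs): one GP advances gp_seq by RCU_SEQ_STATE_MASK + 1, so the guardband becomes:

  #define RCU_SEQ_GP	(RCU_SEQ_STATE_MASK + 1)	/* hypothetical: gp_seq delta of one full GP */

  /* guardband of two full GPs instead of the magic 3 * RCU_SEQ_STATE_MASK + 1 */
  unsigned long guard = 2 * RCU_SEQ_GP;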
2025-04-08  sched: Bypass bandwitdh checks with runtime disabled RT_GROUP_SCHED  (Michal Koutný)
When RT_GROUPs are compiled but not exposed, their bandwidth cannot be configured (and it is not initialized for non-root task_groups either). Therefore bypass any checks of task vs task_group bandwidth. This will achieve behavior very similar to setups that have !CONFIG_RT_GROUP_SCHED and attach the cpu controller to a cgroup v2 hierarchy. (On a related note, this may allow having RT tasks with CONFIG_RT_GROUP_SCHED and a cgroup v2 hierarchy.) Signed-off-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250310170442.504716-7-mkoutny@suse.com
2025-04-08  sched: Skip non-root task_groups with disabled RT_GROUP_SCHED  (Michal Koutný)
First, we want to prevent placement of RT tasks on non-root rt_rqs, which we achieve in the task migration code that'd fall back to root_task_group's rt_rq. Second, we want to work with only root_task_group's rt_rq when iterating all "real" rt_rqs when RT_GROUP is disabled. To achieve this we keep root_task_group as the first one on the task_groups list and break out quickly. Signed-off-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250310170442.504716-6-mkoutny@suse.com