summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2023-11-19Merge tag 'timers_urgent_for_v6.7_rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fix from Borislav Petkov: - Do the push of pending hrtimers away from a CPU which is being offlined earlier in the offlining process in order to prevent a deadlock * tag 'timers_urgent_for_v6.7_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: hrtimers: Push pending hrtimers away from outgoing CPU earlier
2023-11-19Merge tag 'sched_urgent_for_v6.7_rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Borislav Petkov: - Fix virtual runtime calculation when recomputing a sched entity's weights - Fix wrongly rejected unprivileged poll requests to the cgroup psi pressure files - Make sure the load balancing is done by only one CPU * tag 'sched_urgent_for_v6.7_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/fair: Fix the decision for load balance sched: psi: fix unprivileged polling against cgroups sched/eevdf: Fix vruntime adjustment on reweight
2023-11-19Merge tag 'locking_urgent_for_v6.7_rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking fix from Borislav Petkov: - Fix a hardcoded futex flags case which lead to one robust futex test failure * tag 'locking_urgent_for_v6.7_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: futex: Fix hardcoded flags
2023-11-19Merge tag 'perf_urgent_for_v6.7_rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fix from Borislav Petkov: - Make sure the context refcount is transferred too when migrating perf events * tag 'perf_urgent_for_v6.7_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/core: Fix cpuctx refcounting
2023-11-19bpf: bpf_iter_task_next: use next_task(kit->task) rather than ↵Oleg Nesterov
next_task(kit->pos) This looks more clear and simplifies the code. While at it, remove the unnecessary initialization of pos/task at the start of bpf_iter_task_new(). Note that we can even kill kit->task, we can just use pos->group_leader, but I don't understand the BUILD_BUG_ON() checks in bpf_iter_task_new(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20231114163239.GA903@redhat.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-19bpf: bpf_iter_task_next: use __next_thread() rather than next_thread()Oleg Nesterov
Lockless use of next_thread() should be avoided, kernel/bpf/task_iter.c is the last user and the usage is wrong. bpf_iter_task_next() can loop forever, "kit->pos == kit->task" can never happen if kit->pos execs. Change this code to use __next_thread(). With or without this change the usage of kit->pos/task and next_task() doesn't look nice, see the next patch. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20231114163237.GA897@redhat.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-19bpf: task_group_seq_get_next: use __next_thread() rather than next_thread()Oleg Nesterov
Lockless use of next_thread() should be avoided, kernel/bpf/task_iter.c is the last user and the usage is wrong. task_group_seq_get_next() can return the group leader twice if it races with mt-thread exec which changes the group->leader's pid. Change the main loop to use __next_thread(), kill "next_tid == common->pid" check. __next_thread() can't loop forever, we can also change this code to retry if next_tid == 0. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20231114163234.GA890@redhat.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-18Merge tag 'parisc-for-6.7-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux Pull parisc fixes from Helge Deller: "On parisc we still sometimes need writeable stacks, e.g. if programs aren't compiled with gcc-14. To avoid issues with the upcoming systemd-254 we therefore have to disable prctl(PR_SET_MDWE) for now (for parisc only). The other two patches are minor: a bugfix for the soft power-off on qemu with 64-bit kernel and prefer strscpy() over strlcpy(): - Fix power soft-off on qemu - Disable prctl(PR_SET_MDWE) since parisc sometimes still needs writeable stacks - Use strscpy instead of strlcpy in show_cpuinfo()" * tag 'parisc-for-6.7-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux: prctl: Disable prctl(PR_SET_MDWE) on parisc parisc/power: Fix power soft-off when running on qemu parisc: Replace strlcpy() with strscpy()
2023-11-18bpf: emit frameno for PTR_TO_STACK regs if it differs from current oneAndrii Nakryiko
It's possible to pass a pointer to parent's stack to child subprogs. In such case verifier state output is ambiguous not showing whether register container a pointer to "current" stack, belonging to current subprog (frame), or it's actually a pointer to one of parent frames. So emit this information if frame number differs between the state which register is part of. E.g., if current state is in frame 2 and it has a register pointing to stack in grand parent state (frame #0), we'll see something like 'R1=fp[0]-16', while "local stack pointer" will be just 'R2=fp-16'. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231118034623.3320920-9-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-18bpf: smarter verifier log number printing logicAndrii Nakryiko
Instead of always printing numbers as either decimals (and in some cases, like for "imm=%llx", in hexadecimals), decide the form based on actual values. For numbers in a reasonably small range (currently, [0, U16_MAX] for unsigned values, and [S16_MIN, S16_MAX] for signed ones), emit them as decimals. In all other cases, even for signed values, emit them in hexadecimals. For large values hex form is often times way more useful: it's easier to see an exact difference between 0xffffffff80000000 and 0xffffffff7fffffff, than between 18446744071562067966 and 18446744071562067967, as one particular example. Small values representing small pointer offsets or application constants, on the other hand, are way more useful to be represented in decimal notation. Adjust reg_bounds register state parsing logic to take into account this change. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231118034623.3320920-8-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-18bpf: omit default off=0 and imm=0 in register state logAndrii Nakryiko
Simplify BPF verifier log further by omitting default (and frequently irrelevant) off=0 and imm=0 parts for non-SCALAR_VALUE registers. As can be seen from fixed tests, this is often a visual noise for PTR_TO_CTX register and even for PTR_TO_PACKET registers. Omitting default values follows the rest of register state logic: we omit default values to keep verifier log succinct and to highlight interesting state that deviates from default one. E.g., we do the same for var_off, when it's unknown, which gives no additional information. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231118034623.3320920-7-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-18bpf: emit map name in register state if applicable and availableAndrii Nakryiko
In complicated real-world applications, whenever debugging some verification error through verifier log, it often would be very useful to see map name for PTR_TO_MAP_VALUE register. Usually this needs to be inferred from key/value sizes and maybe trying to guess C code location, but it's not always clear. Given verifier has the name, and it's never too long, let's just emit it for ptr_to_map_key, ptr_to_map_value, and const_ptr_to_map registers. We reshuffle the order a bit, so that map name, key size, and value size appear before offset and immediate values, which seems like a more logical order. Current output: R1_w=map_ptr(map=array_map,ks=4,vs=8,off=0,imm=0) But we'll get rid of useless off=0 and imm=0 parts in the next patch. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231118034623.3320920-6-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-18bpf: print spilled register state in stack slotAndrii Nakryiko
Print the same register state representation when printing stack state, as we do for normal registers. Note that if stack slot contains subregister spill (1, 2, or 4 byte long), we'll still emit "m0?" mask for those bytes that are not part of spilled register. While means we can get something like fp-8=0000scalar() for a 4-byte spill with other 4 bytes still being STACK_ZERO. Some example before and after, taken from the log of pyperf_subprogs.bpf.o: 49: (7b) *(u64 *)(r10 -256) = r1 ; frame1: R1_w=ctx(off=0,imm=0) R10=fp0 fp-256_w=ctx 49: (7b) *(u64 *)(r10 -256) = r1 ; frame1: R1_w=ctx(off=0,imm=0) R10=fp0 fp-256_w=ctx(off=0,imm=0) 150: (7b) *(u64 *)(r10 -264) = r0 ; frame1: R0_w=map_value_or_null(id=6,off=0,ks=192,vs=4,imm=0) R10=fp0 fp-264_w=map_value_or_null 150: (7b) *(u64 *)(r10 -264) = r0 ; frame1: R0_w=map_value_or_null(id=6,off=0,ks=192,vs=4,imm=0) R10=fp0 fp-264_w=map_value_or_null(id=6,off=0,ks=192,vs=4,imm=0) 5192: (61) r1 = *(u32 *)(r10 -272) ; frame1: R1_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=15,var_off=(0x0; 0xf)) R10=fp0 fp-272= 5192: (61) r1 = *(u32 *)(r10 -272) ; frame1: R1_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=15,var_off=(0x0; 0xf)) R10=fp0 fp-272=????scalar(smin=smin32=0,smax=umax=smax32=umax32=15,var_off=(0x0; 0xf)) While at it, do a few other simple clean ups: - skip slot if it's not scratched before detecting whether it's valid; - move taking spilled_reg pointer outside of switch (only DYNPTR has to adjust that to get to the "main" slot); - don't recalculate types_buf second time for MISC/ZERO/default case. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231118034623.3320920-5-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-18bpf: extract register state printingAndrii Nakryiko
Extract printing register state representation logic into a separate helper, as we are going to reuse it for spilled register state printing in the next patch. This also nicely reduces code nestedness. No functional changes. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231118034623.3320920-4-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-18bpf: move verifier state printing code to kernel/bpf/log.cAndrii Nakryiko
Move a good chunk of code from verifier.c to log.c: verifier state verbose printing logic. This is an important and very much logging/debugging oriented code. It fits the overlall log.c's focus on verifier logging, and moving it allows to keep growing it without unnecessarily adding to verifier.c code that otherwise contains a core verification logic. There are not many shared dependencies between this code and the rest of verifier.c code, except a few single-line helpers for various register type checks and a bit of state "scratching" helpers. We move all such trivial helpers into include/bpf/bpf_verifier.h as static inlines. No functional changes in this patch. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231118034623.3320920-3-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-18bpf: move verbose_linfo() into kernel/bpf/log.cAndrii Nakryiko
verifier.c is huge. Let's try to move out parts that are logging-related into log.c, as we previously did with bpf_log() and other related stuff. This patch moves line info verbose output routines: it's pretty self-contained and isolated code, so there is no problem with this. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231118034623.3320920-2-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-18prctl: Disable prctl(PR_SET_MDWE) on pariscHelge Deller
systemd-254 tries to use prctl(PR_SET_MDWE) for it's MemoryDenyWriteExecute functionality, but fails on parisc which still needs executable stacks in certain combinations of gcc/glibc/kernel. Disable prctl(PR_SET_MDWE) by returning -EINVAL for now on parisc, until userspace has catched up. Signed-off-by: Helge Deller <deller@gmx.de> Co-developed-by: Linus Torvalds <torvalds@linux-foundation.org> Reported-by: Sam James <sam@gentoo.org> Closes: https://github.com/systemd/systemd/issues/29775 Tested-by: Sam James <sam@gentoo.org> Link: https://lore.kernel.org/all/875y2jro9a.fsf@gentoo.org/ Cc: <stable@vger.kernel.org> # v6.3+
2023-11-18userns: eliminate many kernel-doc warningsRandy Dunlap
Drop the kernel-doc "/**" notation from 8 structs or functions to prevent 22 kernel-doc warnings (samples below). user_namespace.c:239: warning: Function parameter or member 'map_up' not described in 'idmap_key' user_namespace.c:246: warning: Function parameter or member 'k' not described in 'cmp_map_id' user_namespace.c:277: warning: Function parameter or member 'extents' not described in 'map_id_range_down_max' user_namespace.c:295: warning: Function parameter or member 'extents' not described in 'map_id_range_down_base' user_namespace.c:344: warning: Function parameter or member 'extents' not described in 'map_id_up_base' user_namespace.c:364: warning: Function parameter or member 'extents' not described in 'map_id_up_max' user_namespace.c:776: warning: Function parameter or member 'map' not described in 'insert_extent' user_namespace.c:844: warning: Function parameter or member 'map' not described in 'sort_idmaps' Fixes: 6397fac4915a ("userns: bump idmap limits to 340") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Link: https://lore.kernel.org/r/20230830163215.13193-1-rdunlap@infradead.org Cc: Eric Biederman <ebiederm@xmission.com> Cc: Christian Brauner <brauner@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-11-17bpf: rename BPF_F_TEST_SANITY_STRICT to BPF_F_TEST_REG_INVARIANTSAndrii Nakryiko
Rename verifier internal flag BPF_F_TEST_SANITY_STRICT to more neutral BPF_F_TEST_REG_INVARIANTS. This is a follow up to [0]. A few selftests and veristat need to be adjusted in the same patch as well. [0] https://patchwork.kernel.org/project/netdevbpf/patch/20231112010609.848406-5-andrii@kernel.org/ Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231117171404.225508-1-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-17Merge tag 'audit-pr-20231116' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit Pull audit fix from Paul Moore: "One small audit patch to convert a WARN_ON_ONCE() into a normal conditional to avoid scary looking console warnings when eBPF code generates audit records from unexpected places" * tag 'audit-pr-20231116' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit: audit: don't WARN_ON_ONCE(!current->mm) in audit_exe_compare()
2023-11-16Merge tag 'net-6.7-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from BPF and netfilter. Current release - regressions: - core: fix undefined behavior in netdev name allocation - bpf: do not allocate percpu memory at init stage - netfilter: nf_tables: split async and sync catchall in two functions - mptcp: fix possible NULL pointer dereference on close Current release - new code bugs: - eth: ice: dpll: fix initial lock status of dpll Previous releases - regressions: - bpf: fix precision backtracking instruction iteration - af_unix: fix use-after-free in unix_stream_read_actor() - tipc: fix kernel-infoleak due to uninitialized TLV value - eth: bonding: stop the device in bond_setup_by_slave() - eth: mlx5: - fix double free of encap_header - avoid referencing skb after free-ing in drop path - eth: hns3: fix VF reset - eth: mvneta: fix calls to page_pool_get_stats Previous releases - always broken: - core: set SOCK_RCU_FREE before inserting socket into hashtable - bpf: fix control-flow graph checking in privileged mode - eth: ppp: limit MRU to 64K - eth: stmmac: avoid rx queue overrun - eth: icssg-prueth: fix error cleanup on failing initialization - eth: hns3: fix out-of-bounds access may occur when coalesce info is read via debugfs - eth: cortina: handle large frames Misc: - selftests: gso: support CONFIG_MAX_SKB_FRAGS up to 45" * tag 'net-6.7-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (78 commits) macvlan: Don't propagate promisc change to lower dev in passthru net: sched: do not offload flows with a helper in act_ct net/mlx5e: Check return value of snprintf writing to fw_version buffer for representors net/mlx5e: Check return value of snprintf writing to fw_version buffer net/mlx5e: Reduce the size of icosq_str net/mlx5: Increase size of irq name buffer net/mlx5e: Update doorbell for port timestamping CQ before the software counter net/mlx5e: Track xmit submission to PTP WQ after populating metadata map net/mlx5e: Avoid referencing skb after free-ing in drop path of mlx5e_sq_xmit_wqe net/mlx5e: Don't modify the peer sent-to-vport rules for IPSec offload net/mlx5e: Fix pedit endianness net/mlx5e: fix double free of encap_header in update funcs net/mlx5e: fix double free of encap_header net/mlx5: Decouple PHC .adjtime and .adjphase implementations net/mlx5: DR, Allow old devices to use multi destination FTE net/mlx5: Free used cpus mask when an IRQ is released Revert "net/mlx5: DR, Supporting inline WQE when possible" bpf: Do not allocate percpu memory at init stage net: Fix undefined behavior in netdev name allocation dt-bindings: net: ethernet-controller: Fix formatting error ...
2023-11-15bpf: make __reg{32,64}_deduce_bounds logic more robustAndrii Nakryiko
This change doesn't seem to have any effect on selftests and production BPF object files, but we preemptively try to make it more robust. First, "learn sign from signed bounds" comment is misleading, as we are learning not just sign, but also values. Second, we simplify the check for determining whether entire range is positive or negative similarly to other checks added earlier, using appropriate u32/u64 cast and single comparisons. As explain in comments in __reg64_deduce_bounds(), the checks are equivalent. Last but not least, smin/smax and s32_min/s32_max reassignment based on min/max of both umin/umax and smin/smax (and 32-bit equivalents) is hard to explain and justify. We are updating unsigned bounds from signed bounds, why would we update signed bounds at the same time? This might be correct, but it's far from obvious why and the code or comments don't try to justify this. Given we've added a separate deduction of signed bounds from unsigned bounds earlier, this seems at least redundant, if not just wrong. In short, we remove doubtful pieces, and streamline the rest to follow the logic and approach of the rest of reg_bounds_sync() checks. Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231112010609.848406-7-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-15bpf: remove redundant s{32,64} -> u{32,64} deduction logicAndrii Nakryiko
Equivalent checks were recently added in more succinct and, arguably, safer form in: - f188765f23a5 ("bpf: derive smin32/smax32 from umin32/umax32 bounds"); - 2e74aef782d3 ("bpf: derive smin/smax from umin/max bounds"). The checks we are removing in this patch set do similar checks to detect if entire u32/u64 range has signed bit set or not set, but does it with two separate checks. Further, we forcefully overwrite either smin or smax (and 32-bit equvalents) without applying normal min/max intersection logic. It's not clear why that would be correct in all cases and seems to work by accident. This logic is also "gated" by previous signed -> unsigned derivation, which returns early. All this is quite confusing and seems error-prone, while we already have at least equivalent checks happening earlier. So remove this duplicate and error-prone logic to simplify things a bit. Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231112010609.848406-6-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-15bpf: add register bounds sanity checks and sanitizationAndrii Nakryiko
Add simple sanity checks that validate well-formed ranges (min <= max) across u64, s64, u32, and s32 ranges. Also for cases when the value is constant (either 64-bit or 32-bit), we validate that ranges and tnums are in agreement. These bounds checks are performed at the end of BPF_ALU/BPF_ALU64 operations, on conditional jumps, and for LDX instructions (where subreg zero/sign extension is probably the most important to check). This covers most of the interesting cases. Also, we validate the sanity of the return register when manually adjusting it for some special helpers. By default, sanity violation will trigger a warning in verifier log and resetting register bounds to "unbounded" ones. But to aid development and debugging, BPF_F_TEST_SANITY_STRICT flag is added, which will trigger hard failure of verification with -EFAULT on register bounds violations. This allows selftests to catch such issues. veristat will also gain a CLI option to enable this behavior. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Link: https://lore.kernel.org/r/20231112010609.848406-5-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-15bpf: enhance BPF_JEQ/BPF_JNE is_branch_taken logicAndrii Nakryiko
Use 32-bit subranges to prune some 64-bit BPF_JEQ/BPF_JNE conditions that otherwise would be "inconclusive" (i.e., is_branch_taken() would return -1). This can happen, for example, when registers are initialized as 64-bit u64/s64, then compared for inequality as 32-bit subregisters, and then followed by 64-bit equality/inequality check. That 32-bit inequality can establish some pattern for lower 32 bits of a register (e.g., s< 0 condition determines whether the bit #31 is zero or not), while overall 64-bit value could be anything (according to a value range representation). This is not a fancy quirky special case, but actually a handling that's necessary to prevent correctness issue with BPF verifier's range tracking: set_range_min_max() assumes that register ranges are non-overlapping, and if that condition is not guaranteed by is_branch_taken() we can end up with invalid ranges, where min > max. [0] https://lore.kernel.org/bpf/CACkBjsY2q1_fUohD7hRmKGqv1MV=eP2f6XK8kjkYNw7BaiF8iQ@mail.gmail.com/ Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231112010609.848406-4-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-15bpf: generalize is_scalar_branch_taken() logicAndrii Nakryiko
Generalize is_branch_taken logic for SCALAR_VALUE register to handle cases when both registers are not constants. Previously supported <range> vs <scalar> cases are a natural subset of more generic <range> vs <range> set of cases. Generalized logic relies on straightforward segment intersection checks. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Link: https://lore.kernel.org/r/20231112010609.848406-3-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-15bpf: generalize reg_set_min_max() to handle non-const register comparisonsAndrii Nakryiko
Generalize bounds adjustment logic of reg_set_min_max() to handle not just register vs constant case, but in general any register vs any register cases. For most of the operations it's trivial extension based on range vs range comparison logic, we just need to properly pick min/max of a range to compare against min/max of the other range. For BPF_JSET we keep the original capabilities, just make sure JSET is integrated in the common framework. This is manifested in the internal-only BPF_JSET + BPF_X "opcode" to allow for simpler and more uniform rev_opcode() handling. See the code for details. This allows to reuse the same code exactly both for TRUE and FALSE branches without explicitly handling both conditions with custom code. Note also that now we don't need a special handling of BPF_JEQ/BPF_JNE case none of the registers are constants. This is now just a normal generic case handled by reg_set_min_max(). To make tnum handling cleaner, tnum_with_subreg() helper is added, as that's a common operator when dealing with 32-bit subregister bounds. This keeps the overall logic much less noisy when it comes to tnums. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Link: https://lore.kernel.org/r/20231112010609.848406-2-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-15bpf: Do not allocate percpu memory at init stageYonghong Song
Kirill Shutemov reported significant percpu memory consumption increase after booting in 288-cpu VM ([1]) due to commit 41a5db8d8161 ("bpf: Add support for non-fix-size percpu mem allocation"). The percpu memory consumption is increased from 111MB to 969MB. The number is from /proc/meminfo. I tried to reproduce the issue with my local VM which at most supports upto 255 cpus. With 252 cpus, without the above commit, the percpu memory consumption immediately after boot is 57MB while with the above commit the percpu memory consumption is 231MB. This is not good since so far percpu memory from bpf memory allocator is not widely used yet. Let us change pre-allocation in init stage to on-demand allocation when verifier detects there is a need of percpu memory for bpf program. With this change, percpu memory consumption after boot can be reduced signicantly. [1] https://lore.kernel.org/lkml/20231109154934.4saimljtqx625l3v@box.shutemov.name/ Fixes: 41a5db8d8161 ("bpf: Add support for non-fix-size percpu mem allocation") Reported-and-tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20231111013928.948838-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-15perf/core: Fix narrow startup race when creating the perf nr_addr_filters ↵Greg KH
sysfs file Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/2023061204-decal-flyable-6090@gregkh
2023-11-15Merge branch 'tip/perf/urgent'Peter Zijlstra
Avoid conflicts, base on fixes. Signed-off-by: Peter Zijlstra <peterz@infradead.org>
2023-11-15sched/timers: Explain why idle task schedules out on remote timer enqueueFrederic Weisbecker
Trying to avoid that didn't bring much value after testing, add comment about this. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Rafael J. Wysocki <rafael@kernel.org> Link: https://lkml.kernel.org/r/20231114193840.4041-3-frederic@kernel.org
2023-11-15sched/cpuidle: Comment about timers requirements VS idle handlerFrederic Weisbecker
Add missing explanation concerning IRQs re-enablement constraints in the cpuidle path against timers. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Rafael J. Wysocki <rafael@kernel.org> Link: https://lkml.kernel.org/r/20231114193840.4041-2-frederic@kernel.org
2023-11-15sched/deadline: Introduce deadline serversPeter Zijlstra
Low priority tasks (e.g., SCHED_OTHER) can suffer starvation if tasks with higher priority (e.g., SCHED_FIFO) monopolize CPU(s). RT Throttling has been introduced a while ago as a (mostly debug) countermeasure one can utilize to reserve some CPU time for low priority tasks (usually background type of work, e.g. workqueues, timers, etc.). It however has its own problems (see documentation) and the undesired effect of unconditionally throttling FIFO tasks even when no lower priority activity needs to run (there are mechanisms to fix this issue as well, but, again, with their own problems). Introduce deadline servers to service low priority tasks needs under starvation conditions. Deadline servers are built extending SCHED_DEADLINE implementation to allow 2-level scheduling (a sched_deadline entity becomes a container for lower priority scheduling entities). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/4968601859d920335cf85822eb573a5f179f04b8.1699095159.git.bristot@kernel.org
2023-11-15sched/deadline: Move bandwidth accounting into {en,de}queue_dl_entityPeter Zijlstra
In preparation of introducing !task sched_dl_entity; move the bandwidth accounting into {en.de}queue_dl_entity(). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Phil Auld <pauld@redhat.com> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/a86dccbbe44e021b8771627e1dae01a69b73466d.1699095159.git.bristot@kernel.org
2023-11-15sched/deadline: Collect sched_dl_entity initializationPeter Zijlstra
Create a single function that initializes a sched_dl_entity. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Phil Auld <pauld@redhat.com> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/51acc695eecf0a1a2f78f9a044e11ffd9b316bcf.1699095159.git.bristot@kernel.org
2023-11-15sched: Unify more update_curr*()Peter Zijlstra
Now that trace_sched_stat_runtime() no longer takes a vruntime argument, the task specific bits are identical between update_curr_common() and update_curr(). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2023-11-15sched: Remove vruntime from trace_sched_stat_runtime()Peter Zijlstra
Tracing the runtime delta makes sense, observer can sum over time. Tracing the absolute vruntime makes less sense, inconsistent: absolute-vs-delta, but also vruntime delta can be computed from runtime delta. Removing the vruntime thing also makes the two tracepoint sites identical, allowing to unify the code in a later patch. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2023-11-15sched: Unify runtime accounting across classesPeter Zijlstra
All classes use sched_entity::exec_start to track runtime and have copies of the exact same code around to compute runtime. Collapse all that. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Phil Auld <pauld@redhat.com> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Link: https://lkml.kernel.org/r/54d148a144f26d9559698c4dd82d8859038a7380.1699095159.git.bristot@kernel.org
2023-11-15sched/eevdf: O(1) fastpath for task selectionAbel Wu
Since the RB-tree is now sorted by deadline, let's first try the leftmost entity which has the earliest virtual deadline. I've done some benchmarks to see its effectiveness. All the benchmarks are done inside a normal cpu cgroup in a clean environment with cpu turbo disabled, on a dual-CPU Intel Xeon(R) Platinum 8260 with 2 NUMA nodes each of which has 24C/48T. hackbench: process/thread + pipe/socket + 1/2/4/8 groups netperf: TCP/UDP + STREAM/RR + 24/48/72/96/192 threads tbench: loopback 24/48/72/96/192 threads schbench: 1/2/4/8 mthreads direct: cfs_rq has only one entity parity: RUN_TO_PARITY fast: O(1) fastpath slow: heap search (%) direct parity fast slow hackbench 92.95 2.02 4.91 0.12 netperf 68.08 6.60 24.18 1.14 tbench 67.55 11.22 20.61 0.62 schbench 69.91 2.65 25.73 1.71 The above results indicate that this fastpath really makes task selection more efficient. Signed-off-by: Abel Wu <wuyun.abel@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20231115033647.80785-4-wuyun.abel@bytedance.com
2023-11-15sched/eevdf: Sort the rbtree by virtual deadlineAbel Wu
Sort the task timeline by virtual deadline and keep the min_vruntime in the augmented tree, so we can avoid doubling the worst case cost and make full use of the cached leftmost node to enable O(1) fastpath picking in next patch. Signed-off-by: Abel Wu <wuyun.abel@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20231115033647.80785-3-wuyun.abel@bytedance.com
2023-11-15sched/numa: Fix mm numa_scan_seq based unconditional scanRaghavendra K T
Since commit fc137c0ddab2 ("sched/numa: enhance vma scanning logic") NUMA Balancing allows updating PTEs to trap NUMA hinting faults if the task had previously accessed VMA. However unconditional scan of VMAs are allowed during initial phase of VMA creation until process's mm numa_scan_seq reaches 2 even though current task had not accessed VMA. Rationale: - Without initial scan subsequent PTE update may never happen. - Give fair opportunity to all the VMAs to be scanned and subsequently understand the access pattern of all the VMAs. But it has a corner case where, if a VMA is created after some time, process's mm numa_scan_seq could be already greater than 2. For e.g., values of mm numa_scan_seq when VMAs are created by running mmtest autonuma benchmark briefly looks like: start_seq=0 : 459 start_seq=2 : 138 start_seq=3 : 144 start_seq=4 : 8 start_seq=8 : 1 start_seq=9 : 1 This results in no unconditional PTE updates for those VMAs created after some time. Fix: - Note down the initial value of mm numa_scan_seq in per VMA start_seq. - Allow unconditional scan till start_seq + 2. Result: SUT: AMD EPYC Milan with 2 NUMA nodes 256 cpus. base kernel: upstream 6.6-rc6 with Mels patches [1] applied. kernbench ========== base patched %gain Amean elsp-128 165.09 ( 0.00%) 164.78 * 0.19%* Duration User 41404.28 41375.08 Duration System 9862.22 9768.48 Duration Elapsed 519.87 518.72 Ops NUMA PTE updates 1041416.00 831536.00 Ops NUMA hint faults 263296.00 220966.00 Ops NUMA pages migrated 258021.00 212769.00 Ops AutoNUMA cost 1328.67 1114.69 autonumabench NUMA01_THREADLOCAL ================== Amean elsp-NUMA01_THREADLOCAL 81.79 (0.00%) 67.74 * 17.18%* Duration User 54832.73 47379.67 Duration System 75.00 185.75 Duration Elapsed 576.72 476.09 Ops NUMA PTE updates 394429.00 11121044.00 Ops NUMA hint faults 1001.00 8906404.00 Ops NUMA pages migrated 288.00 2998694.00 Ops AutoNUMA cost 7.77 44666.84 Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mel Gorman <mgorman@suse.de> Link: https://lore.kernel.org/r/2ea7cbce80ac7c62e90cbfb9653a7972f902439f.1697816692.git.raghavendra.kt@amd.com
2023-11-15sched: Use WRITE_ONCE() for p->on_rqPaul E. McKenney
Since RCU-tasks uses READ_ONCE(p->on_rq), ensure the write-side matches with WRITE_ONCE(). Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/e4896e0b-eacc-45a2-a7a8-de2280a51ecc@paulmck-laptop
2023-11-15perf/core: Fix cpuctx refcountingPeter Zijlstra
Audit of the refcounting turned up that perf_pmu_migrate_context() fails to migrate the ctx refcount. Fixes: bd2756811766 ("perf: Rewrite core context handling") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/20230612093539.085862001@infradead.org Cc: <stable@vger.kernel.org>
2023-11-15futex: Fix hardcoded flagsPeter Zijlstra
Xi reported that commit 5694289ce183 ("futex: Flag conversion") broke glibc's robust futex tests. This was narrowed down to the change of FLAGS_SHARED from 0x01 to 0x10, at which point Florian noted that handle_futex_death() has a hardcoded flags argument of 1. Change this to: FLAGS_SIZE_32 | FLAGS_SHARED, matching how futex_to_flags() unconditionally sets FLAGS_SIZE_32 for all legacy futex ops. Reported-by: Xi Ruoyao <xry111@xry111.site> Reported-by: Florian Weimer <fweimer@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/20231114201402.GA25315@noisy.programming.kicks-ass.net Fixes: 5694289ce183 ("futex: Flag conversion") Cc: <stable@vger.kernel.org>
2023-11-14audit: don't WARN_ON_ONCE(!current->mm) in audit_exe_compare()Paul Moore
eBPF can end up calling into the audit code from some odd places, and some of these places don't have @current set properly so we end up tripping the `WARN_ON_ONCE(!current->mm)` near the top of `audit_exe_compare()`. While the basic `!current->mm` check is good, the `WARN_ON_ONCE()` results in some scary console messages so let's drop that and just do the regular `!current->mm` check to avoid problems. Cc: <stable@vger.kernel.org> Fixes: 47846d51348d ("audit: don't take task_lock() in audit_exe_compare() code path") Reported-by: Artem Savkov <asavkov@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
2023-11-14sched/fair: Fix the decision for load balanceKeisuke Nishimura
should_we_balance is called for the decision to do load-balancing. When sched ticks invoke this function, only one CPU should return true. However, in the current code, two CPUs can return true. The following situation, where b means busy and i means idle, is an example, because CPU 0 and CPU 2 return true. [0, 1] [2, 3] b b i b This fix checks if there exists an idle CPU with busy sibling(s) after looking for a CPU on an idle core. If some idle CPUs with busy siblings are found, just the first one should do load-balancing. Fixes: b1bfeab9b002 ("sched/fair: Consider the idle state of the whole core for load balance") Signed-off-by: Keisuke Nishimura <keisuke.nishimura@inria.fr> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Chen Yu <yu.c.chen@intel.com> Reviewed-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20231031133821.1570861-1-keisuke.nishimura@inria.fr
2023-11-14sched: psi: fix unprivileged polling against cgroupsJohannes Weiner
519fabc7aaba ("psi: remove 500ms min window size limitation for triggers") breaks unprivileged psi polling on cgroups. Historically, we had a privilege check for polling in the open() of a pressure file in /proc, but were erroneously missing it for the open() of cgroup pressure files. When unprivileged polling was introduced in d82caa273565 ("sched/psi: Allow unprivileged polling of N*2s period"), it needed to filter privileges depending on the exact polling parameters, and as such moved the CAP_SYS_RESOURCE check from the proc open() callback to psi_trigger_create(). Both the proc files as well as cgroup files go through this during write(). This implicitly added the missing check for privileges required for HT polling for cgroups. When 519fabc7aaba ("psi: remove 500ms min window size limitation for triggers") followed right after to remove further restrictions on the RT polling window, it incorrectly assumed the cgroup privilege check was still missing and added it to the cgroup open(), mirroring what we used to do for proc files in the past. As a result, unprivileged poll requests that would be supported now get rejected when opening the cgroup pressure file for writing. Remove the cgroup open() check. psi_trigger_create() handles it. Fixes: 519fabc7aaba ("psi: remove 500ms min window size limitation for triggers") Reported-by: Luca Boccassi <bluca@debian.org> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Luca Boccassi <bluca@debian.org> Acked-by: Suren Baghdasaryan <surenb@google.com> Cc: stable@vger.kernel.org # 6.5+ Link: https://lore.kernel.org/r/20231026164114.2488682-1-hannes@cmpxchg.org
2023-11-14sched/eevdf: Fix vruntime adjustment on reweightAbel Wu
vruntime of the (on_rq && !0-lag) entity needs to be adjusted when it gets re-weighted, and the calculations can be simplified based on the fact that re-weight won't change the w-average of all the entities. Please check the proofs in comments. But adjusting vruntime can also cause position change in RB-tree hence require re-queue to fix up which might be costly. This might be avoided by deferring adjustment to the time the entity actually leaves tree (dequeue/pick), but that will negatively affect task selection and probably not good enough either. Fixes: 147f3efaa241 ("sched/fair: Implement an EEVDF-like scheduling policy") Signed-off-by: Abel Wu <wuyun.abel@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20231107090510.71322-2-wuyun.abel@bytedance.com
2023-11-14bpf: Add a new kfunc for cgroup1 hierarchyYafang Shao
A new kfunc is added to acquire cgroup1 of a task: - bpf_task_get_cgroup1 Acquires the associated cgroup of a task whithin a specific cgroup1 hierarchy. The cgroup1 hierarchy is identified by its hierarchy ID. This new kfunc enables the tracing of tasks within a designated container or cgroup directory in BPF programs. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20231111090034.4248-2-laoar.shao@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-12LSM: Create lsm_list_modules system callCasey Schaufler
Create a system call to report the list of Linux Security Modules that are active on the system. The list is provided as an array of LSM ID numbers. The calling application can use this list determine what LSM specific actions it might take. That might include choosing an output format, determining required privilege or bypassing security module specific behavior. Signed-off-by: Casey Schaufler <casey@schaufler-ca.com> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Serge Hallyn <serge@hallyn.com> Reviewed-by: John Johansen <john.johansen@canonical.com> Reviewed-by: Mickaël Salaün <mic@digikod.net> Signed-off-by: Paul Moore <paul@paul-moore.com>