summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2023-12-05tracing: Disable snapshot buffer when stopping instance tracersSteven Rostedt (Google)
It use to be that only the top level instance had a snapshot buffer (for latency tracers like wakeup and irqsoff). When stopping a tracer in an instance would not disable the snapshot buffer. This could have some unintended consequences if the irqsoff tracer is enabled. Consolidate the tracing_start/stop() with tracing_start/stop_tr() so that all instances behave the same. The tracing_start/stop() functions will just call their respective tracing_start/stop_tr() with the global_array passed in. Link: https://lkml.kernel.org/r/20231205220011.041220035@goodmis.org Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Fixes: 6d9b3fa5e7f6 ("tracing: Move tracing_max_latency into trace_array") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-05tracing: Stop current tracer when resizing bufferSteven Rostedt (Google)
When the ring buffer is being resized, it can cause side effects to the running tracer. For instance, there's a race with irqsoff tracer that swaps individual per cpu buffers between the main buffer and the snapshot buffer. The resize operation modifies the main buffer and then the snapshot buffer. If a swap happens in between those two operations it will break the tracer. Simply stop the running tracer before resizing the buffers and enable it again when finished. Link: https://lkml.kernel.org/r/20231205220010.748996423@goodmis.org Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Fixes: 3928a8a2d9808 ("ftrace: make work with new ring buffer") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-05tracing: Always update snapshot buffer sizeSteven Rostedt (Google)
It use to be that only the top level instance had a snapshot buffer (for latency tracers like wakeup and irqsoff). The update of the ring buffer size would check if the instance was the top level and if so, it would also update the snapshot buffer as it needs to be the same as the main buffer. Now that lower level instances also has a snapshot buffer, they too need to update their snapshot buffer sizes when the main buffer is changed, otherwise the following can be triggered: # cd /sys/kernel/tracing # echo 1500 > buffer_size_kb # mkdir instances/foo # echo irqsoff > instances/foo/current_tracer # echo 1000 > instances/foo/buffer_size_kb Produces: WARNING: CPU: 2 PID: 856 at kernel/trace/trace.c:1938 update_max_tr_single.part.0+0x27d/0x320 Which is: ret = ring_buffer_swap_cpu(tr->max_buffer.buffer, tr->array_buffer.buffer, cpu); if (ret == -EBUSY) { [..] } WARN_ON_ONCE(ret && ret != -EAGAIN && ret != -EBUSY); <== here That's because ring_buffer_swap_cpu() has: int ret = -EINVAL; [..] /* At least make sure the two buffers are somewhat the same */ if (cpu_buffer_a->nr_pages != cpu_buffer_b->nr_pages) goto out; [..] out: return ret; } Instead, update all instances' snapshot buffer sizes when their main buffer size is updated. Link: https://lkml.kernel.org/r/20231205220010.454662151@goodmis.org Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Fixes: 6d9b3fa5e7f6 ("tracing: Move tracing_max_latency into trace_array") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-05bpf: track aligned STACK_ZERO cases as imprecise spilled registersAndrii Nakryiko
Now that precision backtracing is supporting register spill/fill to/from stack, there is another oportunity to be exploited here: minimizing precise STACK_ZERO cases. With a simple code change we can rely on initially imprecise register spill tracking for cases when register spilled to stack was a known zero. This is a very common case for initializing on the stack variables, including rather large structures. Often times zero has no special meaning for the subsequent BPF program logic and is often overwritten with non-zero values soon afterwards. But due to STACK_ZERO vs STACK_MISC tracking, such initial zero initialization actually causes duplication of verifier states as STACK_ZERO is clearly different than STACK_MISC or spilled SCALAR_VALUE register. The effect of this (now) trivial change is huge, as can be seen below. These are differences between BPF selftests, Cilium, and Meta-internal BPF object files relative to previous patch in this series. You can see improvements ranging from single-digit percentage improvement for instructions and states, all the way to 50-60% reduction for some of Meta-internal host agent programs, and even some Cilium programs. For Meta-internal ones I left only the differences for largest BPF object files by states/instructions, as there were too many differences in the overall output. All the differences were improvements, reducting number of states and thus instructions validated. Note, Meta-internal BPF object file names are not printed below. Many copies of balancer_ingress are actually many different configurations of Katran, so they are different BPF programs, which explains state reduction going from -16% all the way to 31%, depending on BPF program logic complexity. I also tooked a closer look at a few small-ish BPF programs to validate the behavior. Let's take bpf_iter_netrlink.bpf.o (first row below). While it's just 8 vs 5 states, verifier log is still pretty long to include it here. But the reduction in states is due to the following piece of C code: unsigned long ino; ... sk = s->sk_socket; if (!sk) { ino = 0; } else { inode = SOCK_INODE(sk); bpf_probe_read_kernel(&ino, sizeof(ino), &inode->i_ino); } BPF_SEQ_PRINTF(seq, "%-8u %-8lu\n", s->sk_drops.counter, ino); return 0; You can see that in some situations `ino` is zero-initialized, while in others it's unknown value filled out by bpf_probe_read_kernel(). Before this change code after if/else branches have to be validated twice. Once with (precise) ino == 0, due to eager STACK_ZERO logic, and then again for when ino is just STACK_MISC. But BPF_SEQ_PRINTF() doesn't care about precise value of ino, so with the change in this patch verifier is able to prune states from after one of the branches, reducing number of total states (and instructions) required for successful validation. Similar principle applies to bigger real-world applications, just at a much larger scale. SELFTESTS ========= File Program Insns (A) Insns (B) Insns (DIFF) States (A) States (B) States (DIFF) --------------------------------------- ----------------------- --------- --------- --------------- ---------- ---------- ------------- bpf_iter_netlink.bpf.linked3.o dump_netlink 148 104 -44 (-29.73%) 8 5 -3 (-37.50%) bpf_iter_unix.bpf.linked3.o dump_unix 8474 8404 -70 (-0.83%) 151 147 -4 (-2.65%) bpf_loop.bpf.linked3.o stack_check 560 324 -236 (-42.14%) 42 24 -18 (-42.86%) local_storage_bench.bpf.linked3.o get_local 120 77 -43 (-35.83%) 9 6 -3 (-33.33%) loop6.bpf.linked3.o trace_virtqueue_add_sgs 10167 9868 -299 (-2.94%) 226 206 -20 (-8.85%) pyperf600_bpf_loop.bpf.linked3.o on_event 4872 3423 -1449 (-29.74%) 322 229 -93 (-28.88%) strobemeta.bpf.linked3.o on_event 180697 176036 -4661 (-2.58%) 4780 4734 -46 (-0.96%) test_cls_redirect.bpf.linked3.o cls_redirect 65594 65401 -193 (-0.29%) 4230 4212 -18 (-0.43%) test_global_func_args.bpf.linked3.o test_cls 145 136 -9 (-6.21%) 10 9 -1 (-10.00%) test_l4lb.bpf.linked3.o balancer_ingress 4760 2612 -2148 (-45.13%) 113 102 -11 (-9.73%) test_l4lb_noinline.bpf.linked3.o balancer_ingress 4845 4877 +32 (+0.66%) 219 221 +2 (+0.91%) test_l4lb_noinline_dynptr.bpf.linked3.o balancer_ingress 2072 2087 +15 (+0.72%) 97 98 +1 (+1.03%) test_seg6_loop.bpf.linked3.o __add_egr_x 12440 9975 -2465 (-19.82%) 364 353 -11 (-3.02%) test_tcp_hdr_options.bpf.linked3.o estab 2558 2572 +14 (+0.55%) 179 180 +1 (+0.56%) test_xdp_dynptr.bpf.linked3.o _xdp_tx_iptunnel 645 596 -49 (-7.60%) 26 24 -2 (-7.69%) test_xdp_noinline.bpf.linked3.o balancer_ingress_v6 3520 3516 -4 (-0.11%) 216 216 +0 (+0.00%) xdp_synproxy_kern.bpf.linked3.o syncookie_tc 82661 81241 -1420 (-1.72%) 5073 5155 +82 (+1.62%) xdp_synproxy_kern.bpf.linked3.o syncookie_xdp 84964 82297 -2667 (-3.14%) 5130 5157 +27 (+0.53%) META-INTERNAL ============= Program Insns (A) Insns (B) Insns (DIFF) States (A) States (B) States (DIFF) -------------------------------------- --------- --------- ----------------- ---------- ---------- --------------- balancer_ingress 27925 23608 -4317 (-15.46%) 1488 1482 -6 (-0.40%) balancer_ingress 31824 27546 -4278 (-13.44%) 1658 1652 -6 (-0.36%) balancer_ingress 32213 27935 -4278 (-13.28%) 1689 1683 -6 (-0.36%) balancer_ingress 32213 27935 -4278 (-13.28%) 1689 1683 -6 (-0.36%) balancer_ingress 31824 27546 -4278 (-13.44%) 1658 1652 -6 (-0.36%) balancer_ingress 38647 29562 -9085 (-23.51%) 2069 1835 -234 (-11.31%) balancer_ingress 38647 29562 -9085 (-23.51%) 2069 1835 -234 (-11.31%) balancer_ingress 40339 30792 -9547 (-23.67%) 2193 1934 -259 (-11.81%) balancer_ingress 37321 29055 -8266 (-22.15%) 1972 1795 -177 (-8.98%) balancer_ingress 38176 29753 -8423 (-22.06%) 2008 1831 -177 (-8.81%) balancer_ingress 29193 20910 -8283 (-28.37%) 1599 1422 -177 (-11.07%) balancer_ingress 30013 21452 -8561 (-28.52%) 1645 1447 -198 (-12.04%) balancer_ingress 28691 24290 -4401 (-15.34%) 1545 1531 -14 (-0.91%) balancer_ingress 34223 28965 -5258 (-15.36%) 1984 1875 -109 (-5.49%) balancer_ingress 35481 26158 -9323 (-26.28%) 2095 1806 -289 (-13.79%) balancer_ingress 35481 26158 -9323 (-26.28%) 2095 1806 -289 (-13.79%) balancer_ingress 35868 26455 -9413 (-26.24%) 2140 1827 -313 (-14.63%) balancer_ingress 35868 26455 -9413 (-26.24%) 2140 1827 -313 (-14.63%) balancer_ingress 35481 26158 -9323 (-26.28%) 2095 1806 -289 (-13.79%) balancer_ingress 35481 26158 -9323 (-26.28%) 2095 1806 -289 (-13.79%) balancer_ingress 34844 29485 -5359 (-15.38%) 2036 1918 -118 (-5.80%) fbflow_egress 3256 2652 -604 (-18.55%) 218 192 -26 (-11.93%) fbflow_ingress 1026 944 -82 (-7.99%) 70 63 -7 (-10.00%) sslwall_tc_egress 8424 7360 -1064 (-12.63%) 498 458 -40 (-8.03%) syar_accept_protect 15040 9539 -5501 (-36.58%) 364 220 -144 (-39.56%) syar_connect_tcp_v6 15036 9535 -5501 (-36.59%) 360 216 -144 (-40.00%) syar_connect_udp_v4 15039 9538 -5501 (-36.58%) 361 217 -144 (-39.89%) syar_connect_connect4_protect4 24805 15833 -8972 (-36.17%) 756 480 -276 (-36.51%) syar_lsm_file_open 167772 151813 -15959 (-9.51%) 1836 1667 -169 (-9.20%) syar_namespace_create_new 14805 9304 -5501 (-37.16%) 353 209 -144 (-40.79%) syar_python3_detect 17531 12030 -5501 (-31.38%) 391 247 -144 (-36.83%) syar_ssh_post_fork 16412 10911 -5501 (-33.52%) 405 261 -144 (-35.56%) syar_enter_execve 14728 9227 -5501 (-37.35%) 345 201 -144 (-41.74%) syar_enter_execveat 14728 9227 -5501 (-37.35%) 345 201 -144 (-41.74%) syar_exit_execve 16622 11121 -5501 (-33.09%) 376 232 -144 (-38.30%) syar_exit_execveat 16622 11121 -5501 (-33.09%) 376 232 -144 (-38.30%) syar_syscalls_kill 15288 9787 -5501 (-35.98%) 398 254 -144 (-36.18%) syar_task_enter_pivot_root 14898 9397 -5501 (-36.92%) 357 213 -144 (-40.34%) syar_syscalls_setreuid 16678 11177 -5501 (-32.98%) 429 285 -144 (-33.57%) syar_syscalls_setuid 16678 11177 -5501 (-32.98%) 429 285 -144 (-33.57%) syar_syscalls_process_vm_readv 14959 9458 -5501 (-36.77%) 364 220 -144 (-39.56%) syar_syscalls_process_vm_writev 15757 10256 -5501 (-34.91%) 390 246 -144 (-36.92%) do_uprobe 15519 10018 -5501 (-35.45%) 373 229 -144 (-38.61%) edgewall 179715 55783 -123932 (-68.96%) 12607 3999 -8608 (-68.28%) bictcp_state 7570 4131 -3439 (-45.43%) 496 269 -227 (-45.77%) cubictcp_state 7570 4131 -3439 (-45.43%) 496 269 -227 (-45.77%) tcp_rate_skb_delivered 447 272 -175 (-39.15%) 29 18 -11 (-37.93%) kprobe__bbr_set_state 4566 2615 -1951 (-42.73%) 209 124 -85 (-40.67%) kprobe__bictcp_state 4566 2615 -1951 (-42.73%) 209 124 -85 (-40.67%) inet_sock_set_state 1501 1337 -164 (-10.93%) 93 85 -8 (-8.60%) tcp_retransmit_skb 1145 981 -164 (-14.32%) 67 59 -8 (-11.94%) tcp_retransmit_synack 1183 951 -232 (-19.61%) 67 55 -12 (-17.91%) bpf_tcptuner 1459 1187 -272 (-18.64%) 99 80 -19 (-19.19%) tw_egress 801 776 -25 (-3.12%) 69 66 -3 (-4.35%) tw_ingress 795 770 -25 (-3.14%) 69 66 -3 (-4.35%) ttls_tc_ingress 19025 19383 +358 (+1.88%) 470 465 -5 (-1.06%) ttls_nat_egress 490 299 -191 (-38.98%) 33 20 -13 (-39.39%) ttls_nat_ingress 448 285 -163 (-36.38%) 32 21 -11 (-34.38%) tw_twfw_egress 511127 212071 -299056 (-58.51%) 16733 8504 -8229 (-49.18%) tw_twfw_ingress 500095 212069 -288026 (-57.59%) 16223 8504 -7719 (-47.58%) tw_twfw_tc_eg 511113 212064 -299049 (-58.51%) 16732 8504 -8228 (-49.18%) tw_twfw_tc_in 500095 212069 -288026 (-57.59%) 16223 8504 -7719 (-47.58%) tw_twfw_egress 12632 12435 -197 (-1.56%) 276 260 -16 (-5.80%) tw_twfw_ingress 12631 12454 -177 (-1.40%) 278 261 -17 (-6.12%) tw_twfw_tc_eg 12595 12435 -160 (-1.27%) 274 259 -15 (-5.47%) tw_twfw_tc_in 12631 12454 -177 (-1.40%) 278 261 -17 (-6.12%) tw_xdp_dump 266 209 -57 (-21.43%) 9 8 -1 (-11.11%) CILIUM ========= File Program Insns (A) Insns (B) Insns (DIFF) States (A) States (B) States (DIFF) ------------- -------------------------------- --------- --------- ---------------- ---------- ---------- -------------- bpf_host.o cil_to_netdev 6047 4578 -1469 (-24.29%) 362 249 -113 (-31.22%) bpf_host.o handle_lxc_traffic 2227 1585 -642 (-28.83%) 156 103 -53 (-33.97%) bpf_host.o tail_handle_ipv4_from_netdev 2244 1458 -786 (-35.03%) 163 106 -57 (-34.97%) bpf_host.o tail_handle_nat_fwd_ipv4 21022 10479 -10543 (-50.15%) 1289 670 -619 (-48.02%) bpf_host.o tail_handle_nat_fwd_ipv6 15433 11375 -4058 (-26.29%) 905 643 -262 (-28.95%) bpf_host.o tail_ipv4_host_policy_ingress 2219 1367 -852 (-38.40%) 161 96 -65 (-40.37%) bpf_host.o tail_nodeport_nat_egress_ipv4 22460 19862 -2598 (-11.57%) 1469 1293 -176 (-11.98%) bpf_host.o tail_nodeport_nat_ingress_ipv4 5526 3534 -1992 (-36.05%) 366 243 -123 (-33.61%) bpf_host.o tail_nodeport_nat_ingress_ipv6 5132 4256 -876 (-17.07%) 241 219 -22 (-9.13%) bpf_host.o tail_nodeport_nat_ipv6_egress 3702 3542 -160 (-4.32%) 215 205 -10 (-4.65%) bpf_lxc.o tail_handle_nat_fwd_ipv4 21022 10479 -10543 (-50.15%) 1289 670 -619 (-48.02%) bpf_lxc.o tail_handle_nat_fwd_ipv6 15433 11375 -4058 (-26.29%) 905 643 -262 (-28.95%) bpf_lxc.o tail_ipv4_ct_egress 5073 3374 -1699 (-33.49%) 262 172 -90 (-34.35%) bpf_lxc.o tail_ipv4_ct_ingress 5093 3385 -1708 (-33.54%) 262 172 -90 (-34.35%) bpf_lxc.o tail_ipv4_ct_ingress_policy_only 5093 3385 -1708 (-33.54%) 262 172 -90 (-34.35%) bpf_lxc.o tail_ipv6_ct_egress 4593 3878 -715 (-15.57%) 194 151 -43 (-22.16%) bpf_lxc.o tail_ipv6_ct_ingress 4606 3891 -715 (-15.52%) 194 151 -43 (-22.16%) bpf_lxc.o tail_ipv6_ct_ingress_policy_only 4606 3891 -715 (-15.52%) 194 151 -43 (-22.16%) bpf_lxc.o tail_nodeport_nat_ingress_ipv4 5526 3534 -1992 (-36.05%) 366 243 -123 (-33.61%) bpf_lxc.o tail_nodeport_nat_ingress_ipv6 5132 4256 -876 (-17.07%) 241 219 -22 (-9.13%) bpf_overlay.o tail_handle_nat_fwd_ipv4 20524 10114 -10410 (-50.72%) 1271 638 -633 (-49.80%) bpf_overlay.o tail_nodeport_nat_egress_ipv4 22718 19490 -3228 (-14.21%) 1475 1275 -200 (-13.56%) bpf_overlay.o tail_nodeport_nat_ingress_ipv4 5526 3534 -1992 (-36.05%) 366 243 -123 (-33.61%) bpf_overlay.o tail_nodeport_nat_ingress_ipv6 5132 4256 -876 (-17.07%) 241 219 -22 (-9.13%) bpf_overlay.o tail_nodeport_nat_ipv6_egress 3638 3548 -90 (-2.47%) 209 203 -6 (-2.87%) bpf_overlay.o tail_rev_nodeport_lb4 4368 3820 -548 (-12.55%) 248 215 -33 (-13.31%) bpf_overlay.o tail_rev_nodeport_lb6 2867 2428 -439 (-15.31%) 167 140 -27 (-16.17%) bpf_sock.o cil_sock6_connect 1718 1703 -15 (-0.87%) 100 99 -1 (-1.00%) bpf_xdp.o tail_handle_nat_fwd_ipv4 12917 12443 -474 (-3.67%) 875 849 -26 (-2.97%) bpf_xdp.o tail_handle_nat_fwd_ipv6 13515 13264 -251 (-1.86%) 715 702 -13 (-1.82%) bpf_xdp.o tail_lb_ipv4 39492 36367 -3125 (-7.91%) 2430 2251 -179 (-7.37%) bpf_xdp.o tail_lb_ipv6 80441 78058 -2383 (-2.96%) 3647 3523 -124 (-3.40%) bpf_xdp.o tail_nodeport_ipv6_dsr 1038 901 -137 (-13.20%) 61 55 -6 (-9.84%) bpf_xdp.o tail_nodeport_nat_egress_ipv4 13027 12096 -931 (-7.15%) 868 809 -59 (-6.80%) bpf_xdp.o tail_nodeport_nat_ingress_ipv4 7617 5900 -1717 (-22.54%) 522 413 -109 (-20.88%) bpf_xdp.o tail_nodeport_nat_ingress_ipv6 7575 7395 -180 (-2.38%) 383 374 -9 (-2.35%) bpf_xdp.o tail_rev_nodeport_lb4 6808 6739 -69 (-1.01%) 403 396 -7 (-1.74%) bpf_xdp.o tail_rev_nodeport_lb6 16173 15847 -326 (-2.02%) 1010 990 -20 (-1.98%) Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231205184248.1502704-9-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-05bpf: preserve constant zero when doing partial register restoreAndrii Nakryiko
Similar to special handling of STACK_ZERO, when reading 1/2/4 bytes from stack from slot that has register spilled into it and that register has a constant value zero, preserve that zero and mark spilled register as precise for that. This makes spilled const zero register and STACK_ZERO cases equivalent in their behavior. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231205184248.1502704-7-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-05bpf: preserve STACK_ZERO slots on partial reg spillsAndrii Nakryiko
Instead of always forcing STACK_ZERO slots to STACK_MISC, preserve it in situations where this is possible. E.g., when spilling register as 1/2/4-byte subslots on the stack, all the remaining bytes in the stack slot do not automatically become unknown. If we knew they contained zeroes, we can preserve those STACK_ZERO markers. Add a helper mark_stack_slot_misc(), similar to scrub_spilled_slot(), but that doesn't overwrite either STACK_INVALID nor STACK_ZERO. Note that we need to take into account possibility of being in unprivileged mode, in which case STACK_INVALID is forced to STACK_MISC for correctness, as treating STACK_INVALID as equivalent STACK_MISC is only enabled in privileged mode. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231205184248.1502704-5-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-05bpf: fix check for attempt to corrupt spilled pointerAndrii Nakryiko
When register is spilled onto a stack as a 1/2/4-byte register, we set slot_type[BPF_REG_SIZE - 1] (plus potentially few more below it, depending on actual spill size). So to check if some stack slot has spilled register we need to consult slot_type[7], not slot_type[0]. To avoid the need to remember and double-check this in the future, just use is_spilled_reg() helper. Fixes: 27113c59b6d0 ("bpf: Check the other end of slot_type for STACK_SPILL") Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231205184248.1502704-4-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-05bpf: support non-r10 register spill/fill to/from stack in precision trackingAndrii Nakryiko
Use instruction (jump) history to record instructions that performed register spill/fill to/from stack, regardless if this was done through read-only r10 register, or any other register after copying r10 into it *and* potentially adjusting offset. To make this work reliably, we push extra per-instruction flags into instruction history, encoding stack slot index (spi) and stack frame number in extra 10 bit flags we take away from prev_idx in instruction history. We don't touch idx field for maximum performance, as it's checked most frequently during backtracking. This change removes basically the last remaining practical limitation of precision backtracking logic in BPF verifier. It fixes known deficiencies, but also opens up new opportunities to reduce number of verified states, explored in the subsequent patches. There are only three differences in selftests' BPF object files according to veristat, all in the positive direction (less states). File Program Insns (A) Insns (B) Insns (DIFF) States (A) States (B) States (DIFF) -------------------------------------- ------------- --------- --------- ------------- ---------- ---------- ------------- test_cls_redirect_dynptr.bpf.linked3.o cls_redirect 2987 2864 -123 (-4.12%) 240 231 -9 (-3.75%) xdp_synproxy_kern.bpf.linked3.o syncookie_tc 82848 82661 -187 (-0.23%) 5107 5073 -34 (-0.67%) xdp_synproxy_kern.bpf.linked3.o syncookie_xdp 85116 84964 -152 (-0.18%) 5162 5130 -32 (-0.62%) Note, I avoided renaming jmp_history to more generic insn_hist to minimize number of lines changed and potential merge conflicts between bpf and bpf-next trees. Notice also cur_hist_entry pointer reset to NULL at the beginning of instruction verification loop. This pointer avoids the problem of relying on last jump history entry's insn_idx to determine whether we already have entry for current instruction or not. It can happen that we added jump history entry because current instruction is_jmp_point(), but also we need to add instruction flags for stack access. In this case, we don't want to entries, so we need to reuse last added entry, if it is present. Relying on insn_idx comparison has the same ambiguity problem as the one that was fixed recently in [0], so we avoid that. [0] https://patchwork.kernel.org/project/netdevbpf/patch/20231110002638.4168352-3-andrii@kernel.org/ Acked-by: Eduard Zingerman <eddyz87@gmail.com> Reported-by: Tao Lyu <tao.lyu@epfl.ch> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231205184248.1502704-2-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-05iov_iter: replace import_single_range() with import_ubuf()Jens Axboe
With the removal of the 'iov' argument to import_single_range(), the two functions are now fully identical. Convert the import_single_range() callers to import_ubuf(), and remove the former fully. Signed-off-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/20231204174827.1258875-3-axboe@kernel.dk Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-12-05iov_iter: remove unused 'iov' argument from import_single_range()Jens Axboe
It is entirely unused, just get rid of it. Signed-off-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/20231204174827.1258875-2-axboe@kernel.dk Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-12-05cpu/hotplug: remove CPUHP_SLAB_PREPARE hooksVlastimil Babka
The CPUHP_SLAB_PREPARE hooks are only used by SLAB which is removed. SLUB defines them as NULL, so we can remove those altogether. Acked-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: David Rientjes <rientjes@google.com> Tested-by: David Rientjes <rientjes@google.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2023-12-04bpf: Optimize the free of inner mapHou Tao
When removing the inner map from the outer map, the inner map will be freed after one RCU grace period and one RCU tasks trace grace period, so it is certain that the bpf program, which may access the inner map, has exited before the inner map is freed. However there is no need to wait for one RCU tasks trace grace period if the outer map is only accessed by non-sleepable program. So adding sleepable_refcnt in bpf_map and increasing sleepable_refcnt when adding the outer map into env->used_maps for sleepable program. Although the max number of bpf program is INT_MAX - 1, the number of bpf programs which are being loaded may be greater than INT_MAX, so using atomic64_t instead of atomic_t for sleepable_refcnt. When removing the inner map from the outer map, using sleepable_refcnt to decide whether or not a RCU tasks trace grace period is needed before freeing the inner map. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20231204140425.1480317-6-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-04bpf: Defer the free of inner map when necessaryHou Tao
When updating or deleting an inner map in map array or map htab, the map may still be accessed by non-sleepable program or sleepable program. However bpf_map_fd_put_ptr() decreases the ref-counter of the inner map directly through bpf_map_put(), if the ref-counter is the last one (which is true for most cases), the inner map will be freed by ops->map_free() in a kworker. But for now, most .map_free() callbacks don't use synchronize_rcu() or its variants to wait for the elapse of a RCU grace period, so after the invocation of ops->map_free completes, the bpf program which is accessing the inner map may incur use-after-free problem. Fix the free of inner map by invoking bpf_map_free_deferred() after both one RCU grace period and one tasks trace RCU grace period if the inner map has been removed from the outer map before. The deferment is accomplished by using call_rcu() or call_rcu_tasks_trace() when releasing the last ref-counter of bpf map. The newly-added rcu_head field in bpf_map shares the same storage space with work field to reduce the size of bpf_map. Fixes: bba1dc0b55ac ("bpf: Remove redundant synchronize_rcu.") Fixes: 638e4b825d52 ("bpf: Allows per-cpu maps and map-in-map in sleepable programs") Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20231204140425.1480317-5-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-04bpf: Set need_defer as false when clearing fd array during map freeHou Tao
Both map deletion operation, map release and map free operation use fd_array_map_delete_elem() to remove the element from fd array and need_defer is always true in fd_array_map_delete_elem(). For the map deletion operation and map release operation, need_defer=true is necessary, because the bpf program, which accesses the element in fd array, may still alive. However for map free operation, it is certain that the bpf program which owns the fd array has already been exited, so setting need_defer as false is appropriate for map free operation. So fix it by adding need_defer parameter to bpf_fd_array_map_clear() and adding a new helper __fd_array_map_delete_elem() to handle the map deletion, map release and map free operations correspondingly. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20231204140425.1480317-4-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-04bpf: Add map and need_defer parameters to .map_fd_put_ptr()Hou Tao
map is the pointer of outer map, and need_defer needs some explanation. need_defer tells the implementation to defer the reference release of the passed element and ensure that the element is still alive before the bpf program, which may manipulate it, exits. The following three cases will invoke map_fd_put_ptr() and different need_defer values will be passed to these callers: 1) release the reference of the old element in the map during map update or map deletion. The release must be deferred, otherwise the bpf program may incur use-after-free problem, so need_defer needs to be true. 2) release the reference of the to-be-added element in the error path of map update. The to-be-added element is not visible to any bpf program, so it is OK to pass false for need_defer parameter. 3) release the references of all elements in the map during map release. Any bpf program which has access to the map must have been exited and released, so need_defer=false will be OK. These two parameters will be used by the following patches to fix the potential use-after-free problem for map-in-map. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20231204140425.1480317-3-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-04bpf: Check rcu_read_lock_trace_held() before calling bpf map helpersHou Tao
These three bpf_map_{lookup,update,delete}_elem() helpers are also available for sleepable bpf program, so add the corresponding lock assertion for sleepable bpf program, otherwise the following warning will be reported when a sleepable bpf program manipulates bpf map under interpreter mode (aka bpf_jit_enable=0): WARNING: CPU: 3 PID: 4985 at kernel/bpf/helpers.c:40 ...... CPU: 3 PID: 4985 Comm: test_progs Not tainted 6.6.0+ #2 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) ...... RIP: 0010:bpf_map_lookup_elem+0x54/0x60 ...... Call Trace: <TASK> ? __warn+0xa5/0x240 ? bpf_map_lookup_elem+0x54/0x60 ? report_bug+0x1ba/0x1f0 ? handle_bug+0x40/0x80 ? exc_invalid_op+0x18/0x50 ? asm_exc_invalid_op+0x1b/0x20 ? __pfx_bpf_map_lookup_elem+0x10/0x10 ? rcu_lockdep_current_cpu_online+0x65/0xb0 ? rcu_is_watching+0x23/0x50 ? bpf_map_lookup_elem+0x54/0x60 ? __pfx_bpf_map_lookup_elem+0x10/0x10 ___bpf_prog_run+0x513/0x3b70 __bpf_prog_run32+0x9d/0xd0 ? __bpf_prog_enter_sleepable_recur+0xad/0x120 ? __bpf_prog_enter_sleepable_recur+0x3e/0x120 bpf_trampoline_6442580665+0x4d/0x1000 __x64_sys_getpgid+0x5/0x30 ? do_syscall_64+0x36/0xb0 entry_SYSCALL_64_after_hwframe+0x6e/0x76 </TASK> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20231204140425.1480317-2-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-04kernel/resource: Increment by align value in get_free_mem_region()Alison Schofield
Currently get_free_mem_region() searches for available capacity in increments equal to the region size being requested. This can cause the search to take giant steps through the resource leaving needless gaps and missing available space. Specifically 'cxl create-region' fails with ERANGE even though capacity of the given size and CXL's expected 256M x InterleaveWays alignment can be satisfied. Replace the total-request-size increment with a next alignment increment so that the next possible address is always examined for availability. Fixes: 14b80582c43e ("resource: Introduce alloc_free_mem_region()") Reported-by: Dmytro Adamenko <dmytro.adamenko@intel.com> Reported-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Alison Schofield <alison.schofield@intel.com> Reviewed-by: Dave Jiang <dave.jiang@intel.com> Link: https://lore.kernel.org/r/20231113221324.1118092-1-alison.schofield@intel.com Cc: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2023-12-04bpf: Minor logging improvementAndrei Matei
One place where we were logging a register was only logging the variable part, not also the fixed part. Signed-off-by: Andrei Matei <andreimatei1@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20231204011248.2040084-1-andreimatei1@gmail.com
2023-12-03Merge tag 'probes-fixes-v6.7-rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull probes fixes from Masami Hiramatsu: - objpool: Fix objpool overrun case on memory/cache access delay especially on the big.LITTLE SoC. The objpool uses a copy of object slot index internal loop, but the slot index can be changed on another processor in parallel. In that case, the difference of 'head' local copy and the 'slot->last' index will be bigger than local slot size. In that case, we need to re-read the slot::head to update it. - kretprobe: Fix to use appropriate rcu API for kretprobe holder. Since kretprobe_holder::rp is RCU managed, it should use rcu_assign_pointer() and rcu_dereference_check() correctly. Also adding __rcu tag for finding wrong usage by sparse. - rethook: Fix to use appropriate rcu API for rethook::handler. The same as kretprobe, rethook::handler is RCU managed and it should use rcu_assign_pointer() and rcu_dereference_check(). This also adds __rcu tag for finding wrong usage by sparse. * tag 'probes-fixes-v6.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: rethook: Use __rcu pointer for rethook::handler kprobes: consistent rcu api usage for kretprobe holder lib: objpool: fix head overrun on RK3588 SBC
2023-12-02bpf: simplify tnum output if a fully known constantAndrii Nakryiko
Emit tnum representation as just a constant if all bits are known. Use decimal-vs-hex logic to determine exact format of emitted constant value, just like it's done for register range values. For that move tnum_strn() to kernel/bpf/log.c to reuse decimal-vs-hex determination logic and constants. Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231202175705.885270-12-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-02bpf: enforce precision of R0 on program/async callback returnAndrii Nakryiko
Given we enforce a valid range for program and async callback return value, we must mark R0 as precise to avoid incorrect state pruning. Fixes: b5dc0163d8fd ("bpf: precise scalar_value tracking") Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231202175705.885270-9-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-02bpf: unify async callback and program retval checksAndrii Nakryiko
Use common logic to verify program return values and async callback return values. This allows to avoid duplication of any extra steps necessary, like precision marking, which will be added in the next patch. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231202175705.885270-8-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-02bpf: enforce precise retval range on program exitAndrii Nakryiko
Similarly to subprog/callback logic, enforce return value of BPF program using more precise smin/smax range. We need to adjust a bunch of tests due to a changed format of an error message. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231202175705.885270-7-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-02bpf: enforce exact retval range on subprog/callback exitAndrii Nakryiko
Instead of relying on potentially imprecise tnum representation of expected return value range for callbacks and subprogs, validate that smin/smax range satisfy exact expected range of return values. E.g., if callback would need to return [0, 2] range, tnum can't represent this precisely and instead will allow [0, 3] range. By checking smin/smax range, we can make sure that subprog/callback indeed returns only valid [0, 2] range. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231202175705.885270-5-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-02bpf: enforce precision of R0 on callback returnAndrii Nakryiko
Given verifier checks actual value, r0 has to be precise, so we need to propagate precision properly. r0 also has to be marked as read, otherwise subsequent state comparisons will ignore such register as unimportant and precision won't really help here. Fixes: 69c087ba6225 ("bpf: Add bpf_for_each_map_elem() helper") Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231202175705.885270-4-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-02bpf: provide correct register name for exception callback retval checkAndrii Nakryiko
bpf_throw() is checking R1, so let's report R1 in the log. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231202175705.885270-3-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-01bpf: Add kfunc bpf_get_file_xattrSong Liu
It is common practice for security solutions to store tags/labels in xattrs. To implement similar functionalities in BPF LSM, add new kfunc bpf_get_file_xattr(). The first use case of bpf_get_file_xattr() is to implement file verifications with asymmetric keys. Specificially, security applications could use fsverity for file hashes and use xattr to store file signatures. (kfunc for fsverity hash will be added in a separate commit.) Currently, only xattrs with "user." prefix can be read with kfunc bpf_get_file_xattr(). As use cases evolve, we may add a dedicated prefix for bpf_get_file_xattr(). To avoid recursion, bpf_get_file_xattr can be only called from LSM hooks. Signed-off-by: Song Liu <song@kernel.org> Acked-by: Christian Brauner <brauner@kernel.org> Acked-by: KP Singh <kpsingh@kernel.org> Link: https://lore.kernel.org/r/20231129234417.856536-2-song@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-01bpf: Fix a verifier bug due to incorrect branch offset comparison with cpu=v4Yonghong Song
Bpf cpu=v4 support is introduced in [1] and Commit 4cd58e9af8b9 ("bpf: Support new 32bit offset jmp instruction") added support for new 32bit offset jmp instruction. Unfortunately, in function bpf_adj_delta_to_off(), for new branch insn with 32bit offset, the offset (plus/minor a small delta) compares to 16-bit offset bound [S16_MIN, S16_MAX], which caused the following verification failure: $ ./test_progs-cpuv4 -t verif_scale_pyperf180 ... insn 10 cannot be patched due to 16-bit range ... libbpf: failed to load object 'pyperf180.bpf.o' scale_test:FAIL:expect_success unexpected error: -12 (errno 12) #405 verif_scale_pyperf180:FAIL Note that due to recent llvm18 development, the patch [2] (already applied in bpf-next) needs to be applied to bpf tree for testing purpose. The fix is rather simple. For 32bit offset branch insn, the adjusted offset compares to [S32_MIN, S32_MAX] and then verification succeeded. [1] https://lore.kernel.org/all/20230728011143.3710005-1-yonghong.song@linux.dev [2] https://lore.kernel.org/bpf/20231110193644.3130906-1-yonghong.song@linux.dev Fixes: 4cd58e9af8b9 ("bpf: Support new 32bit offset jmp instruction") Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20231201024640.3417057-1-yonghong.song@linux.dev
2023-12-01tracing/uprobe: Replace strlcpy() with strscpy()Kees Cook
strlcpy() reads the entire source buffer first. This read may exceed the destination size limit. This is both inefficient and can lead to linear read overflows if a source string is not NUL-terminated[1]. Additionally, it returns the size of the source string, not the resulting size of the destination string. In an effort to remove strlcpy() completely[2], replace strlcpy() here with strscpy(). The negative return value is already handled by this code so no new handling is needed here. Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy [1] Link: https://github.com/KSPP/linux/issues/89 [2] Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: linux-trace-kernel@vger.kernel.org Acked-by: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Link: https://lore.kernel.org/r/20231130205607.work.463-kees@kernel.org Signed-off-by: Kees Cook <keescook@chromium.org>
2023-12-01params: Fix multi-line comment styleAndy Shevchenko
The multi-line comment style in the file is rather arbitrary. Make it follow the standard one. Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Link: https://lore.kernel.org/r/20231120151419.1661807-6-andriy.shevchenko@linux.intel.com Signed-off-by: Kees Cook <keescook@chromium.org>
2023-12-01params: Sort headersAndy Shevchenko
Sort the headers in alphabetic order in order to ease the maintenance for this part. Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Link: https://lore.kernel.org/r/20231120151419.1661807-5-andriy.shevchenko@linux.intel.com Signed-off-by: Kees Cook <keescook@chromium.org>
2023-12-01params: Use size_add() for kmalloc()Andy Shevchenko
Prevent allocations from integer overflow by using size_add(). Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Link: https://lore.kernel.org/r/20231120151419.1661807-4-andriy.shevchenko@linux.intel.com Signed-off-by: Kees Cook <keescook@chromium.org>
2023-12-01params: Do not go over the limit when getting the string lengthAndy Shevchenko
We can use strnlen() even on early stages and it prevents from going over the string boundaries in case it's already too long. Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Link: https://lore.kernel.org/r/20231120151419.1661807-3-andriy.shevchenko@linux.intel.com Signed-off-by: Kees Cook <keescook@chromium.org>
2023-12-01params: Introduce the param_unknown_fn typeAndy Shevchenko
Introduce a new type for the callback to parse an unknown argument. This unifies function prototypes which takes that as a parameter. Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Link: https://lore.kernel.org/r/20231120151419.1661807-2-andriy.shevchenko@linux.intel.com Signed-off-by: Kees Cook <keescook@chromium.org>
2023-12-01cgroup/rstat: Optimize cgroup_rstat_updated_list()Waiman Long
The current design of cgroup_rstat_cpu_pop_updated() is to traverse the updated tree in a way to pop out the leaf nodes first before their parents. This can cause traversal of multiple nodes before a leaf node can be found and popped out. IOW, a given node in the tree can be visited multiple times before the whole operation is done. So it is not very efficient and the code can be hard to read. With the introduction of cgroup_rstat_updated_list() to build a list of cgroups to be flushed first before any flushing operation is being done, we can optimize the way the updated tree nodes are being popped by pushing the parents first to the tail end of the list before their children. In this way, most updated tree nodes will be visited only once with the exception of the subtree root as we still need to go back to its parent and popped it out of its updated_children list. This also makes the code easier to read. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-12-01locking/mutex: Document that mutex_unlock() is non-atomicJann Horn
I have seen several cases of attempts to use mutex_unlock() to release an object such that the object can then be freed by another task. This is not safe because mutex_unlock(), in the MUTEX_FLAG_WAITERS && !MUTEX_FLAG_HANDOFF case, accesses the mutex structure after having marked it as unlocked; so mutex_unlock() requires its caller to ensure that the mutex stays alive until mutex_unlock() returns. If MUTEX_FLAG_WAITERS is set and there are real waiters, those waiters have to keep the mutex alive, but we could have a spurious MUTEX_FLAG_WAITERS left if an interruptible/killable waiter bailed between the points where __mutex_unlock_slowpath() did the cmpxchg reading the flags and where it acquired the wait_lock. ( With spinlocks, that kind of code pattern is allowed and, from what I remember, used in several places in the kernel. ) Document this, such a semantic difference between mutexes and spinlocks is fairly unintuitive. [ mingo: Made the changelog a bit more assertive, refined the comments. ] Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20231130204817.2031407-1-jannh@google.com
2023-12-01rethook: Use __rcu pointer for rethook::handlerMasami Hiramatsu (Google)
Since the rethook::handler is an RCU-maganged pointer so that it will notice readers the rethook is stopped (unregistered) or not, it should be an __rcu pointer and use appropriate functions to be accessed. This will use appropriate memory barrier when accessing it. OTOH, rethook::data is never changed, so we don't need to check it in get_kretprobe(). NOTE: To avoid sparse warning, rethook::handler is defined by a raw function pointer type with __rcu instead of rethook_handler_t. Link: https://lore.kernel.org/all/170126066201.398836.837498688669005979.stgit@devnote2/ Fixes: 54ecbe6f1ed5 ("rethook: Add a generic return hook") Cc: stable@vger.kernel.org Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202311241808.rv9ceuAh-lkp@intel.com/ Tested-by: JP Kobryn <inwardvessel@gmail.com> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2023-12-01kprobes: consistent rcu api usage for kretprobe holderJP Kobryn
It seems that the pointer-to-kretprobe "rp" within the kretprobe_holder is RCU-managed, based on the (non-rethook) implementation of get_kretprobe(). The thought behind this patch is to make use of the RCU API where possible when accessing this pointer so that the needed barriers are always in place and to self-document the code. The __rcu annotation to "rp" allows for sparse RCU checking. Plain writes done to the "rp" pointer are changed to make use of the RCU macro for assignment. For the single read, the implementation of get_kretprobe() is simplified by making use of an RCU macro which accomplishes the same, but note that the log warning text will be more generic. I did find that there is a difference in assembly generated between the usage of the RCU macros vs without. For example, on arm64, when using rcu_assign_pointer(), the corresponding store instruction is a store-release (STLR) which has an implicit barrier. When normal assignment is done, a regular store (STR) is found. In the macro case, this seems to be a result of rcu_assign_pointer() using smp_store_release() when the value to write is not NULL. Link: https://lore.kernel.org/all/20231122132058.3359-1-inwardvessel@gmail.com/ Fixes: d741bf41d7c7 ("kprobes: Remove kretprobe hash") Cc: stable@vger.kernel.org Signed-off-by: JP Kobryn <inwardvessel@gmail.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2023-11-30Merge tag 'for-netdev' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2023-11-30 We've added 30 non-merge commits during the last 7 day(s) which contain a total of 58 files changed, 1598 insertions(+), 154 deletions(-). The main changes are: 1) Add initial TX metadata implementation for AF_XDP with support in mlx5 and stmmac drivers. Two types of offloads are supported right now, that is, TX timestamp and TX checksum offload, from Stanislav Fomichev with stmmac implementation from Song Yoong Siang. 2) Change BPF verifier logic to validate global subprograms lazily instead of unconditionally before the main program, so they can be guarded using BPF CO-RE techniques, from Andrii Nakryiko. 3) Add BPF link_info support for uprobe multi link along with bpftool integration for the latter, from Jiri Olsa. 4) Use pkg-config in BPF selftests to determine ld flags which is in particular needed for linking statically, from Akihiko Odaki. 5) Fix a few BPF selftest failures to adapt to the upcoming LLVM18, from Yonghong Song. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (30 commits) bpf/tests: Remove duplicate JSGT tests selftests/bpf: Add TX side to xdp_hw_metadata selftests/bpf: Convert xdp_hw_metadata to XDP_USE_NEED_WAKEUP selftests/bpf: Add TX side to xdp_metadata selftests/bpf: Add csum helpers selftests/xsk: Support tx_metadata_len xsk: Add option to calculate TX checksum in SW xsk: Validate xsk_tx_metadata flags xsk: Document tx_metadata_len layout net: stmmac: Add Tx HWTS support to XDP ZC net/mlx5e: Implement AF_XDP TX timestamp and checksum offload tools: ynl: Print xsk-features from the sample xsk: Add TX timestamp and TX checksum offload support xsk: Support tx_metadata_len selftests/bpf: Use pkg-config for libelf selftests/bpf: Override PKG_CONFIG for static builds selftests/bpf: Choose pkg-config for the target bpftool: Add support to display uprobe_multi links selftests/bpf: Add link_info test for uprobe_multi link selftests/bpf: Use bpf_link__destroy in fill_link_info tests ... ==================== Conflicts: Documentation/netlink/specs/netdev.yaml: 839ff60df3ab ("net: page_pool: add nlspec for basic access to page pools") 48eb03dd2630 ("xsk: Add TX timestamp and TX checksum offload support") https://lore.kernel.org/all/20231201094705.1ee3cab8@canb.auug.org.au/ While at it also regen, tree is dirty after: 48eb03dd2630 ("xsk: Add TX timestamp and TX checksum offload support") looks like code wasn't re-rendered after "render-max" was removed. Link: https://lore.kernel.org/r/20231130145708.32573-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-11-30Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR. No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-12-01Merge tag 'net-6.7-rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from bpf and wifi. Current release - regressions: - neighbour: fix __randomize_layout crash in struct neighbour - r8169: fix deadlock on RTL8125 in jumbo mtu mode Previous releases - regressions: - wifi: - mac80211: fix warning at station removal time - cfg80211: fix CQM for non-range use - tools: ynl-gen: fix unexpected response handling - octeontx2-af: fix possible buffer overflow - dpaa2: recycle the RX buffer only after all processing done - rswitch: fix missing dev_kfree_skb_any() in error path Previous releases - always broken: - ipv4: fix uaf issue when receiving igmp query packet - wifi: mac80211: fix debugfs deadlock at device removal time - bpf: - sockmap: af_unix stream sockets need to hold ref for pair sock - netdevsim: don't accept device bound programs - selftests: fix a char signedness issue - dsa: mv88e6xxx: fix marvell 6350 probe crash - octeontx2-pf: restore TC ingress police rules when interface is up - wangxun: fix memory leak on msix entry - ravb: keep reverse order of operations in ravb_remove()" * tag 'net-6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (51 commits) net: ravb: Keep reverse order of operations in ravb_remove() net: ravb: Stop DMA in case of failures on ravb_open() net: ravb: Start TX queues after HW initialization succeeded net: ravb: Make write access to CXR35 first before accessing other EMAC registers net: ravb: Use pm_runtime_resume_and_get() net: ravb: Check return value of reset_control_deassert() net: libwx: fix memory leak on msix entry ice: Fix VF Reset paths when interface in a failed over aggregate bpf, sockmap: Add af_unix test with both sockets in map bpf, sockmap: af_unix stream sockets need to hold ref for pair sock tools: ynl-gen: always construct struct ynl_req_state ethtool: don't propagate EOPNOTSUPP from dumps ravb: Fix races between ravb_tx_timeout_work() and net related ops r8169: prevent potential deadlock in rtl8169_close r8169: fix deadlock on RTL8125 in jumbo mtu mode neighbour: Fix __randomize_layout crash in struct neighbour octeontx2-pf: Restore TC ingress police rules when interface is up octeontx2-pf: Fix adding mbox work queue entry when num_vfs > 64 net: stmmac: xgmac: Disable FPE MMC interrupts octeontx2-af: Fix possible buffer overflow ...
2023-11-29ptrace: Convert ptrace_attach() to use lock guardsPeter Zijlstra
Created as testing for the conditional guard infrastructure. Specifically this makes use of the following form: scoped_cond_guard (mutex_intr, return -ERESTARTNOINTR, &task->signal->cred_guard_mutex) { ... } ... return 0; Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Link: https://lkml.kernel.org/r/20231102110706.568467727%40infradead.org
2023-11-29freezer,sched: Clean saved_state when restoring it during thawElliot Berman
Clean saved_state after using it during thaw. Cleaning the saved_state allows us to avoid some unnecessary branches in ttwu_state_match. Signed-off-by: Elliot Berman <quic_eberman@quicinc.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20231120-freezer-state-multiple-thaws-v1-2-f2e1dd7ce5a2@quicinc.com
2023-11-29sched/fair: Update min_vruntime for reweight_entity() correctlyYiwei Lin
Since reweight_entity() may have chance to change the weight of cfs_rq->curr entity, we should also update_min_vruntime() if this is the case Fixes: eab03c23c2a1 ("sched/eevdf: Fix vruntime adjustment on reweight") Signed-off-by: Yiwei Lin <s921975628@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Abel Wu <wuyun.abel@bytedance.com> Link: https://lore.kernel.org/r/20231117080106.12890-1-s921975628@gmail.com
2023-11-29perf: Fix perf_event_validate_size()Peter Zijlstra
Budimir noted that perf_event_validate_size() only checks the size of the newly added event, even though the sizes of all existing events can also change due to not all events having the same read_format. When we attach the new event, perf_group_attach(), we do re-compute the size for all events. Fixes: a723968c0ed3 ("perf: Fix u16 overflows") Reported-by: Budimir Markovic <markovicbudimir@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2023-11-29freezer,sched: Do not restore saved_state of a thawed taskElliot Berman
It is possible for a task to be thawed multiple times when mixing the *legacy* cgroup freezer and system-wide freezer. To do this, freeze the cgroup, do system-wide freeze/thaw, then thaw the cgroup. When this happens, then a stale saved_state can be written to the task's state and cause task to hang indefinitely. Fix this by only trying to thaw tasks that are actually frozen. This change also has the marginal benefit avoiding unnecessary wake_up_state(p, TASK_FROZEN) if we know the task is already thawed. There is not possibility of time-of-compare/time-of-use race when we skip the wake_up_state because entering/exiting TASK_FROZEN is guarded by freezer_lock. Fixes: 8f0eed4a78a8 ("freezer,sched: Use saved_state to reduce some spurious wakeups") Signed-off-by: Elliot Berman <quic_eberman@quicinc.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Abhijeet Dharmapurikar <quic_adharmap@quicinc.com> Link: https://lore.kernel.org/r/20231120-freezer-state-multiple-thaws-v1-1-f2e1dd7ce5a2@quicinc.com
2023-11-28bpf: Add link_info support for uprobe multi linkJiri Olsa
Adding support to get uprobe_link details through bpf_link_info interface. Adding new struct uprobe_multi to struct bpf_link_info to carry the uprobe_multi link details. The uprobe_multi.count is passed from user space to denote size of array fields (offsets/ref_ctr_offsets/cookies). The actual array size is stored back to uprobe_multi.count (allowing user to find out the actual array size) and array fields are populated up to the user passed size. All the non-array fields (path/count/flags/pid) are always set. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/bpf/20231125193130.834322-4-jolsa@kernel.org
2023-11-28bpf: Store ref_ctr_offsets values in bpf_uprobe arrayJiri Olsa
We will need to return ref_ctr_offsets values through link_info interface in following change, so we need to keep them around. Storing ref_ctr_offsets values directly into bpf_uprobe array. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/bpf/20231125193130.834322-3-jolsa@kernel.org
2023-11-28cgroup_freezer: cgroup_freezing: Check if not frozenTim Van Patten
__thaw_task() was recently updated to warn if the task being thawed was part of a freezer cgroup that is still currently freezing: void __thaw_task(struct task_struct *p) { ... if (WARN_ON_ONCE(freezing(p))) goto unlock; This has exposed a bug in cgroup1 freezing where when CGROUP_FROZEN is asserted, the CGROUP_FREEZING bits are not also cleared at the same time. Meaning, when a cgroup is marked FROZEN it continues to be marked FREEZING as well. This causes the WARNING to trigger, because cgroup_freezing() thinks the cgroup is still freezing. There are two ways to fix this: 1. Whenever FROZEN is set, clear FREEZING for the cgroup and all children cgroups. 2. Update cgroup_freezing() to also verify that FROZEN is not set. This patch implements option (2), since it's smaller and more straightforward. Signed-off-by: Tim Van Patten <timvp@google.com> Tested-by: Mark Hasemeyer <markhas@chromium.org> Fixes: f5d39b020809 ("freezer,sched: Rewrite core freezer logic") Cc: stable@vger.kernel.org # v6.1+ Signed-off-by: Tejun Heo <tj@kernel.org>
2023-11-28cgroup/cpuset: Expose cpuset.cpus.isolatedWaiman Long
The root-only cpuset.cpus.isolated control file shows the current set of isolated CPUs in isolated partitions. This control file is currently exposed only with the cgroup_debug boot command line option which also adds the ".__DEBUG__." prefix. This is actually a useful control file if users want to find out which CPUs are currently in an isolated state by the cpuset controller. Remove CFTYPE_DEBUG flag for this control file and make it available by default without any prefix. The test_cpuset_prs.sh test script and the cgroup-v2.rst documentation file are also updated accordingly. Minor code change is also made in test_cpuset_prs.sh to avoid false test failure when running on debug kernel. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>