summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2025-04-09bpf: Rename __htab_percpu_map_update_elem to htab_map_update_elem_in_placeHou Tao
Rename __htab_percpu_map_update_elem to htab_map_update_elem_in_place, and add a new percpu argument for the helper to support in-place update for both per-cpu htab and htab of maps. Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20250401062250.543403-3-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-04-09bpf: Factor out htab_elem_value helper()Hou Tao
All hash maps store map key and map value together. The relative offset of the map value compared to the map key is round_up(key_size, 8). Therefore, factor out a common helper htab_elem_value() to calculate the address of the map value instead of duplicating the logic. Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20250401062250.543403-2-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-04-09bpf: Clarify the meaning of BPF_F_PSEUDO_HDRPaul Chaignon
In the bpf_l4_csum_replace helper, the BPF_F_PSEUDO_HDR flag should only be set if the modified header field is part of the pseudo-header. If you modify for example the UDP ports and pass BPF_F_PSEUDO_HDR, inet_proto_csum_replace4 will update skb->csum even though it shouldn't (the port and the UDP checksum updates null each other). Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Link: https://lore.kernel.org/r/5126ef84ba75425b689482cbc98bffe75e5d8ab0.1744102490.git.paul.chaignon@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-04-09bpf: Clarify role of BPF_F_RECOMPUTE_CSUMPaul Chaignon
BPF_F_RECOMPUTE_CSUM doesn't update the actual L3 and L4 checksums in the packet, but simply updates skb->csum (according to skb->ip_summed). This patch clarifies that to avoid confusions. Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Link: https://lore.kernel.org/r/ff6895d42936f03dbb82334d8bcfd50e00c79086.1744102490.git.paul.chaignon@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-04-09Merge branch 'support-skf_net_off-and-skf_ll_off-on-skb-frags'Alexei Starovoitov
Willem de Bruijn says: ==================== support SKF_NET_OFF and SKF_LL_OFF on skb frags From: Willem de Bruijn <willemb@google.com> Address a longstanding issue that may lead to missed packets depending on system configuration. Ensure that reading from packet contents works regardless of skb geometry, also when using the special SKF_.. negative offsets to offset from L2 or L3 header. Patch 2 is the selftest for the fix. v2->v3 - do not remove bpf_internal_load_pointer_neg_helper, because it is still used in the sparc32 JIT v1->v2 - introduce bfp_skb_load_helper_convert_offset to avoid open coding - selftest: add comment why early demux must be disabled v2: https://lore.kernel.org/netdev/20250404142633.1955847-1-willemdebruijn.kernel@gmail.com/ v1: https://lore.kernel.org/netdev/20250403140846.1268564-1-willemdebruijn.kernel@gmail.com/ ==================== Link: https://patch.msgid.link/20250408132833.195491-1-willemdebruijn.kernel@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-04-09selftests/net: test sk_filter support for SKF_NET_OFF on fragsWillem de Bruijn
Verify that a classic BPF linux socket filter correctly matches packet contents. Including when accessing contents in an skb_frag. 1. Open a SOCK_RAW socket with a classic BPF filter on UDP dport 8000. 2. Open a tap device with IFF_NAPI_FRAGS to inject skbs with frags. 3. Send a packet for which the UDP header is in frag[0]. 4. Receive this packet to demonstrate that the socket accepted it. Acked-by: Stanislav Fomichev <sdf@fomichev.me> Signed-off-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/20250408132833.195491-3-willemdebruijn.kernel@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-04-09bpf: support SKF_NET_OFF and SKF_LL_OFF on skb fragsWillem de Bruijn
Classic BPF socket filters with SKB_NET_OFF and SKB_LL_OFF fail to read when these offsets extend into frags. This has been observed with iwlwifi and reproduced with tun with IFF_NAPI_FRAGS. The below straightforward socket filter on UDP port, applied to a RAW socket, will silently miss matching packets. const int offset_proto = offsetof(struct ip6_hdr, ip6_nxt); const int offset_dport = sizeof(struct ip6_hdr) + offsetof(struct udphdr, dest); struct sock_filter filter_code[] = { BPF_STMT(BPF_LD + BPF_B + BPF_ABS, SKF_AD_OFF + SKF_AD_PKTTYPE), BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, PACKET_HOST, 0, 4), BPF_STMT(BPF_LD + BPF_B + BPF_ABS, SKF_NET_OFF + offset_proto), BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, IPPROTO_UDP, 0, 2), BPF_STMT(BPF_LD + BPF_H + BPF_ABS, SKF_NET_OFF + offset_dport), This is unexpected behavior. Socket filter programs should be consistent regardless of environment. Silent misses are particularly concerning as hard to detect. Use skb_copy_bits for offsets outside linear, same as done for non-SKF_(LL|NET) offsets. Offset is always positive after subtracting the reference threshold SKB_(LL|NET)_OFF, so is always >= skb_(mac|network)_offset. The sum of the two is an offset against skb->data, and may be negative, but it cannot point before skb->head, as skb_(mac|network)_offset would too. This appears to go back to when frag support was introduced to sk_run_filter in linux-2.4.4, before the introduction of git. The amount of code change and 8/16/32 bit duplication are unfortunate. But any attempt I made to be smarter saved very few LoC while complicating the code. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Link: https://lore.kernel.org/netdev/20250122200402.3461154-1-maze@google.com/ Link: https://elixir.bootlin.com/linux/2.4.4/source/net/core/filter.c#L244 Reported-by: Matt Moeller <moeller.matt@gmail.com> Co-developed-by: Maciej Żenczykowski <maze@google.com> Signed-off-by: Maciej Żenczykowski <maze@google.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://lore.kernel.org/r/20250408132833.195491-2-willemdebruijn.kernel@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-04-09Merge branch 'bpf-sockmap-fix-data-loss-and-panic-issues'Alexei Starovoitov
Jiayuan Chen says: ==================== bpf, sockmap: Fix data loss and panic issues I was writing a benchmark based on sockmap + TCP and discovered several issues: 1. When EAGAIN occurs, the direction of skb is incorrect, causing data loss when retry. 2. When sending partial data, the offset is not recorded, leading to duplicate data being sent when retry. 3. An unexpected BUG_ON() judgment in skb_linearize is triggered. 4. The memory of psock->ingress_skb is not limited by the socket buffer and memcg. Issues 1, 2, and 3 are described in each patch's commit message. Regarding issue 4, this patchset does not cover it as it is difficult to handle in practice, and I am still working on it. Here is a brief description of the issue: When using sockmap to skb/stream redirect, if the receiving end does not perform read operations, all data will be buffered in ingress_skb. For example: ''' // set memory limit to 50G cgcreate -g memory:myGroup cgset -r memory.max="5000M" myGroup // start benchmark and disable consumer from reading cgexec -g "memory:myGroup" ./bench sockmap -c 2 -p 1 -a --rx-verdict-ingress --delay-consumer=-1 -d 100 Iter 0 ( 29.179us): Send Speed 2668.548 MB/s (20360.406 calls/s), ... Rcv Speed 0.000 MB/s ( 0.000 calls/s) Iter 1 ( -7.237us): Send Speed 2694.467 MB/s (20557.149 calls/s), ... Rcv Speed 0.000 MB/s ( 0.000 calls/s) Iter 2 ( -1.918us): Send Speed 2693.404 MB/s (20548.039 calls/s), ... Rcv Speed 0.000 MB/s ( 0.000 calls/s) Iter 3 ( -0.684us): Send Speed 2693.138 MB/s (20548.014 calls/s), ... Rcv Speed 0.000 MB/s ( 0.000 calls/s) Iter 4 ( 7.879us): Send Speed 2698.620 MB/s (20588.838 calls/s), ... Rcv Speed 0.000 MB/s ( 0.000 calls/s) Iter 5 ( -3.224us): Send Speed 2696.553 MB/s (20573.066 calls/s), ... Rcv Speed 0.000 MB/s ( 0.000 calls/s) Iter 6 ( -5.409us): Send Speed 2699.705 MB/s (20597.111 calls/s), ... Rcv Speed 0.000 MB/s ( 0.000 calls/s) Iter 7 ( -0.439us): Send Speed 2699.691 MB/s (20597.009 calls/s), ... Rcv Speed 0.000 MB/s ( 0.000 calls/s) ... // memory usage are not limited cat /proc/slabinfo | grep skb skbuff_small_head 11824024 11824024 704 46 8 : tunables 0 0 0 : slabdata 257044 257044 0 skbuff_fclone_cache 11822080 11822080 512 32 4 : tunables 0 0 0 : slabdata 369440 369440 0 ''' Thus, a simple socket in a large file upload/download model can eat the entire OS memory. We must charge the skb memory to psock->sk, and if we do not want losing skb, we need to feedback the error info to read_sock/read_skb when the enqueue operation of psock->ingress_skb fails. --- My another patch related to stability also requires maintainers to spare some time from their busy schedules for review. https://lore.kernel.org/bpf/20250317092257.68760-1-jiayuan.chen@linux.dev/T/#t ==================== Link: https://patch.msgid.link/20250407142234.47591-1-jiayuan.chen@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-04-09selftest/bpf/benchs: Add benchmark for sockmap usageJiayuan Chen
Add TCP+sockmap-based benchmark. Since sockmap's own update and delete operations are generally less critical, the performance of the fast forwarding framework built upon it is the key aspect. Also with cgset/cgexec, we can observe the behavior of sockmap under memory pressure. The benchmark can be run with: ''' ./bench sockmap -c 2 -p 1 -a --rx-verdict-ingress ''' In the future, we plan to move socket_helpers.h out of the prog_tests directory to make it accessible for the benchmark. This will enable better support for various socket types. Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Link: https://lore.kernel.org/r/20250407142234.47591-5-jiayuan.chen@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-04-09bpf, sockmap: Fix panic when calling skb_linearizeJiayuan Chen
The panic can be reproduced by executing the command: ./bench sockmap -c 2 -p 1 -a --rx-verdict-ingress --rx-strp 100000 Then a kernel panic was captured: ''' [ 657.460555] kernel BUG at net/core/skbuff.c:2178! [ 657.462680] Tainted: [W]=WARN [ 657.463287] Workqueue: events sk_psock_backlog ... [ 657.469610] <TASK> [ 657.469738] ? die+0x36/0x90 [ 657.469916] ? do_trap+0x1d0/0x270 [ 657.470118] ? pskb_expand_head+0x612/0xf40 [ 657.470376] ? pskb_expand_head+0x612/0xf40 [ 657.470620] ? do_error_trap+0xa3/0x170 [ 657.470846] ? pskb_expand_head+0x612/0xf40 [ 657.471092] ? handle_invalid_op+0x2c/0x40 [ 657.471335] ? pskb_expand_head+0x612/0xf40 [ 657.471579] ? exc_invalid_op+0x2d/0x40 [ 657.471805] ? asm_exc_invalid_op+0x1a/0x20 [ 657.472052] ? pskb_expand_head+0xd1/0xf40 [ 657.472292] ? pskb_expand_head+0x612/0xf40 [ 657.472540] ? lock_acquire+0x18f/0x4e0 [ 657.472766] ? find_held_lock+0x2d/0x110 [ 657.472999] ? __pfx_pskb_expand_head+0x10/0x10 [ 657.473263] ? __kmalloc_cache_noprof+0x5b/0x470 [ 657.473537] ? __pfx___lock_release.isra.0+0x10/0x10 [ 657.473826] __pskb_pull_tail+0xfd/0x1d20 [ 657.474062] ? __kasan_slab_alloc+0x4e/0x90 [ 657.474707] sk_psock_skb_ingress_enqueue+0x3bf/0x510 [ 657.475392] ? __kasan_kmalloc+0xaa/0xb0 [ 657.476010] sk_psock_backlog+0x5cf/0xd70 [ 657.476637] process_one_work+0x858/0x1a20 ''' The panic originates from the assertion BUG_ON(skb_shared(skb)) in skb_linearize(). A previous commit(see Fixes tag) introduced skb_get() to avoid race conditions between skb operations in the backlog and skb release in the recvmsg path. However, this caused the panic to always occur when skb_linearize is executed. The "--rx-strp 100000" parameter forces the RX path to use the strparser module which aggregates data until it reaches 100KB before calling sockmap logic. The 100KB payload exceeds MAX_MSG_FRAGS, triggering skb_linearize. To fix this issue, just move skb_get into sk_psock_skb_ingress_enqueue. ''' sk_psock_backlog: sk_psock_handle_skb skb_get(skb) <== we move it into 'sk_psock_skb_ingress_enqueue' sk_psock_skb_ingress____________ ↓ | | → sk_psock_skb_ingress_self | sk_psock_skb_ingress_enqueue sk_psock_verdict_apply_________________↑ skb_linearize ''' Note that for verdict_apply path, the skb_get operation is unnecessary so we add 'take_ref' param to control it's behavior. Fixes: a454d84ee20b ("bpf, sockmap: Fix skb refcnt race after locking changes") Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Link: https://lore.kernel.org/r/20250407142234.47591-4-jiayuan.chen@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-04-09bpf, sockmap: fix duplicated data transmissionJiayuan Chen
In the !ingress path under sk_psock_handle_skb(), when sending data to the remote under snd_buf limitations, partial skb data might be transmitted. Although we preserved the partial transmission state (offset/length), the state wasn't properly consumed during retries. This caused the retry path to resend the entire skb data instead of continuing from the previous offset, resulting in data overlap at the receiver side. Fixes: 405df89dd52c ("bpf, sockmap: Improved check for empty queue") Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Link: https://lore.kernel.org/r/20250407142234.47591-3-jiayuan.chen@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-04-09bpf, sockmap: Fix data lost during EAGAIN retriesJiayuan Chen
We call skb_bpf_redirect_clear() to clean _sk_redir before handling skb in backlog, but when sk_psock_handle_skb() return EAGAIN due to sk_rcvbuf limit, the redirect info in _sk_redir is not recovered. Fix skb redir loss during EAGAIN retries by restoring _sk_redir information using skb_bpf_set_redir(). Before this patch: ''' ./bench sockmap -c 2 -p 1 -a --rx-verdict-ingress Setting up benchmark 'sockmap'... create socket fd c1:13 p1:14 c2:15 p2:16 Benchmark 'sockmap' started. Send Speed 1343.172 MB/s, BPF Speed 1343.238 MB/s, Rcv Speed 65.271 MB/s Send Speed 1352.022 MB/s, BPF Speed 1352.088 MB/s, Rcv Speed 0 MB/s Send Speed 1354.105 MB/s, BPF Speed 1354.105 MB/s, Rcv Speed 0 MB/s Send Speed 1355.018 MB/s, BPF Speed 1354.887 MB/s, Rcv Speed 0 MB/s ''' Due to the high send rate, the RX processing path may frequently hit the sk_rcvbuf limit. Once triggered, incorrect _sk_redir will cause the flow to mistakenly enter the "!ingress" path, leading to send failures. (The Rcv speed depends on tcp_rmem). After this patch: ''' ./bench sockmap -c 2 -p 1 -a --rx-verdict-ingress Setting up benchmark 'sockmap'... create socket fd c1:13 p1:14 c2:15 p2:16 Benchmark 'sockmap' started. Send Speed 1347.236 MB/s, BPF Speed 1347.367 MB/s, Rcv Speed 65.402 MB/s Send Speed 1353.320 MB/s, BPF Speed 1353.320 MB/s, Rcv Speed 65.536 MB/s Send Speed 1353.186 MB/s, BPF Speed 1353.121 MB/s, Rcv Speed 65.536 MB/s ''' Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Link: https://lore.kernel.org/r/20250407142234.47591-2-jiayuan.chen@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-04-09Merge branch 'bpf-fix-ktls-panic-with-sockmap-and-add-tests'Alexei Starovoitov
Jiayuan Chen says: ==================== bpf: fix ktls panic with sockmap and add tests We can reproduce the issue using the existing test program: './test_sockmap --ktls' Or use the selftest I provided, which will cause a panic: ------------[ cut here ]------------ kernel BUG at lib/iov_iter.c:629! PKRU: 55555554 Call Trace: <TASK> ? die+0x36/0x90 ? do_trap+0xdd/0x100 ? iov_iter_revert+0x178/0x180 ? iov_iter_revert+0x178/0x180 ? do_error_trap+0x7d/0x110 ? iov_iter_revert+0x178/0x180 ? exc_invalid_op+0x50/0x70 ? iov_iter_revert+0x178/0x180 ? asm_exc_invalid_op+0x1a/0x20 ? iov_iter_revert+0x178/0x180 ? iov_iter_revert+0x5c/0x180 tls_sw_sendmsg_locked.isra.0+0x794/0x840 tls_sw_sendmsg+0x52/0x80 ? inet_sendmsg+0x1f/0x70 __sys_sendto+0x1cd/0x200 ? find_held_lock+0x2b/0x80 ? syscall_trace_enter+0x140/0x270 ? __lock_release.isra.0+0x5e/0x170 ? find_held_lock+0x2b/0x80 ? syscall_trace_enter+0x140/0x270 ? lockdep_hardirqs_on_prepare+0xda/0x190 ? ktime_get_coarse_real_ts64+0xc2/0xd0 __x64_sys_sendto+0x24/0x30 do_syscall_64+0x90/0x170 1. It looks like the issue started occurring after bpf being introduced to ktls and later the addition of assertions to iov_iter has caused a panic. If my fix tag is incorrect, please assist me in correcting the fix tag. 2. I make minimal changes for now, it's enough to make ktls work correctly. --- v1->v2: Added more content to the commit message https://lore.kernel.org/all/20250123171552.57345-1-mrpre@163.com/#r --- ==================== Link: https://patch.msgid.link/20250219052015.274405-1-jiayuan.chen@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-04-09selftests/bpf: add ktls selftestJiayuan Chen
add ktls selftest for sockmap Test results: sockmap_ktls/sockmap_ktls disconnect_after_delete IPv4 SOCKMAP:OK sockmap_ktls/sockmap_ktls update_fails_when_sock_has_ulp IPv4 SOCKMAP:OK sockmap_ktls/sockmap_ktls disconnect_after_delete IPv4 SOCKMAP:OK sockmap_ktls/sockmap_ktls update_fails_when_sock_has_ulp IPv4 SOCKMAP:OK sockmap_ktls/sockmap_ktls disconnect_after_delete IPv4 SOCKMAP:OK sockmap_ktls/sockmap_ktls update_fails_when_sock_has_ulp IPv4 SOCKMAP:OK sockmap_ktls/sockmap_ktls disconnect_after_delete IPv4 SOCKMAP:OK sockmap_ktls/sockmap_ktls update_fails_when_sock_has_ulp IPv4 SOCKMAP:OK sockmap_ktls/tls simple offload:OK sockmap_ktls/tls tx cork:OK sockmap_ktls/tls tx cork with push:OK sockmap_ktls/tls simple offload:OK sockmap_ktls/tls tx cork:OK sockmap_ktls/tls tx cork with push:OK sockmap_ktls:OK Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/r/20250219052015.274405-3-jiayuan.chen@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-04-09bpf: fix ktls panic with sockmapJiayuan Chen
[ 2172.936997] ------------[ cut here ]------------ [ 2172.936999] kernel BUG at lib/iov_iter.c:629! ...... [ 2172.944996] PKRU: 55555554 [ 2172.945155] Call Trace: [ 2172.945299] <TASK> [ 2172.945428] ? die+0x36/0x90 [ 2172.945601] ? do_trap+0xdd/0x100 [ 2172.945795] ? iov_iter_revert+0x178/0x180 [ 2172.946031] ? iov_iter_revert+0x178/0x180 [ 2172.946267] ? do_error_trap+0x7d/0x110 [ 2172.946499] ? iov_iter_revert+0x178/0x180 [ 2172.946736] ? exc_invalid_op+0x50/0x70 [ 2172.946961] ? iov_iter_revert+0x178/0x180 [ 2172.947197] ? asm_exc_invalid_op+0x1a/0x20 [ 2172.947446] ? iov_iter_revert+0x178/0x180 [ 2172.947683] ? iov_iter_revert+0x5c/0x180 [ 2172.947913] tls_sw_sendmsg_locked.isra.0+0x794/0x840 [ 2172.948206] tls_sw_sendmsg+0x52/0x80 [ 2172.948420] ? inet_sendmsg+0x1f/0x70 [ 2172.948634] __sys_sendto+0x1cd/0x200 [ 2172.948848] ? find_held_lock+0x2b/0x80 [ 2172.949072] ? syscall_trace_enter+0x140/0x270 [ 2172.949330] ? __lock_release.isra.0+0x5e/0x170 [ 2172.949595] ? find_held_lock+0x2b/0x80 [ 2172.949817] ? syscall_trace_enter+0x140/0x270 [ 2172.950211] ? lockdep_hardirqs_on_prepare+0xda/0x190 [ 2172.950632] ? ktime_get_coarse_real_ts64+0xc2/0xd0 [ 2172.951036] __x64_sys_sendto+0x24/0x30 [ 2172.951382] do_syscall_64+0x90/0x170 ...... After calling bpf_exec_tx_verdict(), the size of msg_pl->sg may increase, e.g., when the BPF program executes bpf_msg_push_data(). If the BPF program sets cork_bytes and sg.size is smaller than cork_bytes, it will return -ENOSPC and attempt to roll back to the non-zero copy logic. However, during rollback, msg->msg_iter is reset, but since msg_pl->sg.size has been increased, subsequent executions will exceed the actual size of msg_iter. ''' iov_iter_revert(&msg->msg_iter, msg_pl->sg.size - orig_size); ''' The changes in this commit are based on the following considerations: 1. When cork_bytes is set, rolling back to non-zero copy logic is pointless and can directly go to zero-copy logic. 2. We can not calculate the correct number of bytes to revert msg_iter. Assume the original data is "abcdefgh" (8 bytes), and after 3 pushes by the BPF program, it becomes 11-byte data: "abc?de?fgh?". Then, we set cork_bytes to 6, which means the first 6 bytes have been processed, and the remaining 5 bytes "?fgh?" will be cached until the length meets the cork_bytes requirement. However, some data in "?fgh?" is not within 'sg->msg_iter' (but in msg_pl instead), especially the data "?" we pushed. So it doesn't seem as simple as just reverting through an offset of msg_iter. 3. For non-TLS sockets in tcp_bpf_sendmsg, when a "cork" situation occurs, the user-space send() doesn't return an error, and the returned length is the same as the input length parameter, even if some data is cached. Additionally, I saw that the current non-zero-copy logic for handling corking is written as: ''' line 1177 else if (ret != -EAGAIN) { if (ret == -ENOSPC) ret = 0; goto send_end; ''' So it's ok to just return 'copied' without error when a "cork" situation occurs. Fixes: fcb14cb1bdac ("new iov_iter flavour - ITER_UBUF") Fixes: d3b18ad31f93 ("tls: add bpf support to sk_msg handling") Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/r/20250219052015.274405-2-jiayuan.chen@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-04-09net: Fix null-ptr-deref by sock_lock_init_class_and_name() and rmmod.Kuniyuki Iwashima
When I ran the repro [0] and waited a few seconds, I observed two LOCKDEP splats: a warning immediately followed by a null-ptr-deref. [1] Reproduction Steps: 1) Mount CIFS 2) Add an iptables rule to drop incoming FIN packets for CIFS 3) Unmount CIFS 4) Unload the CIFS module 5) Remove the iptables rule At step 3), the CIFS module calls sock_release() for the underlying TCP socket, and it returns quickly. However, the socket remains in FIN_WAIT_1 because incoming FIN packets are dropped. At this point, the module's refcnt is 0 while the socket is still alive, so the following rmmod command succeeds. # ss -tan State Recv-Q Send-Q Local Address:Port Peer Address:Port FIN-WAIT-1 0 477 10.0.2.15:51062 10.0.0.137:445 # lsmod | grep cifs cifs 1159168 0 This highlights a discrepancy between the lifetime of the CIFS module and the underlying TCP socket. Even after CIFS calls sock_release() and it returns, the TCP socket does not die immediately in order to close the connection gracefully. While this is generally fine, it causes an issue with LOCKDEP because CIFS assigns a different lock class to the TCP socket's sk->sk_lock using sock_lock_init_class_and_name(). Once an incoming packet is processed for the socket or a timer fires, sk->sk_lock is acquired. Then, LOCKDEP checks the lock context in check_wait_context(), where hlock_class() is called to retrieve the lock class. However, since the module has already been unloaded, hlock_class() logs a warning and returns NULL, triggering the null-ptr-deref. If LOCKDEP is enabled, we must ensure that a module calling sock_lock_init_class_and_name() (CIFS, NFS, etc) cannot be unloaded while such a socket is still alive to prevent this issue. Let's hold the module reference in sock_lock_init_class_and_name() and release it when the socket is freed in sk_prot_free(). Note that sock_lock_init() clears sk->sk_owner for svc_create_socket() that calls sock_lock_init_class_and_name() for a listening socket, which clones a socket by sk_clone_lock() without GFP_ZERO. [0]: CIFS_SERVER="10.0.0.137" CIFS_PATH="//${CIFS_SERVER}/Users/Administrator/Desktop/CIFS_TEST" DEV="enp0s3" CRED="/root/WindowsCredential.txt" MNT=$(mktemp -d /tmp/XXXXXX) mount -t cifs ${CIFS_PATH} ${MNT} -o vers=3.0,credentials=${CRED},cache=none,echo_interval=1 iptables -A INPUT -s ${CIFS_SERVER} -j DROP for i in $(seq 10); do umount ${MNT} rmmod cifs sleep 1 done rm -r ${MNT} iptables -D INPUT -s ${CIFS_SERVER} -j DROP [1]: DEBUG_LOCKS_WARN_ON(1) WARNING: CPU: 10 PID: 0 at kernel/locking/lockdep.c:234 hlock_class (kernel/locking/lockdep.c:234 kernel/locking/lockdep.c:223) Modules linked in: cifs_arc4 nls_ucs2_utils cifs_md4 [last unloaded: cifs] CPU: 10 UID: 0 PID: 0 Comm: swapper/10 Not tainted 6.14.0 #36 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 RIP: 0010:hlock_class (kernel/locking/lockdep.c:234 kernel/locking/lockdep.c:223) ... Call Trace: <IRQ> __lock_acquire (kernel/locking/lockdep.c:4853 kernel/locking/lockdep.c:5178) lock_acquire (kernel/locking/lockdep.c:469 kernel/locking/lockdep.c:5853 kernel/locking/lockdep.c:5816) _raw_spin_lock_nested (kernel/locking/spinlock.c:379) tcp_v4_rcv (./include/linux/skbuff.h:1678 ./include/net/tcp.h:2547 net/ipv4/tcp_ipv4.c:2350) ... BUG: kernel NULL pointer dereference, address: 00000000000000c4 PF: supervisor read access in kernel mode PF: error_code(0x0000) - not-present page PGD 0 Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI CPU: 10 UID: 0 PID: 0 Comm: swapper/10 Tainted: G W 6.14.0 #36 Tainted: [W]=WARN Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 RIP: 0010:__lock_acquire (kernel/locking/lockdep.c:4852 kernel/locking/lockdep.c:5178) Code: 15 41 09 c7 41 8b 44 24 20 25 ff 1f 00 00 41 09 c7 8b 84 24 a0 00 00 00 45 89 7c 24 20 41 89 44 24 24 e8 e1 bc ff ff 4c 89 e7 <44> 0f b6 b8 c4 00 00 00 e8 d1 bc ff ff 0f b6 80 c5 00 00 00 88 44 RSP: 0018:ffa0000000468a10 EFLAGS: 00010046 RAX: 0000000000000000 RBX: ff1100010091cc38 RCX: 0000000000000027 RDX: ff1100081f09ca48 RSI: 0000000000000001 RDI: ff1100010091cc88 RBP: ff1100010091c200 R08: ff1100083fe6e228 R09: 00000000ffffbfff R10: ff1100081eca0000 R11: ff1100083fe10dc0 R12: ff1100010091cc88 R13: 0000000000000001 R14: 0000000000000000 R15: 00000000000424b1 FS: 0000000000000000(0000) GS:ff1100081f080000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000000000c4 CR3: 0000000002c4a003 CR4: 0000000000771ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <IRQ> lock_acquire (kernel/locking/lockdep.c:469 kernel/locking/lockdep.c:5853 kernel/locking/lockdep.c:5816) _raw_spin_lock_nested (kernel/locking/spinlock.c:379) tcp_v4_rcv (./include/linux/skbuff.h:1678 ./include/net/tcp.h:2547 net/ipv4/tcp_ipv4.c:2350) ip_protocol_deliver_rcu (net/ipv4/ip_input.c:205 (discriminator 1)) ip_local_deliver_finish (./include/linux/rcupdate.h:878 net/ipv4/ip_input.c:234) ip_sublist_rcv_finish (net/ipv4/ip_input.c:576) ip_list_rcv_finish (net/ipv4/ip_input.c:628) ip_list_rcv (net/ipv4/ip_input.c:670) __netif_receive_skb_list_core (net/core/dev.c:5939 net/core/dev.c:5986) netif_receive_skb_list_internal (net/core/dev.c:6040 net/core/dev.c:6129) napi_complete_done (./include/linux/list.h:37 ./include/net/gro.h:519 ./include/net/gro.h:514 net/core/dev.c:6496) e1000_clean (drivers/net/ethernet/intel/e1000/e1000_main.c:3815) __napi_poll.constprop.0 (net/core/dev.c:7191) net_rx_action (net/core/dev.c:7262 net/core/dev.c:7382) handle_softirqs (kernel/softirq.c:561) __irq_exit_rcu (kernel/softirq.c:596 kernel/softirq.c:435 kernel/softirq.c:662) irq_exit_rcu (kernel/softirq.c:680) common_interrupt (arch/x86/kernel/irq.c:280 (discriminator 14)) </IRQ> <TASK> asm_common_interrupt (./arch/x86/include/asm/idtentry.h:693) RIP: 0010:default_idle (./arch/x86/include/asm/irqflags.h:37 ./arch/x86/include/asm/irqflags.h:92 arch/x86/kernel/process.c:744) Code: 4c 01 c7 4c 29 c2 e9 72 ff ff ff 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa eb 07 0f 00 2d c3 2b 15 00 fb f4 <fa> c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 RSP: 0018:ffa00000000ffee8 EFLAGS: 00000202 RAX: 000000000000640b RBX: ff1100010091c200 RCX: 0000000000061aa4 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff812f30c5 RBP: 000000000000000a R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000002 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 ? do_idle (kernel/sched/idle.c:186 kernel/sched/idle.c:325) default_idle_call (./include/linux/cpuidle.h:143 kernel/sched/idle.c:118) do_idle (kernel/sched/idle.c:186 kernel/sched/idle.c:325) cpu_startup_entry (kernel/sched/idle.c:422 (discriminator 1)) start_secondary (arch/x86/kernel/smpboot.c:315) common_startup_64 (arch/x86/kernel/head_64.S:421) </TASK> Modules linked in: cifs_arc4 nls_ucs2_utils cifs_md4 [last unloaded: cifs] CR2: 00000000000000c4 Fixes: ed07536ed673 ("[PATCH] lockdep: annotate nfs/nfsd in-kernel sockets") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20250407163313.22682-1-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-09ipv6: Align behavior across nexthops during path selectionIdo Schimmel
A nexthop is only chosen when the calculated multipath hash falls in the nexthop's hash region (i.e., the hash is smaller than the nexthop's hash threshold) and when the nexthop is assigned a non-negative score by rt6_score_route(). Commit 4d0ab3a6885e ("ipv6: Start path selection from the first nexthop") introduced an unintentional difference between the first nexthop and the rest when the score is negative. When the first nexthop matches, but has a negative score, the code will currently evaluate subsequent nexthops until one is found with a non-negative score. On the other hand, when a different nexthop matches, but has a negative score, the code will fallback to the nexthop with which the selection started ('match'). Align the behavior across all nexthops and fallback to 'match' when the first nexthop matches, but has a negative score. Fixes: 3d709f69a3e7 ("ipv6: Use hash-threshold instead of modulo-N") Fixes: 4d0ab3a6885e ("ipv6: Start path selection from the first nexthop") Reported-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com> Closes: https://lore.kernel.org/netdev/67efef607bc41_1ddca82948c@willemb.c.googlers.com.notmuch/ Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250408084316.243559-1-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-09net: phy: allow MDIO bus PM ops to start/stop state machine for ↵Vladimir Oltean
phylink-controlled PHY DSA has 2 kinds of drivers: 1. Those who call dsa_switch_suspend() and dsa_switch_resume() from their device PM ops: qca8k-8xxx, bcm_sf2, microchip ksz 2. Those who don't: all others. The above methods should be optional. For type 1, dsa_switch_suspend() calls dsa_user_suspend() -> phylink_stop(), and dsa_switch_resume() calls dsa_user_resume() -> phylink_start(). These seem good candidates for setting mac_managed_pm = true because that is essentially its definition [1], but that does not seem to be the biggest problem for now, and is not what this change focuses on. Talking strictly about the 2nd category of DSA drivers here (which do not have MAC managed PM, meaning that for their attached PHYs, mdio_bus_phy_suspend() and mdio_bus_phy_resume() should run in full), I have noticed that the following warning from mdio_bus_phy_resume() is triggered: WARN_ON(phydev->state != PHY_HALTED && phydev->state != PHY_READY && phydev->state != PHY_UP); because the PHY state machine is running. It's running as a result of a previous dsa_user_open() -> ... -> phylink_start() -> phy_start() having been initiated by the user. The previous mdio_bus_phy_suspend() was supposed to have called phy_stop_machine(), but it didn't. So this is why the PHY is in state PHY_NOLINK by the time mdio_bus_phy_resume() runs. mdio_bus_phy_suspend() did not call phy_stop_machine() because for phylink, the phydev->adjust_link function pointer is NULL. This seems a technicality introduced by commit fddd91016d16 ("phylib: fix PAL state machine restart on resume"). That commit was written before phylink existed, and was intended to avoid crashing with consumer drivers which don't use the PHY state machine - phylink always does, when using a PHY. But phylink itself has historically not been developed with suspend/resume in mind, and apparently not tested too much in that scenario, allowing this bug to exist unnoticed for so long. Plus, prior to the WARN_ON(), it would have likely been invisible. This issue is not in fact restricted to type 2 DSA drivers (according to the above ad-hoc classification), but can be extrapolated to any MAC driver with phylink and MDIO-bus-managed PHY PM ops. DSA is just where the issue was reported. Assuming mac_managed_pm is set correctly, a quick search indicates the following other drivers might be affected: $ grep -Zlr PHYLINK_NETDEV drivers/ | xargs -0 grep -L mac_managed_pm drivers/net/ethernet/atheros/ag71xx.c drivers/net/ethernet/microchip/sparx5/sparx5_main.c drivers/net/ethernet/microchip/lan966x/lan966x_main.c drivers/net/ethernet/freescale/dpaa2/dpaa2-mac.c drivers/net/ethernet/freescale/fs_enet/fs_enet-main.c drivers/net/ethernet/freescale/dpaa/dpaa_eth.c drivers/net/ethernet/freescale/ucc_geth.c drivers/net/ethernet/freescale/enetc/enetc_pf_common.c drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c drivers/net/ethernet/marvell/mvneta.c drivers/net/ethernet/marvell/prestera/prestera_main.c drivers/net/ethernet/mediatek/mtk_eth_soc.c drivers/net/ethernet/altera/altera_tse_main.c drivers/net/ethernet/wangxun/txgbe/txgbe_phy.c drivers/net/ethernet/meta/fbnic/fbnic_phylink.c drivers/net/ethernet/tehuti/tn40_phy.c drivers/net/ethernet/mscc/ocelot_net.c Make the existing conditions dependent on the PHY device having a phydev->phy_link_change() implementation equal to the default phy_link_change() provided by phylib. Otherwise, we implicitly know that the phydev has the phylink-provided phylink_phy_change() callback, and when phylink is used, the PHY state machine always needs to be stopped/ started on the suspend/resume path. The code is structured as such that if phydev->phy_link_change() is absent, it is a matter of time until the kernel will crash - no need to further complicate the test. Thus, for the situation where the PM is not managed by the MAC, we will make the MDIO bus PM ops treat identically the phylink-controlled PHYs with the phylib-controlled PHYs where an adjust_link() callback is supplied. In both cases, the MDIO bus PM ops should stop and restart the PHY state machine. [1] https://lore.kernel.org/netdev/Z-1tiW9zjcoFkhwc@shell.armlinux.org.uk/ Fixes: 744d23c71af3 ("net: phy: Warn about incorrect mdio_bus_phy_resume() state") Reported-by: Wei Fang <wei.fang@nxp.com> Tested-by: Wei Fang <wei.fang@nxp.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://patch.msgid.link/20250407094042.2155633-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-09net: phy: move phy_link_change() prior to mdio_bus_phy_may_suspend()Vladimir Oltean
In an upcoming change, mdio_bus_phy_may_suspend() will need to distinguish a phylib-based PHY client from a phylink PHY client. For that, it will need to compare the phydev->phy_link_change() function pointer with the eponymous phy_link_change() provided by phylib. To avoid forward function declarations, the default PHY link state change method should be moved upwards. There is no functional change associated with this patch, it is only to reduce the noise from a real bug fix. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://patch.msgid.link/20250407093900.2155112-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-09selftests/bpf: Fix bpf_nf selftest failureSaket Kumar Bhaskar
For systems with missing iptables-legacy tool this selftest fails. Add check to find if iptables-legacy tool is available and skip the test if the tool is missing. Fixes: de9c8d848d90 ("selftests/bpf: S/iptables/iptables-legacy/ in the bpf_nf and xdp_synproxy test") Signed-off-by: Saket Kumar Bhaskar <skb99@linux.ibm.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250409095633.33653-1-skb99@linux.ibm.com
2025-04-09bpf: Check link_create.flags parameter for multi_uprobeTao Chen
The link_create.flags are currently not used for multi-uprobes, so return -EINVAL if it is set, same as for other attach APIs. We allow target_fd to have an arbitrary value for multi-uprobe, though, as there are existing users (libbpf) relying on this. Fixes: 89ae89f53d20 ("bpf: Add multi uprobe link") Signed-off-by: Tao Chen <chen.dylane@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250407035752.1108927-2-chen.dylane@linux.dev
2025-04-09bpf: Check link_create.flags parameter for multi_kprobeTao Chen
The link_create.flags are currently not used for multi-kprobes, so return -EINVAL if it is set, same as for other attach APIs. We allow target_fd, on the other hand, to have an arbitrary value for multi-kprobe, as there are existing users (libbpf) relying on this. Fixes: 0dcac2725406 ("bpf: Add multi kprobe link") Signed-off-by: Tao Chen <chen.dylane@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250407035752.1108927-1-chen.dylane@linux.dev
2025-04-09Merge branch 'libbpf-introduce-line_info-and-func_info-getters'Andrii Nakryiko
Mykyta Yatsenko says: ==================== libbpf: introduce line_info and func_info getters From: Mykyta Yatsenko <yatsenko@meta.com> This patchset introduces new libbpf API getters that enable the retrieval of .BTF.ext line and func info. This change enables users to load bpf_program directly using bpf_prog_load, bypassing the higher-level bpf_object__load API. Providing line and function info is essential for BPF program verification in some cases. v3 -> v5 * Fix tests on s390x, nits. v2 -> v3 * Return ENOTSUPP if func or line info struct size differs from the one in uapi linux headers. * Add selftests. v1 -> v2 Move bpf_line_info_min and bpf_func_info_min from libbpf_internal.h to btf.h. Did not remove _min suffix, because there already are bpf_line_info and bpf_func_info structs in uapi/../bpf.h. ==================== Link: https://patch.msgid.link/20250408234417.452565-1-mykyta.yatsenko5@gmail.com Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
2025-04-09selftests/bpf: Add BTF.ext line/func info getter testsMykyta Yatsenko
Add selftests checking that line and func info retrieved by newly added libbpf APIs are the same as returned by kernel via bpf_prog_get_info_by_fd. Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250408234417.452565-3-mykyta.yatsenko5@gmail.com
2025-04-09libbpf: Add getters for BTF.ext func and line infoMykyta Yatsenko
Introducing new libbpf API getters for BTF.ext func and line info, namely: bpf_program__func_info bpf_program__func_info_cnt bpf_program__line_info bpf_program__line_info_cnt This change enables scenarios, when user needs to load bpf_program directly using `bpf_prog_load`, instead of higher-level `bpf_object__load`. Line and func info are required for checking BTF info in verifier; verification may fail without these fields if, for example, program calls `bpf_obj_new`. Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250408234417.452565-2-mykyta.yatsenko5@gmail.com
2025-04-09selftests/bpf: Support struct/union presets in veristatMykyta Yatsenko
Extend commit e3c9abd0d14b ("selftests/bpf: Implement setting global variables in veristat") to support applying presets to members of the global structs or unions in veristat. For example: ``` ./veristat set_global_vars.bpf.o -G "union1.struct3.var_u8_h = 0xBB" ``` Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250408104544.140317-1-mykyta.yatsenko5@gmail.com
2025-04-09Merge tag 'linux_kselftest-fixes-6.15-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest Pull kselftest fixes from Shuah Khan: - Fixes tpm2, futex, and mincore tests - Create a dedicated .gitignore for tpm2 tests * tag 'linux_kselftest-fixes-6.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest: selftests/mincore: Allow read-ahead pages to reach the end of the file selftests/futex: futex_waitv wouldblock test should fail selftests: tpm2: test_smoke: use POSIX-conformant expression operator selftests: tpm2: create a dedicated .gitignore
2025-04-09cifs: Fix querying of WSL CHR and BLK reparse points over SMB1Pali Rohár
When reparse point in SMB1 query_path_info() callback was detected then query also for EA $LXDEV. In this EA are stored device major and minor numbers used by WSL CHR and BLK reparse points. Without major and minor numbers, stat() syscall does not work for char and block devices. Similar code is already in SMB2+ query_path_info() callback function. Signed-off-by: Pali Rohár <pali@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-04-09cifs: Split parse_reparse_point callback to functions: get buffer and parse ↵Pali Rohár
buffer Parsing reparse point buffer is generic for all SMB versions and is already implemented by global function parse_reparse_point(). Getting reparse point buffer from the SMB response is SMB version specific, so introduce for it a new callback get_reparse_point_buffer. This functionality split is needed for followup change - getting reparse point buffer without parsing it. Signed-off-by: Pali Rohár <pali@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-04-09cifs: Improve handling of name surrogate reparse points in reparse.cPali Rohár
Like previous changes for file inode.c, handle directory name surrogate reparse points generally also in reparse.c. Signed-off-by: Pali Rohár <pali@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-04-09cifs: Remove explicit handling of IO_REPARSE_TAG_MOUNT_POINT in inode.cPali Rohár
IO_REPARSE_TAG_MOUNT_POINT is just a specific case of directory Name Surrogate reparse point. As reparse_info_to_fattr() already handles all directory Name Surrogate reparse point (done by the previous change), there is no need to have explicit case for IO_REPARSE_TAG_MOUNT_POINT. Signed-off-by: Pali Rohár <pali@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-04-09timekeeping: Add a lockdep override in tick_freeze()Sebastian Andrzej Siewior
tick_freeze() acquires a raw spinlock (tick_freeze_lock). Later in the callchain (timekeeping_suspend() -> mc146818_avoid_UIP()) the RTC driver acquires a spinlock which becomes a sleeping lock on PREEMPT_RT. Lockdep complains about this lock nesting. Add a lockdep override for this special case and a comment explaining why it is okay. Reported-by: Borislav Petkov <bp@alien8.de> Reported-by: Chris Bainbridge <chris.bainbridge@gmail.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/20250404133429.pnAzf-eF@linutronix.de Closes: https://lore.kernel.org/all/20250330113202.GAZ-krsjAnurOlTcp-@fat_crate.local/ Closes: https://lore.kernel.org/all/CAP-bSRZ0CWyZZsMtx046YV8L28LhY0fson2g4EqcwRAVN1Jk+Q@mail.gmail.com/
2025-04-09cxl/pci: Drop the parameter is_port of cxl_gpf_get_dvsec()Li Ming
The first parameter of cxl_gpf_get_dvsec() is a struct device, can be used to distinguish if the device is a cxl dport or a cxl pci device by checking the PCIe type of it, so the parameter is_port is unnecessary to cxl_gpf_get_dvsec(), using parameter struct device is enough. Signed-off-by: Li Ming <ming.li@zohomail.com> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Davidlohr Bueso <dave@stgolabs.net> Link: https://patch.msgid.link/20250323093110.233040-4-ming.li@zohomail.com Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2025-04-09cxl/pci: Update Port GPF timeout only when the first EP attachingLi Ming
update_gpf_port_dvsec() is used to update GPF Phase timeout, if a CXL switch is under a CXL root port, update_gpf_port_dvsec() will be invoked on the CXL root port when each cxl memory device under the CXL switch is attaching. It is enough to be invoked once, others are redundant. When the first EP attaching, it always triggers its ancestor dports to locate their own Port GPF DVSEC. The change is that invoking update_gpf_port_dvsec() on these ancestor dports after ancestor dport locating a Port GPF DVSEC. It guarantees that update_gpf_port_dvsec() is invoked on a dport only happens during the first EP attaching. Signed-off-by: Li Ming <ming.li@zohomail.com> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Davidlohr Bueso <dave@stgolabs.net> Link: https://patch.msgid.link/20250323093110.233040-3-ming.li@zohomail.com Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2025-04-09cxl/core: Fix caching dport GPF DVSEC issueLi Ming
Per Table 8-2 in CXL r3.2 section 8.1.1 and CXL r3.2 section 8.1.6, only CXL Downstream switch ports and CXL root ports have GPF DVSEC for CXL Port(DVSEC ID 04h). CXL subsystem has a gpf_dvsec in struct cxl_port which is used to cache the offset of a GPF DVSEC in PCIe configuration space. It will be updated during the first EP attaching to the cxl_port, so the gpf_dvsec can only cache the GPF DVSEC offset of the dport which the first EP is under. Will not have chance to update it during other EPs attaching. That means CXL subsystem will use the same GPF DVSEC offset for all dports under the port, it will be a problem if the GPF DVSEC offset cached in cxl_port is not the right offset for a dport. Moving gpf_dvsec from struct cxl_port to struct cxl_dport, make every cxl dport has their own GPF DVSEC offset caching, and each cxl dport uses its own GPF DVSEC offset for GPF DVSEC accessing. Fixes: a52b6a2c1c99 ("cxl/pci: Support Global Persistent Flush (GPF)") Signed-off-by: Li Ming <ming.li@zohomail.com> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Davidlohr Bueso <dave@stgolabs.net> Link: https://patch.msgid.link/20250323093110.233040-2-ming.li@zohomail.com Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2025-04-09x86/ibt: Fix hibernatePeter Zijlstra
Todd reported, and Len confirmed, that commit 582077c94052 ("x86/cfi: Clean up linkage") broke S4 hiberate on a fair number of machines. Turns out these machines trip #CP when trying to restore the image. As it happens, the commit in question removes two ENDBR instructions in the hibernate code, and clearly got it wrong. Notably restore_image() does an indirect jump to relocated_restore_code(), which is a relocated copy of core_restore_code(). In turn, core_restore_code(), will at the end do an indirect jump to restore_jump_address (r8), which is pointing at a relocated restore_registers(). So both sites do indeed need to be ENDBR. Fixes: 582077c94052 ("x86/cfi: Clean up linkage") Reported-by: Todd Brandt <todd.e.brandt@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Todd Brandt <todd.e.brandt@intel.com> Tested-by: Len Brown <len.brown@intel.com> Link: https://bugzilla.kernel.org/show_bug.cgi?id=219998 Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219998
2025-04-09hrtimer: Add missing ACCESS_PRIVATE() for hrtimer::functionNam Cao
The "function" field of struct hrtimer has been changed to private, but two instances have not been converted to use ACCESS_PRIVATE(). Convert them to use ACCESS_PRIVATE(). Fixes: 04257da0c99c ("hrtimers: Make callback function pointer private") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250408103854.1851093-1-namcao@linutronix.de Closes: https://lore.kernel.org/oe-kbuild-all/202504071931.vOVl13tt-lkp@intel.com/ Closes: https://lore.kernel.org/oe-kbuild-all/202504072155.5UAZjYGU-lkp@intel.com/
2025-04-09cifs: Fix encoding of SMB1 Session Setup Kerberos Request in non-UNICODE modePali Rohár
Like in UNICODE mode, SMB1 Session Setup Kerberos Request contains oslm and domain strings. Extract common code into ascii_oslm_strings() and ascii_domain_string() functions (similar to unicode variants) and use these functions in non-UNICODE code path in sess_auth_kerberos(). Decision if non-UNICODE or UNICODE mode is used is based on the SMBFLG2_UNICODE flag in Flags2 packed field, and not based on the capabilities of server. Fix this check too. Signed-off-by: Pali Rohár <pali@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-04-09RDMA/core: Silence oversized kvmalloc() warningShay Drory
syzkaller triggered an oversized kvmalloc() warning. Silence it by adding __GFP_NOWARN. syzkaller log: WARNING: CPU: 7 PID: 518 at mm/util.c:665 __kvmalloc_node_noprof+0x175/0x180 CPU: 7 UID: 0 PID: 518 Comm: c_repro Not tainted 6.11.0-rc6+ #6 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 RIP: 0010:__kvmalloc_node_noprof+0x175/0x180 RSP: 0018:ffffc90001e67c10 EFLAGS: 00010246 RAX: 0000000000000100 RBX: 0000000000000400 RCX: ffffffff8149d46b RDX: 0000000000000000 RSI: ffff8881030fae80 RDI: 0000000000000002 RBP: 000000712c800000 R08: 0000000000000100 R09: 0000000000000000 R10: ffffc90001e67c10 R11: 0030ae0601000000 R12: 0000000000000000 R13: 0000000000000000 R14: 00000000ffffffff R15: 0000000000000000 FS: 00007fde79159740(0000) GS:ffff88813bdc0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000020000180 CR3: 0000000105eb4005 CR4: 00000000003706b0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> ib_umem_odp_get+0x1f6/0x390 mlx5_ib_reg_user_mr+0x1e8/0x450 ib_uverbs_reg_mr+0x28b/0x440 ib_uverbs_write+0x7d3/0xa30 vfs_write+0x1ac/0x6c0 ksys_write+0x134/0x170 ? __sanitizer_cov_trace_pc+0x1c/0x50 do_syscall_64+0x50/0x110 entry_SYSCALL_64_after_hwframe+0x76/0x7e Fixes: 37824952dc8f ("RDMA/odp: Use kvcalloc for the dma_list and page_list") Signed-off-by: Shay Drory <shayd@nvidia.com> Link: https://patch.msgid.link/c6cb92379de668be94894f49c2cfa40e73f94d56.1742388096.git.leonro@nvidia.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-04-09tracing: Do not add length to print format in synthetic eventsSteven Rostedt
The following causes a vsnprintf fault: # echo 's:wake_lat char[] wakee; u64 delta;' >> /sys/kernel/tracing/dynamic_events # echo 'hist:keys=pid:ts=common_timestamp.usecs if !(common_flags & 0x18)' > /sys/kernel/tracing/events/sched/sched_waking/trigger # echo 'hist:keys=next_pid:delta=common_timestamp.usecs-$ts:onmatch(sched.sched_waking).trace(wake_lat,next_comm,$delta)' > /sys/kernel/tracing/events/sched/sched_switch/trigger Because the synthetic event's "wakee" field is created as a dynamic string (even though the string copied is not). The print format to print the dynamic string changed from "%*s" to "%s" because another location (__set_synth_event_print_fmt()) exported this to user space, and user space did not need that. But it is still used in print_synth_event(), and the output looks like: <idle>-0 [001] d..5. 193.428167: wake_lat: wakee=(efault)sshd-sessiondelta=155 sshd-session-879 [001] d..5. 193.811080: wake_lat: wakee=(efault)kworker/u34:5delta=58 <idle>-0 [002] d..5. 193.811198: wake_lat: wakee=(efault)bashdelta=91 bash-880 [002] d..5. 193.811371: wake_lat: wakee=(efault)kworker/u35:2delta=21 <idle>-0 [001] d..5. 193.811516: wake_lat: wakee=(efault)sshd-sessiondelta=129 sshd-session-879 [001] d..5. 193.967576: wake_lat: wakee=(efault)kworker/u34:5delta=50 The length isn't needed as the string is always nul terminated. Just print the string and not add the length (which was hard coded to the max string length anyway). Cc: stable@vger.kernel.org Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Tom Zanussi <zanussi@kernel.org> Cc: Douglas Raillard <douglas.raillard@arm.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Link: https://lore.kernel.org/20250407154139.69955768@gandalf.local.home Fixes: 4d38328eb442d ("tracing: Fix synth event printk format for str fields"); Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-04-09dma/contiguous: avoid warning about unused size_bytesArnd Bergmann
When building with W=1, this variable is unused for configs with CONFIG_CMA_SIZE_SEL_PERCENTAGE=y: kernel/dma/contiguous.c:67:26: error: 'size_bytes' defined but not used [-Werror=unused-const-variable=] Change this to a macro to avoid the warning. Fixes: c64be2bb1c6e ("drivers: add Contiguous Memory Allocator") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/20250409151557.3890443-1-arnd@kernel.org
2025-04-09smb: client: fix UAF in decryption with multichannelPaulo Alcantara
After commit f7025d861694 ("smb: client: allocate crypto only for primary server") and commit b0abcd65ec54 ("smb: client: fix UAF in async decryption"), the channels started reusing AEAD TFM from primary channel to perform synchronous decryption, but that can't done as there could be multiple cifsd threads (one per channel) simultaneously accessing it to perform decryption. This fixes the following KASAN splat when running fstest generic/249 with 'vers=3.1.1,multichannel,max_channels=4,seal' against Windows Server 2022: BUG: KASAN: slab-use-after-free in gf128mul_4k_lle+0xba/0x110 Read of size 8 at addr ffff8881046c18a0 by task cifsd/986 CPU: 3 UID: 0 PID: 986 Comm: cifsd Not tainted 6.15.0-rc1 #1 PREEMPT(voluntary) Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-3.fc41 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x5d/0x80 print_report+0x156/0x528 ? gf128mul_4k_lle+0xba/0x110 ? __virt_addr_valid+0x145/0x300 ? __phys_addr+0x46/0x90 ? gf128mul_4k_lle+0xba/0x110 kasan_report+0xdf/0x1a0 ? gf128mul_4k_lle+0xba/0x110 gf128mul_4k_lle+0xba/0x110 ghash_update+0x189/0x210 shash_ahash_update+0x295/0x370 ? __pfx_shash_ahash_update+0x10/0x10 ? __pfx_shash_ahash_update+0x10/0x10 ? __pfx_extract_iter_to_sg+0x10/0x10 ? ___kmalloc_large_node+0x10e/0x180 ? __asan_memset+0x23/0x50 crypto_ahash_update+0x3c/0xc0 gcm_hash_assoc_remain_continue+0x93/0xc0 crypt_message+0xe09/0xec0 [cifs] ? __pfx_crypt_message+0x10/0x10 [cifs] ? _raw_spin_unlock+0x23/0x40 ? __pfx_cifs_readv_from_socket+0x10/0x10 [cifs] decrypt_raw_data+0x229/0x380 [cifs] ? __pfx_decrypt_raw_data+0x10/0x10 [cifs] ? __pfx_cifs_read_iter_from_socket+0x10/0x10 [cifs] smb3_receive_transform+0x837/0xc80 [cifs] ? __pfx_smb3_receive_transform+0x10/0x10 [cifs] ? __pfx___might_resched+0x10/0x10 ? __pfx_smb3_is_transform_hdr+0x10/0x10 [cifs] cifs_demultiplex_thread+0x692/0x1570 [cifs] ? __pfx_cifs_demultiplex_thread+0x10/0x10 [cifs] ? rcu_is_watching+0x20/0x50 ? rcu_lockdep_current_cpu_online+0x62/0xb0 ? find_held_lock+0x32/0x90 ? kvm_sched_clock_read+0x11/0x20 ? local_clock_noinstr+0xd/0xd0 ? trace_irq_enable.constprop.0+0xa8/0xe0 ? __pfx_cifs_demultiplex_thread+0x10/0x10 [cifs] kthread+0x1fe/0x380 ? kthread+0x10f/0x380 ? __pfx_kthread+0x10/0x10 ? local_clock_noinstr+0xd/0xd0 ? ret_from_fork+0x1b/0x60 ? local_clock+0x15/0x30 ? lock_release+0x29b/0x390 ? rcu_is_watching+0x20/0x50 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x31/0x60 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1a/0x30 </TASK> Tested-by: David Howells <dhowells@redhat.com> Reported-by: Steve French <stfrench@microsoft.com> Closes: https://lore.kernel.org/r/CAH2r5mu6Yc0-RJXM3kFyBYUB09XmXBrNodOiCVR4EDrmxq5Szg@mail.gmail.com Fixes: f7025d861694 ("smb: client: allocate crypto only for primary server") Fixes: b0abcd65ec54 ("smb: client: fix UAF in async decryption") Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-04-09x86/cpu: Avoid running off the end of an AMD erratum tableDave Hansen
The NULL array terminator at the end of erratum_1386_microcode was removed during the switch from x86_cpu_desc to x86_cpu_id. This causes readers to run off the end of the array. Replace the NULL. Fixes: f3f325152673 ("x86/cpu: Move AMD erratum 1386 table over to 'x86_cpu_id'") Reported-by: Jiri Slaby <jirislaby@kernel.org> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
2025-04-09erofs: fix encoded extents handlingGao Xiang
- The MSB 32 bits of `z_fragmentoff` are available only in extent records of size >= 8B. - Use round_down() to calculate `lstart` as well as increase `pos` correspondingly for extent records of size == 8B. Fixes: 1d191b4ca51d ("erofs: implement encoded extent metadata") Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20250408114448.4040220-2-hsiangkao@linux.alibaba.com
2025-04-09erofs: add __packed annotation to union(__le16..)Gao Xiang
I'm unsure why they aren't 2 bytes in size only in arm-linux-gnueabi. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/r/202504051202.DS7QIknJ-lkp@intel.com Fixes: 61ba89b57905 ("erofs: add 48-bit block addressing on-disk support") Fixes: efb2aef569b3 ("erofs: add encoded extent on-disk definition") Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20250408114448.4040220-1-hsiangkao@linux.alibaba.com
2025-04-09erofs: set error to bio if file-backed IO failsSheng Yong
If a file-backed IO fails before submitting the bio to the lower filesystem, an error is returned, but the bio->bi_status is not marked as an error. However, the error information should be passed to the end_io handler. Otherwise, the IO request will be treated as successful. Fixes: 283213718f5d ("erofs: support compressed inodes for fileio") Signed-off-by: Sheng Yong <shengyong1@xiaomi.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20250408122351.2104507-1-shengyong1@xiaomi.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2025-04-09drm/amdgpu/mes12: optimize MES pipe FW version fetchingAlex Deucher
Don't fetch it again if we already have it. It seems the registers don't reliably have the value at resume in some cases. Fixes: 785f0f9fe742 ("drm/amdgpu: Add mes v12_0 ip block support (v4)") Reviewed-by: Shaoyun.liu <Shaoyun.liu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 9e7b08d239c2f21e8f417854f81e5ff40edbebff) Cc: stable@vger.kernel.org # 6.12.x
2025-04-09drm/amd/pm/smu11: Prevent division by zeroDenis Arefev
The user can set any speed value. If speed is greater than UINT_MAX/8, division by zero is possible. Found by Linux Verification Center (linuxtesting.org) with SVACE. Fixes: 1e866f1fe528 ("drm/amd/pm: Prevent divide by zero") Signed-off-by: Denis Arefev <arefev@swemel.ru> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit da7dc714a8f8e1c9fc33c57cd63583779a3bef71) Cc: stable@vger.kernel.org
2025-04-09drm/amdgpu: cancel gfx idle work in device suspend for s0ixAlex Deucher
This is normally handled in the gfx IP suspend callbacks, but for S0ix, those are skipped because we don't want to touch gfx. So handle it in device suspend. Fixes: b9467983b774 ("drm/amdgpu: add dynamic workload profile switching for gfx10") Fixes: 963537ca2325 ("drm/amdgpu: add dynamic workload profile switching for gfx11") Fixes: 5f95a1549555 ("drm/amdgpu: add dynamic workload profile switching for gfx12") Reviewed-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 906ad451675155380c1dc1881a244ebde8e8df0a) Cc: stable@vger.kernel.org
2025-04-09drm/amd/display: pause the workload setting in dmKenneth Feng
Pause the workload setting in dm when doing idle optimization Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Kenneth Feng <kenneth.feng@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit b23f81c442ac33af0c808b4bb26333b881669bb7)