summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2025-05-28Merge tag 'bpf-next-6.16' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Pull bpf updates from Alexei Starovoitov: - Fix and improve BTF deduplication of identical BTF types (Alan Maguire and Andrii Nakryiko) - Support up to 12 arguments in BPF trampoline on arm64 (Xu Kuohai and Alexis Lothoré) - Support load-acquire and store-release instructions in BPF JIT on riscv64 (Andrea Parri) - Fix uninitialized values in BPF_{CORE,PROBE}_READ macros (Anton Protopopov) - Streamline allowed helpers across program types (Feng Yang) - Support atomic update for hashtab of BPF maps (Hou Tao) - Implement json output for BPF helpers (Ihor Solodrai) - Several s390 JIT fixes (Ilya Leoshkevich) - Various sockmap fixes (Jiayuan Chen) - Support mmap of vmlinux BTF data (Lorenz Bauer) - Support BPF rbtree traversal and list peeking (Martin KaFai Lau) - Tests for sockmap/sockhash redirection (Michal Luczaj) - Introduce kfuncs for memory reads into dynptrs (Mykyta Yatsenko) - Add support for dma-buf iterators in BPF (T.J. Mercier) - The verifier support for __bpf_trap() (Yonghong Song) * tag 'bpf-next-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (135 commits) bpf, arm64: Remove unused-but-set function and variable. selftests/bpf: Add tests with stack ptr register in conditional jmp bpf: Do not include stack ptr register in precision backtracking bookkeeping selftests/bpf: enable many-args tests for arm64 bpf, arm64: Support up to 12 function arguments bpf: Check rcu_read_lock_trace_held() in bpf_map_lookup_percpu_elem() bpf: Avoid __bpf_prog_ret0_warn when jit fails bpftool: Add support for custom BTF path in prog load/loadall selftests/bpf: Add unit tests with __bpf_trap() kfunc bpf: Warn with __bpf_trap() kfunc maybe due to uninitialized variable bpf: Remove special_kfunc_set from verifier selftests/bpf: Add test for open coded dmabuf_iter selftests/bpf: Add test for dmabuf_iter bpf: Add open coded dmabuf iterator bpf: Add dmabuf iterator dma-buf: Rename debugfs symbols bpf: Fix error return value in bpf_copy_from_user_dynptr libbpf: Use mmap to parse vmlinux BTF from sysfs selftests: bpf: Add a test for mmapable vmlinux BTF btf: Allow mmap of vmlinux btf ...
2025-05-28Merge tag 'net-next-6.16' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Paolo Abeni: "Core: - Implement the Device Memory TCP transmit path, allowing zero-copy data transmission on top of TCP from e.g. GPU memory to the wire. - Move all the IPv6 routing tables management outside the RTNL scope, under its own lock and RCU. The route control path is now 3x times faster. - Convert queue related netlink ops to instance lock, reducing again the scope of the RTNL lock. This improves the control plane scalability. - Refactor the software crc32c implementation, removing unneeded abstraction layers and improving significantly the related micro-benchmarks. - Optimize the GRO engine for UDP-tunneled traffic, for a 10% performance improvement in related stream tests. - Cover more per-CPU storage with local nested BH locking; this is a prep work to remove the current per-CPU lock in local_bh_disable() on PREMPT_RT. - Introduce and use nlmsg_payload helper, combining buffer bounds verification with accessing payload carried by netlink messages. Netfilter: - Rewrite the procfs conntrack table implementation, improving considerably the dump performance. A lot of user-space tools still use this interface. - Implement support for wildcard netdevice in netdev basechain and flowtables. - Integrate conntrack information into nft trace infrastructure. - Export set count and backend name to userspace, for better introspection. BPF: - BPF qdisc support: BPF-qdisc can be implemented with BPF struct_ops programs and can be controlled in similar way to traditional qdiscs using the "tc qdisc" command. - Refactor the UDP socket iterator, addressing long standing issues WRT duplicate hits or missed sockets. Protocols: - Improve TCP receive buffer auto-tuning and increase the default upper bound for the receive buffer; overall this improves the single flow maximum thoughput on 200Gbs link by over 60%. - Add AFS GSSAPI security class to AF_RXRPC; it provides transport security for connections to the AFS fileserver and VL server. - Improve TCP multipath routing, so that the sources address always matches the nexthop device. - Introduce SO_PASSRIGHTS for AF_UNIX, to allow disabling SCM_RIGHTS, and thus preventing DoS caused by passing around problematic FDs. - Retire DCCP socket. DCCP only receives updates for bugs, and major distros disable it by default. Its removal allows for better organisation of TCP fields to reduce the number of cache lines hit in the fast path. - Extend TCP drop-reason support to cover PAWS checks. Driver API: - Reorganize PTP ioctl flag support to require an explicit opt-in for the drivers, avoiding the problem of drivers not rejecting new unsupported flags. - Converted several device drivers to timestamping APIs. - Introduce per-PHY ethtool dump helpers, improving the support for dump operations targeting PHYs. Tests and tooling: - Add support for classic netlink in user space C codegen, so that ynl-c can now read, create and modify links, routes addresses and qdisc layer configuration. - Add ynl sub-types for binary attributes, allowing ynl-c to output known struct instead of raw binary data, clarifying the classic netlink output. - Extend MPTCP selftests to improve the code-coverage. - Add tests for XDP tail adjustment in AF_XDP. New hardware / drivers: - OpenVPN virtual driver: offload OpenVPN data channels processing to the kernel-space, increasing the data transfer throughput WRT the user-space implementation. - Renesas glue driver for the gigabit ethernet RZ/V2H(P) SoC. - Broadcom asp-v3.0 ethernet driver. - AMD Renoir ethernet device. - ReakTek MT9888 2.5G ethernet PHY driver. - Aeonsemi 10G C45 PHYs driver. Drivers: - Ethernet high-speed NICs: - nVidia/Mellanox (mlx5): - refactor the steering table handling to significantly reduce the amount of memory used - add support for complex matches in H/W flow steering - improve flow streeing error handling - convert to netdev instance locking - Intel (100G, ice, igb, ixgbe, idpf): - ice: add switchdev support for LLDP traffic over VF - ixgbe: add firmware manipulation and regions devlink support - igb: introduce support for frame transmission premption - igb: adds persistent NAPI configuration - idpf: introduce RDMA support - idpf: add initial PTP support - Meta (fbnic): - extend hardware stats coverage - add devlink dev flash support - Broadcom (bnxt): - add support for RX-side device memory TCP - Wangxun (txgbe): - implement support for udp tunnel offload - complete PTP and SRIOV support for AML 25G/10G devices - Ethernet NICs embedded and virtual: - Google (gve): - add device memory TCP TX support - Amazon (ena): - support persistent per-NAPI config - Airoha: - add H/W support for L2 traffic offload - add per flow stats for flow offloading - RealTek (rtl8211): add support for WoL magic packet - Synopsys (stmmac): - dwmac-socfpga 1000BaseX support - add Loongson-2K3000 support - introduce support for hardware-accelerated VLAN stripping - Broadcom (bcmgenet): - expose more H/W stats - Freescale (enetc, dpaa2-eth): - enetc: add MAC filter, VLAN filter RSS and loopback support - dpaa2-eth: convert to H/W timestamping APIs - vxlan: convert FDB table to rhashtable, for better scalabilty - veth: apply qdisc backpressure on full ring to reduce TX drops - Ethernet switches: - Microchip (kzZ88x3): add ETS scheduler support - Ethernet PHYs: - RealTek (rtl8211): - add support for WoL magic packet - add support for PHY LEDs - CAN: - Adds RZ/G3E CANFD support to the rcar_canfd driver. - Preparatory work for CAN-XL support. - Add self-tests framework with support for CAN physical interfaces. - WiFi: - mac80211: - scan improvements with multi-link operation (MLO) - Qualcomm (ath12k): - enable AHB support for IPQ5332 - add monitor interface support to QCN9274 - add multi-link operation support to WCN7850 - add 802.11d scan offload support to WCN7850 - monitor mode for WCN7850, better 6 GHz regulatory - Qualcomm (ath11k): - restore hibernation support - MediaTek (mt76): - WiFi-7 improvements - implement support for mt7990 - Intel (iwlwifi): - enhanced multi-link single-radio (EMLSR) support on 5 GHz links - rework device configuration - RealTek (rtw88): - improve throughput for RTL8814AU - RealTek (rtw89): - add multi-link operation support - STA/P2P concurrency improvements - support different SAR configs by antenna - Bluetooth: - introduce HCI Driver protocol - btintel_pcie: do not generate coredump for diagnostic events - btusb: add HCI Drv commands for configuring altsetting - btusb: add RTL8851BE device 0x0bda:0xb850 - btusb: add new VID/PID 13d3/3584 for MT7922 - btusb: add new VID/PID 13d3/3630 and 13d3/3613 for MT7925 - btnxpuart: implement host-wakeup feature" * tag 'net-next-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1611 commits) selftests/bpf: Fix bpf selftest build warning selftests: netfilter: Fix skip of wildcard interface test net: phy: mscc: Stop clearing the the UDPv4 checksum for L2 frames net: openvswitch: Fix the dead loop of MPLS parse calipso: Don't call calipso functions for AF_INET sk. selftests/tc-testing: Add a test for HFSC eltree double add with reentrant enqueue behaviour on netem net_sched: hfsc: Address reentrant enqueue adding class to eltree twice octeontx2-pf: QOS: Refactor TC_HTB_LEAF_DEL_LAST callback octeontx2-pf: QOS: Perform cache sync on send queue teardown net: mana: Add support for Multi Vports on Bare metal net: devmem: ncdevmem: remove unused variable net: devmem: ksft: upgrade rx test to send 1K data net: devmem: ksft: add 5 tuple FS support net: devmem: ksft: add exit_wait to make rx test pass net: devmem: ksft: add ipv4 support net: devmem: preserve sockc_err page_pool: fix ugly page_pool formatting net: devmem: move list_add to net_devmem_bind_dmabuf. selftests: netfilter: nft_queue.sh: include file transfer duration in log message net: phy: mscc: Fix memory leak when using one step timestamping ...
2025-05-28Merge tag 'nfsd-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linuxLinus Torvalds
Pull nfsd updates from Chuck Lever: "The marquee feature for this release is that the limit on the maximum rsize and wsize has been raised to 4MB. The default remains at 1MB, but risk-seeking administrators now have the ability to try larger I/O sizes with NFS clients that support them. Eventually the default setting will be increased when we have confidence that this change will not have negative impact. With v6.16, NFSD now has its own debugfs file system where we can add experimental features and make them available outside of our development community without impacting production deployments. The first experimental setting added is one that makes all NFS READ operations use vfs_iter_read() instead of the NFSD splice actor. The plan is to eventually retire the splice actor, as that will enable a number of new capabilities such as the use of struct bio_vec from the top to the bottom of the NFSD stack. Jeff Layton contributed a number of observability improvements. The use of dprintk() in a number of high-traffic code paths has been replaced with static trace points. This release sees the continuation of efforts to harden the NFSv4.2 COPY operation. Soon, the restriction on async COPY operations can be lifted. Many thanks to the contributors, reviewers, testers, and bug reporters who participated during the v6.16 development cycle" * tag 'nfsd-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (60 commits) xdrgen: Fix code generated for counted arrays SUNRPC: Bump the maximum payload size for the server NFSD: Add a "default" block size NFSD: Remove NFSSVC_MAXBLKSIZE_V2 macro NFSD: Remove NFSD_BUFSIZE sunrpc: Remove the RPCSVC_MAXPAGES macro svcrdma: Adjust the number of entries in svc_rdma_send_ctxt::sc_pages svcrdma: Adjust the number of entries in svc_rdma_recv_ctxt::rc_pages sunrpc: Adjust size of socket's receive page array dynamically SUNRPC: Remove svc_rqst :: rq_vec SUNRPC: Remove svc_fill_write_vector() NFSD: Use rqstp->rq_bvec in nfsd_iter_write() SUNRPC: Export xdr_buf_to_bvec() NFSD: De-duplicate the svc_fill_write_vector() call sites NFSD: Use rqstp->rq_bvec in nfsd_iter_read() sunrpc: Replace the rq_bvec array with dynamically-allocated memory sunrpc: Replace the rq_pages array with dynamically-allocated memory sunrpc: Remove backchannel check in svc_init_buffer() sunrpc: Add a helper to derive maxpages from sv_max_mesg svcrdma: Reduce the number of rdma_rw contexts per-QP ...
2025-05-28Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netPaolo Abeni
Merge in late fixes to prepare for the 6.16 net-next PR. No conflicts nor adjacent changes. Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-28net: openvswitch: Fix the dead loop of MPLS parseFaicker Mo
The unexpected MPLS packet may not end with the bottom label stack. When there are many stacks, The label count value has wrapped around. A dead loop occurs, soft lockup/CPU stuck finally. stack backtrace: UBSAN: array-index-out-of-bounds in /build/linux-0Pa0xK/linux-5.15.0/net/openvswitch/flow.c:662:26 index -1 is out of range for type '__be32 [3]' CPU: 34 PID: 0 Comm: swapper/34 Kdump: loaded Tainted: G OE 5.15.0-121-generic #131-Ubuntu Hardware name: Dell Inc. PowerEdge C6420/0JP9TF, BIOS 2.12.2 07/14/2021 Call Trace: <IRQ> show_stack+0x52/0x5c dump_stack_lvl+0x4a/0x63 dump_stack+0x10/0x16 ubsan_epilogue+0x9/0x36 __ubsan_handle_out_of_bounds.cold+0x44/0x49 key_extract_l3l4+0x82a/0x840 [openvswitch] ? kfree_skbmem+0x52/0xa0 key_extract+0x9c/0x2b0 [openvswitch] ovs_flow_key_extract+0x124/0x350 [openvswitch] ovs_vport_receive+0x61/0xd0 [openvswitch] ? kernel_init_free_pages.part.0+0x4a/0x70 ? get_page_from_freelist+0x353/0x540 netdev_port_receive+0xc4/0x180 [openvswitch] ? netdev_port_receive+0x180/0x180 [openvswitch] netdev_frame_hook+0x1f/0x40 [openvswitch] __netif_receive_skb_core.constprop.0+0x23a/0xf00 __netif_receive_skb_list_core+0xfa/0x240 netif_receive_skb_list_internal+0x18e/0x2a0 napi_complete_done+0x7a/0x1c0 bnxt_poll+0x155/0x1c0 [bnxt_en] __napi_poll+0x30/0x180 net_rx_action+0x126/0x280 ? bnxt_msix+0x67/0x80 [bnxt_en] handle_softirqs+0xda/0x2d0 irq_exit_rcu+0x96/0xc0 common_interrupt+0x8e/0xa0 </IRQ> Fixes: fbdcdd78da7c ("Change in Openvswitch to support MPLS label depth of 3 in ingress direction") Signed-off-by: Faicker Mo <faicker.mo@zenlayer.com> Acked-by: Ilya Maximets <i.maximets@ovn.org> Reviewed-by: Aaron Conole <aconole@redhat.com> Link: https://patch.msgid.link/259D3404-575D-4A6D-B263-1DF59A67CF89@zenlayer.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-28calipso: Don't call calipso functions for AF_INET sk.Kuniyuki Iwashima
syzkaller reported a null-ptr-deref in txopt_get(). [0] The offset 0x70 was of struct ipv6_txoptions in struct ipv6_pinfo, so struct ipv6_pinfo was NULL there. However, this never happens for IPv6 sockets as inet_sk(sk)->pinet6 is always set in inet6_create(), meaning the socket was not IPv6 one. The root cause is missing validation in netlbl_conn_setattr(). netlbl_conn_setattr() switches branches based on struct sockaddr.sa_family, which is passed from userspace. However, netlbl_conn_setattr() does not check if the address family matches the socket. The syzkaller must have called connect() for an IPv6 address on an IPv4 socket. We have a proper validation in tcp_v[46]_connect(), but security_socket_connect() is called in the earlier stage. Let's copy the validation to netlbl_conn_setattr(). [0]: Oops: general protection fault, probably for non-canonical address 0xdffffc000000000e: 0000 [#1] PREEMPT SMP KASAN NOPTI KASAN: null-ptr-deref in range [0x0000000000000070-0x0000000000000077] CPU: 2 UID: 0 PID: 12928 Comm: syz.9.1677 Not tainted 6.12.0 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014 RIP: 0010:txopt_get include/net/ipv6.h:390 [inline] RIP: 0010: Code: 02 00 00 49 8b ac 24 f8 02 00 00 e8 84 69 2a fd e8 ff 00 16 fd 48 8d 7d 70 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 53 02 00 00 48 8b 6d 70 48 85 ed 0f 84 ab 01 00 RSP: 0018:ffff88811b8afc48 EFLAGS: 00010212 RAX: dffffc0000000000 RBX: 1ffff11023715f8a RCX: ffffffff841ab00c RDX: 000000000000000e RSI: ffffc90007d9e000 RDI: 0000000000000070 RBP: 0000000000000000 R08: ffffed1023715f9d R09: ffffed1023715f9e R10: ffffed1023715f9d R11: 0000000000000003 R12: ffff888123075f00 R13: ffff88810245bd80 R14: ffff888113646780 R15: ffff888100578a80 FS: 00007f9019bd7640(0000) GS:ffff8882d2d00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f901b927bac CR3: 0000000104788003 CR4: 0000000000770ef0 PKRU: 80000000 Call Trace: <TASK> calipso_sock_setattr+0x56/0x80 net/netlabel/netlabel_calipso.c:557 netlbl_conn_setattr+0x10c/0x280 net/netlabel/netlabel_kapi.c:1177 selinux_netlbl_socket_connect_helper+0xd3/0x1b0 security/selinux/netlabel.c:569 selinux_netlbl_socket_connect_locked security/selinux/netlabel.c:597 [inline] selinux_netlbl_socket_connect+0xb6/0x100 security/selinux/netlabel.c:615 selinux_socket_connect+0x5f/0x80 security/selinux/hooks.c:4931 security_socket_connect+0x50/0xa0 security/security.c:4598 __sys_connect_file+0xa4/0x190 net/socket.c:2067 __sys_connect+0x12c/0x170 net/socket.c:2088 __do_sys_connect net/socket.c:2098 [inline] __se_sys_connect net/socket.c:2095 [inline] __x64_sys_connect+0x73/0xb0 net/socket.c:2095 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xaa/0x1b0 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f901b61a12d Code: 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007f9019bd6fa8 EFLAGS: 00000246 ORIG_RAX: 000000000000002a RAX: ffffffffffffffda RBX: 00007f901b925fa0 RCX: 00007f901b61a12d RDX: 000000000000001c RSI: 0000200000000140 RDI: 0000000000000003 RBP: 00007f901b701505 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 0000000000000000 R14: 00007f901b5b62a0 R15: 00007f9019bb7000 </TASK> Modules linked in: Fixes: ceba1832b1b2 ("calipso: Set the calipso socket label to match the secattr.") Reported-by: syzkaller <syzkaller@googlegroups.com> Reported-by: John Cheung <john.cs.hey@gmail.com> Closes: https://lore.kernel.org/netdev/CAP=Rh=M1LzunrcQB1fSGauMrJrhL6GGps5cPAKzHJXj6GQV+-g@mail.gmail.com/ Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Paul Moore <paul@paul-moore.com> Link: https://patch.msgid.link/20250522221858.91240-1-kuniyu@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-28net_sched: hfsc: Address reentrant enqueue adding class to eltree twicePedro Tammela
Savino says: "We are writing to report that this recent patch (141d34391abbb315d68556b7c67ad97885407547) [1] can be bypassed, and a UAF can still occur when HFSC is utilized with NETEM. The patch only checks the cl->cl_nactive field to determine whether it is the first insertion or not [2], but this field is only incremented by init_vf [3]. By using HFSC_RSC (which uses init_ed) [4], it is possible to bypass the check and insert the class twice in the eltree. Under normal conditions, this would lead to an infinite loop in hfsc_dequeue for the reasons we already explained in this report [5]. However, if TBF is added as root qdisc and it is configured with a very low rate, it can be utilized to prevent packets from being dequeued. This behavior can be exploited to perform subsequent insertions in the HFSC eltree and cause a UAF." To fix both the UAF and the infinite loop, with netem as an hfsc child, check explicitly in hfsc_enqueue whether the class is already in the eltree whenever the HFSC_RSC flag is set. [1] https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=141d34391abbb315d68556b7c67ad97885407547 [2] https://elixir.bootlin.com/linux/v6.15-rc5/source/net/sched/sch_hfsc.c#L1572 [3] https://elixir.bootlin.com/linux/v6.15-rc5/source/net/sched/sch_hfsc.c#L677 [4] https://elixir.bootlin.com/linux/v6.15-rc5/source/net/sched/sch_hfsc.c#L1574 [5] https://lore.kernel.org/netdev/8DuRWwfqjoRDLDmBMlIfbrsZg9Gx50DHJc1ilxsEBNe2D6NMoigR_eIRIG0LOjMc3r10nUUZtArXx4oZBIdUfZQrwjcQhdinnMis_0G7VEk=@willsroot.io/T/#u Fixes: 37d9cf1a3ce3 ("sched: Fix detection of empty queues in child qdiscs") Reported-by: Savino Dicanosa <savy@syst3mfailure.io> Reported-by: William Liu <will@willsroot.io> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Tested-by: Victor Nogueira <victor@mojatatu.com> Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> Link: https://patch.msgid.link/20250522181448.1439717-2-pctammela@mojatatu.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-27net: devmem: preserve sockc_errMina Almasry
Preserve the error code returned by sock_cmsg_send and return that on err. Signed-off-by: Mina Almasry <almasrymina@google.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250523230524.1107879-4-almasrymina@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-27page_pool: fix ugly page_pool formattingMina Almasry
Minor cleanup; this line is badly formatted. Signed-off-by: Mina Almasry <almasrymina@google.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250523230524.1107879-3-almasrymina@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-27net: devmem: move list_add to net_devmem_bind_dmabuf.Mina Almasry
It's annoying for the list_add to be outside net_devmem_bind_dmabuf, but the list_del is in net_devmem_unbind_dmabuf. Make it consistent by having both the list_add/del be inside the net_devmem_[un]bind_dmabuf. Cc: ap420073@gmail.com Signed-off-by: Mina Almasry <almasrymina@google.com> Tested-by: Taehee Yoo <ap420073@gmail.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250523230524.1107879-2-almasrymina@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-27sctp: mark sctp_do_peeloff staticChristoph Hellwig
sctp_do_peeloff is only used inside of net/sctp/socket.c, so mark it static. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Xin Long <lucien.xin@gmail.com> Link: https://patch.msgid.link/20250526054745.2329201-1-hch@lst.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-27net: Kconfig NET_DEVMEM selects GENERIC_ALLOCATORSaeed Mahameed
GENERIC_ALLOCATOR is a non-prompt kconfig, meaning users can't enable it selectively. All kconfig users of GENERIC_ALLOCATOR select it, except of NET_DEVMEM which only depends on it, there is no easy way to turn GENERIC_ALLOCATOR on unless we select other unnecessary configs that will select it. Instead of depending on it, select it when NET_DEVMEM is enabled. Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Mina Almasry <almasrymina@google.com> Link: https://patch.msgid.link/1747950086-1246773-2-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-27tipc: use kfree_sensitive() for aead cleanupZilin Guan
The tipc_aead_free() function currently uses kfree() to release the aead structure. However, this structure contains sensitive information, such as key's SALT value, which should be securely erased from memory to prevent potential leakage. To enhance security, replace kfree() with kfree_sensitive() when freeing the aead structure. This change ensures that sensitive data is explicitly cleared before memory deallocation, aligning with the approach used in tipc_aead_init() and adhering to best practices for handling confidential information. Signed-off-by: Zilin Guan <zilin@seu.edu.cn> Reviewed-by: Tung Nguyen <tung.quang.nguyen@est.tech> Link: https://patch.msgid.link/20250523114717.4021518-1-zilin@seu.edu.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-27Merge tag 'timers-cleanups-2025-05-25' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer cleanups from Thomas Gleixner: "Another set of timer API cleanups: - Convert init_timer*(), try_to_del_timer_sync() and destroy_timer_on_stack() over to the canonical timer_*() namespace convention. There is another large conversion pending, which has not been included because it would have caused a gazillion of merge conflicts in next. The conversion scripts will be run towards the end of the merge window and a pull request sent once all conflict dependencies have been merged" * tag 'timers-cleanups-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: treewide, timers: Rename destroy_timer_on_stack() as timer_destroy_on_stack() treewide, timers: Rename try_to_del_timer_sync() as timer_delete_sync_try() timers: Rename init_timers() as timers_init() timers: Rename NEXT_TIMER_MAX_DELTA as TIMER_NEXT_MAX_DELTA timers: Rename __init_timer_on_stack() as __timer_init_on_stack() timers: Rename __init_timer() as __timer_init() timers: Rename init_timer_on_stack_key() as timer_init_key_on_stack() timers: Rename init_timer_key() as timer_init_key()
2025-05-27xsk: add missing virtual address conversion for pageBui Quang Minh
In commit 7ead4405e06f ("xsk: convert xdp_copy_frags_from_zc() to use page_pool_dev_alloc()"), when converting from netmem to page, I missed a call to page_address() around skb_frag_page(frag) to get the virtual address of the page. This commit uses skb_frag_address() helper to fix the issue. Fixes: 7ead4405e06f ("xsk: convert xdp_copy_frags_from_zc() to use page_pool_dev_alloc()") Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com> Link: https://patch.msgid.link/20250522040115.5057-1-minhquangbui99@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-27af_packet: move notifier's packet_dev_mc out of rcu critical sectionStanislav Fomichev
Syzkaller reports the following issue: BUG: sleeping function called from invalid context at kernel/locking/mutex.c:578 __mutex_lock+0x106/0xe80 kernel/locking/mutex.c:746 team_change_rx_flags+0x38/0x220 drivers/net/team/team_core.c:1781 dev_change_rx_flags net/core/dev.c:9145 [inline] __dev_set_promiscuity+0x3f8/0x590 net/core/dev.c:9189 netif_set_promiscuity+0x50/0xe0 net/core/dev.c:9201 dev_set_promiscuity+0x126/0x260 net/core/dev_api.c:286 packet_dev_mc net/packet/af_packet.c:3698 [inline] packet_dev_mclist_delete net/packet/af_packet.c:3722 [inline] packet_notifier+0x292/0xa60 net/packet/af_packet.c:4247 notifier_call_chain+0x1b3/0x3e0 kernel/notifier.c:85 call_netdevice_notifiers_extack net/core/dev.c:2214 [inline] call_netdevice_notifiers net/core/dev.c:2228 [inline] unregister_netdevice_many_notify+0x15d8/0x2330 net/core/dev.c:11972 rtnl_delete_link net/core/rtnetlink.c:3522 [inline] rtnl_dellink+0x488/0x710 net/core/rtnetlink.c:3564 rtnetlink_rcv_msg+0x7cf/0xb70 net/core/rtnetlink.c:6955 netlink_rcv_skb+0x219/0x490 net/netlink/af_netlink.c:2534 Calling `PACKET_ADD_MEMBERSHIP` on an ops-locked device can trigger the `NETDEV_UNREGISTER` notifier, which may require disabling promiscuous and/or allmulti mode. Both of these operations require acquiring the netdev instance lock. Move the call to `packet_dev_mc` outside of the RCU critical section. The `mclist` modifications (add, del, flush, unregister) are protected by the RTNL, not the RCU. The RCU only protects the `sklist` and its associated `sks`. The delayed operation on the `mclist` entry remains within the RTNL. Reported-by: syzbot+b191b5ccad8d7a986286@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=b191b5ccad8d7a986286 Fixes: ad7c7b2172c3 ("net: hold netdev instance lock during sysfs operations") Signed-off-by: Stanislav Fomichev <stfomichev@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250522031129.3247266-1-stfomichev@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-27vsock: Move lingering logic to af_vsock coreMichal Luczaj
Lingering should be transport-independent in the long run. In preparation for supporting other transports, as well as the linger on shutdown(), move code to core. Generalize by querying vsock_transport::unsent_bytes(), guard against the callback being unimplemented. Do not pass sk_lingertime explicitly. Pull SOCK_LINGER check into vsock_linger(). Flatten the function. Remove the nested block by inverting the condition: return early on !timeout. Suggested-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Michal Luczaj <mhal@rbox.co> Link: https://patch.msgid.link/20250522-vsock-linger-v6-2-2ad00b0e447e@rbox.co Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-27vsock/virtio: Linger on unsent dataMichal Luczaj
Currently vsock's lingering effectively boils down to waiting (or timing out) until packets are consumed or dropped by the peer; be it by receiving the data, closing or shutting down the connection. To align with the semantics described in the SO_LINGER section of man socket(7) and to mimic AF_INET's behaviour more closely, change the logic of a lingering close(): instead of waiting for all data to be handled, block until data is considered sent from the vsock's transport point of view. That is until worker picks the packets for processing and decrements virtio_vsock_sock::bytes_unsent down to 0. Note that (some interpretation of) lingering was always limited to transports that called virtio_transport_wait_close() on transport release. This does not change, i.e. under Hyper-V and VMCI no lingering would be observed. The implementation does not adhere strictly to man page's interpretation of SO_LINGER: shutdown() will not trigger the lingering. This follows AF_INET. Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Michal Luczaj <mhal@rbox.co> Link: https://patch.msgid.link/20250522-vsock-linger-v6-1-2ad00b0e447e@rbox.co Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-27net: core: Convert dev_set_mac_address_user() to use struct sockaddr_storageKees Cook
Convert callers of dev_set_mac_address_user() to use struct sockaddr_storage. Add sanity checks on dev->addr_len usage. Signed-off-by: Kees Cook <kees@kernel.org> Acked-by: Gustavo A. R. Silva <gustavoars@kernel.org> Link: https://patch.msgid.link/20250521204619.2301870-8-kees@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-27rtnetlink: do_setlink: Use struct sockaddr_storageKees Cook
Instead of a heap allocating a variably sized struct sockaddr and lying about the type in the call to netif_set_mac_address(), use a stack allocated struct sockaddr_storage. This lets us drop the cast and avoid the allocation. Putting "ss" on the stack means it will get a reused stack slot since it is the same size (128B) as other existing single-scope stack variables, like the vfinfo array (128B), so no additional stack space is used by this function. Acked-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Kees Cook <kees@kernel.org> Link: https://patch.msgid.link/20250521204619.2301870-7-kees@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-27net: core: Convert dev_set_mac_address() to struct sockaddr_storageKees Cook
All users of dev_set_mac_address() are now using a struct sockaddr_storage. Convert the internal data type to struct sockaddr_storage, drop the casts, and update pointer types. Acked-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Kees Cook <kees@kernel.org> Link: https://patch.msgid.link/20250521204619.2301870-6-kees@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-27ieee802154: Use struct sockaddr_storage with dev_set_mac_address()Kees Cook
Switch to struct sockaddr_storage for calling dev_set_mac_address(). Add a temporary cast to struct sockaddr, which will be removed in a subsequent patch. Acked-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Kees Cook <kees@kernel.org> Link: https://patch.msgid.link/20250521204619.2301870-4-kees@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-27net/ncsi: Use struct sockaddr_storage for pending_macKees Cook
To avoid future casting with coming API type changes, switch struct ncsi_dev_priv::pending_mac to a full struct sockaddr_storage. Acked-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Kees Cook <kees@kernel.org> Link: https://patch.msgid.link/20250521204619.2301870-3-kees@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-27net: core: Switch netif_set_mac_address() to struct sockaddr_storageKees Cook
In order to avoid passing around struct sockaddr that has a size the compiler cannot reason about (nor track at runtime), convert netif_set_mac_address() to take struct sockaddr_storage. This is just a cast conversion, so there is are no binary changes. Following patches will make actual allocation changes. Acked-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Kees Cook <kees@kernel.org> Link: https://patch.msgid.link/20250521204619.2301870-2-kees@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-27net: core: Convert inet_addr_is_any() to sockaddr_storageKees Cook
All the callers of inet_addr_is_any() have a sockaddr_storage-backed sockaddr. Avoid casts and switch prototype to the actual object being used. Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> # SCSI Signed-off-by: Kees Cook <kees@kernel.org> Link: https://patch.msgid.link/20250521204619.2301870-1-kees@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-26replace strncpy with strscpy_padBaris Can Goral
The strncpy() function is actively dangerous to use since it may not NULL-terminate the destination string, resulting in potential memory content exposures, unbounded reads, or crashes. Link: https://github.com/KSPP/linux/issues/90 In addition, strscpy_pad is more appropriate because it also zero-fills any remaining space in the destination if the source is shorter than the provided buffer size. Signed-off-by: Baris Can Goral <goralbaris@gmail.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Link: https://patch.msgid.link/20250521161036.14489-1-goralbaris@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-26Merge tag 'vfs-6.16-rc1.coredump' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull coredump updates from Christian Brauner: "This adds support for sending coredumps over an AF_UNIX socket. It also makes (implicit) use of the new SO_PEERPIDFD ability to hand out pidfds for reaped peer tasks The new coredump socket will allow userspace to not have to rely on usermode helpers for processing coredumps and provides a saf way to handle them instead of relying on super privileged coredumping helpers This will also be significantly more lightweight since the kernel doens't have to do a fork()+exec() for each crashing process to spawn a usermodehelper. Instead the kernel just connects to the AF_UNIX socket and userspace can process it concurrently however it sees fit. Support for userspace is incoming starting with systemd-coredump There's more work coming in that direction next cycle. The rest below goes into some details and background Coredumping currently supports two modes: (1) Dumping directly into a file somewhere on the filesystem. (2) Dumping into a pipe connected to a usermode helper process spawned as a child of the system_unbound_wq or kthreadd For simplicity I'm mostly ignoring (1). There's probably still some users of (1) out there but processing coredumps in this way can be considered adventurous especially in the face of set*id binaries The most common option should be (2) by now. It works by allowing userspace to put a string into /proc/sys/kernel/core_pattern like: |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h The "|" at the beginning indicates to the kernel that a pipe must be used. The path following the pipe indicator is a path to a binary that will be spawned as a usermode helper process. Any additional parameters pass information about the task that is generating the coredump to the binary that processes the coredump In the example the core_pattern shown causes the kernel to spawn systemd-coredump as a usermode helper. There's various conceptual consequences of this (non-exhaustive list): - systemd-coredump is spawned with file descriptor number 0 (stdin) connected to the read-end of the pipe. All other file descriptors are closed. That specifically includes 1 (stdout) and 2 (stderr). This has already caused bugs because userspace assumed that this cannot happen (Whether or not this is a sane assumption is irrelevant) - systemd-coredump will be spawned as a child of system_unbound_wq. So it is not a child of any userspace process and specifically not a child of PID 1. It cannot be waited upon and is in a weird hybrid upcall which are difficult for userspace to control correctly - systemd-coredump is spawned with full kernel privileges. This necessitates all kinds of weird privilege dropping excercises in userspace to make this safe - A new usermode helper has to be spawned for each crashing process This adds a new mode: (3) Dumping into an AF_UNIX socket Userspace can set /proc/sys/kernel/core_pattern to: @/path/to/coredump.socket The "@" at the beginning indicates to the kernel that an AF_UNIX coredump socket will be used to process coredumps The coredump socket must be located in the initial mount namespace. When a task coredumps it opens a client socket in the initial network namespace and connects to the coredump socket: - The coredump server uses SO_PEERPIDFD to get a stable handle on the connected crashing task. The retrieved pidfd will provide a stable reference even if the crashing task gets SIGKILLed while generating the coredump. That is a huge attack vector right now - By setting core_pipe_limit non-zero userspace can guarantee that the crashing task cannot be reaped behind it's back and thus process all necessary information in /proc/<pid>. The SO_PEERPIDFD can be used to detect whether /proc/<pid> still refers to the same process The core_pipe_limit isn't used to rate-limit connections to the socket. This can simply be done via AF_UNIX socket directly - The pidfd for the crashing task will contain information how the task coredumps. The PIDFD_GET_INFO ioctl gained a new flag PIDFD_INFO_COREDUMP which can be used to retreive the coredump information If the coredump gets a new coredump client connection the kernel guarantees that PIDFD_INFO_COREDUMP information is available. Currently the following information is provided in the new @coredump_mask extension to struct pidfd_info: * PIDFD_COREDUMPED is raised if the task did actually coredump * PIDFD_COREDUMP_SKIP is raised if the task skipped coredumping (e.g., undumpable) * PIDFD_COREDUMP_USER is raised if this is a regular coredump and doesn't need special care by the coredump server * PIDFD_COREDUMP_ROOT is raised if the generated coredump should be treated as sensitive and the coredump server should restrict access to the generated coredump to sufficiently privileged users" * tag 'vfs-6.16-rc1.coredump' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: mips, net: ensure that SOCK_COREDUMP is defined selftests/coredump: add tests for AF_UNIX coredumps selftests/pidfd: add PIDFD_INFO_COREDUMP infrastructure coredump: validate socket name as it is written coredump: show supported coredump modes pidfs, coredump: add PIDFD_INFO_COREDUMP coredump: add coredump socket coredump: reflow dump helpers a little coredump: massage do_coredump() coredump: massage format_corename()
2025-05-26Merge tag 'vfs-6.16-rc1.pidfs' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull pidfs updates from Christian Brauner: "Features: - Allow handing out pidfds for reaped tasks for AF_UNIX SO_PEERPIDFD socket option SO_PEERPIDFD is a socket option that allows to retrieve a pidfd for the process that called connect() or listen(). This is heavily used to safely authenticate clients in userspace avoiding security bugs due to pid recycling races (dbus, polkit, systemd, etc.) SO_PEERPIDFD currently doesn't support handing out pidfds if the sk->sk_peer_pid thread-group leader has already been reaped. In this case it currently returns EINVAL. Userspace still wants to get a pidfd for a reaped process to have a stable handle it can pass on. This is especially useful now that it is possible to retrieve exit information through a pidfd via the PIDFD_GET_INFO ioctl()'s PIDFD_INFO_EXIT flag Another summary has been provided by David Rheinsberg: > A pidfd can outlive the task it refers to, and thus user-space > must already be prepared that the task underlying a pidfd is > gone at the time they get their hands on the pidfd. For > instance, resolving the pidfd to a PID via the fdinfo must be > prepared to read `-1`. > > Despite user-space knowing that a pidfd might be stale, several > kernel APIs currently add another layer that checks for this. In > particular, SO_PEERPIDFD returns `EINVAL` if the peer-task was > already reaped, but returns a stale pidfd if the task is reaped > immediately after the respective alive-check. > > This has the unfortunate effect that user-space now has two ways > to check for the exact same scenario: A syscall might return > EINVAL/ESRCH/... *or* the pidfd might be stale, even though > there is no particular reason to distinguish both cases. This > also propagates through user-space APIs, which pass on pidfds. > They must be prepared to pass on `-1` *or* the pidfd, because > there is no guaranteed way to get a stale pidfd from the kernel. > > Userspace must already deal with a pidfd referring to a reaped > task as the task may exit and get reaped at any time will there > are still many pidfds referring to it In order to allow handing out reaped pidfd SO_PEERPIDFD needs to ensure that PIDFD_INFO_EXIT information is available whenever a pidfd for a reaped task is created by PIDFD_INFO_EXIT. The uapi promises that reaped pidfds are only handed out if it is guaranteed that the caller sees the exit information: TEST_F(pidfd_info, success_reaped) { struct pidfd_info info = { .mask = PIDFD_INFO_CGROUPID | PIDFD_INFO_EXIT, }; /* * Process has already been reaped and PIDFD_INFO_EXIT been set. * Verify that we can retrieve the exit status of the process. */ ASSERT_EQ(ioctl(self->child_pidfd4, PIDFD_GET_INFO, &info), 0); ASSERT_FALSE(!!(info.mask & PIDFD_INFO_CREDS)); ASSERT_TRUE(!!(info.mask & PIDFD_INFO_EXIT)); ASSERT_TRUE(WIFEXITED(info.exit_code)); ASSERT_EQ(WEXITSTATUS(info.exit_code), 0); } To hand out pidfds for reaped processes we thus allocate a pidfs entry for the relevant sk->sk_peer_pid at the time the sk->sk_peer_pid is stashed and drop it when the socket is destroyed. This guarantees that exit information will always be recorded for the sk->sk_peer_pid task and we can hand out pidfds for reaped processes - Hand a pidfd to the coredump usermode helper process Give userspace a way to instruct the kernel to install a pidfd for the crashing process into the process started as a usermode helper. There's still tricky race-windows that cannot be easily or sometimes not closed at all by userspace. There's various ways like looking at the start time of a process to make sure that the usermode helper process is started after the crashing process but it's all very very brittle and fraught with peril The crashed-but-not-reaped process can be killed by userspace before coredump processing programs like systemd-coredump have had time to manually open a PIDFD from the PID the kernel provides them, which means they can be tricked into reading from an arbitrary process, and they run with full privileges as they are usermode helper processes Even if that specific race-window wouldn't exist it's still the safest and cleanest way to let the kernel provide the pidfd directly instead of requiring userspace to do it manually. In parallel with this commit we already have systemd adding support for this in [1] When the usermode helper process is forked we install a pidfd file descriptor three into the usermode helper's file descriptor table so it's available to the exec'd program Since usermode helpers are either children of the system_unbound_wq workqueue or kthreadd we know that the file descriptor table is empty and can thus always use three as the file descriptor number Note, that we'll install a pidfd for the thread-group leader even if a subthread is calling do_coredump(). We know that task linkage hasn't been removed yet and even if this @current isn't the actual thread-group leader we know that the thread-group leader cannot be reaped until @current has exited - Allow telling when a task has not been found from finding the wrong task when creating a pidfd We currently report EINVAL whenever a struct pid has no tasked attached anymore thereby conflating two concepts: (1) The task has already been reaped (2) The caller requested a pidfd for a thread-group leader but the pid actually references a struct pid that isn't used as a thread-group leader This is causing issues for non-threaded workloads as in where they expect ESRCH to be reported, not EINVAL So allow userspace to reliably distinguish between (1) and (2) - Make it possible to detect when a pidfs entry would outlive the struct pid it pinned - Add a range of new selftests Cleanups: - Remove unneeded NULL check from pidfd_prepare() for passed struct pid - Avoid pointless reference count bump during release_task() Fixes: - Various fixes to the pidfd and coredump selftests - Fix error handling for replace_fd() when spawning coredump usermode helper" * tag 'vfs-6.16-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: pidfs: detect refcount bugs coredump: hand a pidfd to the usermode coredump helper coredump: fix error handling for replace_fd() pidfs: move O_RDWR into pidfs_alloc_file() selftests: coredump: Raise timeout to 2 minutes selftests: coredump: Fix test failure for slow machines selftests: coredump: Properly initialize pointer net, pidfs: enable handing out pidfds for reaped sk->sk_peer_pid pidfs: get rid of __pidfd_prepare() net, pidfs: prepare for handing out pidfds for reaped sk->sk_peer_pid pidfs: register pid in pidfs net, pidfd: report EINVAL for ESRCH release_task: kill the no longer needed get/put_pid(thread_pid) pidfs: ensure consistent ENOENT/ESRCH reporting exit: move wake_up_all() pidfd waiters into __unhash_process() selftest/pidfd: add test for thread-group leader pidfd open for thread pidfd: improve uapi when task isn't found pidfd: remove unneeded NULL check from pidfd_prepare() selftests/pidfd: adapt to recent changes
2025-05-26vsock/virtio: fix `rx_bytes` accounting for stream socketsStefano Garzarella
In `struct virtio_vsock_sock`, we maintain two counters: - `rx_bytes`: used internally to track how many bytes have been read. This supports mechanisms like .stream_has_data() and sock_rcvlowat(). - `fwd_cnt`: used for the credit mechanism to inform available receive buffer space to the remote peer. These counters are updated via virtio_transport_inc_rx_pkt() and virtio_transport_dec_rx_pkt(). Since the beginning with commit 06a8fc78367d ("VSOCK: Introduce virtio_vsock_common.ko"), we call virtio_transport_dec_rx_pkt() in virtio_transport_stream_do_dequeue() only when we consume the entire packet, so partial reads, do not update `rx_bytes` and `fwd_cnt`. This is fine for `fwd_cnt`, because we still have space used for the entire packet, and we don't want to update the credit for the other peer until we free the space of the entire packet. However, this causes `rx_bytes` to be stale on partial reads. Previously, this didn’t cause issues because `rx_bytes` was used only by .stream_has_data(), and any unread portion of a packet implied data was still available. However, since commit 93b808876682 ("virtio/vsock: fix logic which reduces credit update messages"), we now rely on `rx_bytes` to determine if a credit update should be sent when the data in the RX queue drops below SO_RCVLOWAT value. This patch fixes the accounting by updating `rx_bytes` with the number of bytes actually read, even on partial reads, while leaving `fwd_cnt` untouched until the packet is fully consumed. Also introduce a new `buf_used` counter to check that the remote peer is honoring the given credit; this was previously done via `rx_bytes`. Fixes: 93b808876682 ("virtio/vsock: fix logic which reduces credit update messages") Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://patch.msgid.link/20250521121705.196379-1-sgarzare@redhat.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-26Merge tag 'nf-next-25-05-23' of ↵Paolo Abeni
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following batch contains Netfilter updates for net-next, specifically 26 patches: 5 patches adding/updating selftests, 4 fixes, 3 PREEMPT_RT fixes, and 14 patches to enhance nf_tables): 1) Improve selftest coverage for pipapo 4 bit group format, from Florian Westphal. 2) Fix incorrect dependencies when compiling a kernel without legacy ip{6}tables support, also from Florian. 3) Two patches to fix nft_fib vrf issues, including selftest updates to improve coverage, also from Florian Westphal. 4) Fix incorrect nesting in nft_tunnel's GENEVE support, from Fernando F. Mancera. 5) Three patches to fix PREEMPT_RT issues with nf_dup infrastructure and nft_inner to match in inner headers, from Sebastian Andrzej Siewior. 6) Integrate conntrack information into nft trace infrastructure, from Florian Westphal. 7) A series of 13 patches to allow to specify wildcard netdevice in netdev basechain and flowtables, eg. table netdev filter { chain ingress { type filter hook ingress devices = { eth0, eth1, vlan* } priority 0; policy accept; } } This also allows for runtime hook registration on NETDEV_{UN}REGISTER event, from Phil Sutter. netfilter pull request 25-05-23 * tag 'nf-next-25-05-23' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next: (26 commits) selftests: netfilter: Torture nftables netdev hooks netfilter: nf_tables: Add notifications for hook changes netfilter: nf_tables: Support wildcard netdev hook specs netfilter: nf_tables: Sort labels in nft_netdev_hook_alloc() netfilter: nf_tables: Handle NETDEV_CHANGENAME events netfilter: nf_tables: Wrap netdev notifiers netfilter: nf_tables: Respect NETDEV_REGISTER events netfilter: nf_tables: Prepare for handling NETDEV_REGISTER events netfilter: nf_tables: Have a list of nf_hook_ops in nft_hook netfilter: nf_tables: Pass nf_hook_ops to nft_unregister_flowtable_hook() netfilter: nf_tables: Introduce nft_register_flowtable_ops() netfilter: nf_tables: Introduce nft_hook_find_ops{,_rcu}() netfilter: nf_tables: Introduce functions freeing nft_hook objects netfilter: nf_tables: add packets conntrack state to debug trace info netfilter: conntrack: make nf_conntrack_id callable without a module dependency netfilter: nf_dup_netdev: Move the recursion counter struct netdev_xmit netfilter: nft_inner: Use nested-BH locking for nft_pcpu_tun_ctx netfilter: nf_dup{4, 6}: Move duplication check to task_struct netfilter: nft_tunnel: fix geneve_opt dump selftests: netfilter: nft_fib.sh: add type and oif tests with and without VRFs ... ==================== Link: https://patch.msgid.link/20250523132712.458507-1-pablo@netfilter.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-26Merge tag 'ipsec-next-2025-05-23' of ↵Paolo Abeni
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Steffen Klassert says: ==================== 1) Remove some unnecessary strscpy_pad() size arguments. From Thorsten Blum. 2) Correct use of xso.real_dev on bonding offloads. Patchset from Cosmin Ratiu. 3) Add hardware offload configuration to XFRM_MSG_MIGRATE. From Chiachang Wang. 4) Refactor migration setup during cloning. This was done after the clone was created. Now it is done in the cloning function itself. From Chiachang Wang. 5) Validate assignment of maximal possible SEQ number. Prevent from setting to the maximum sequrnce number as this would cause for traffic drop. From Leon Romanovsky. 6) Prevent configuration of interface index when offload is used. Hardware can't handle this case.i From Leon Romanovsky. 7) Always use kfree_sensitive() for SA secret zeroization. From Zilin Guan. ipsec-next-2025-05-23 * tag 'ipsec-next-2025-05-23' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next: xfrm: use kfree_sensitive() for SA secret zeroization xfrm: prevent configuration of interface index when offload is used xfrm: validate assignment of maximal possible SEQ number xfrm: Refactor migration setup during the cloning process xfrm: Migrate offload configuration bonding: Fix multiple long standing offload races bonding: Mark active offloaded xfrm_states xfrm: Add explicit dev to .xdo_dev_state_{add,delete,free} xfrm: Remove unneeded device check from validate_xmit_xfrm xfrm: Use xdo.dev instead of xdo.real_dev net/mlx5: Avoid using xso.real_dev unnecessarily xfrm: Remove unnecessary strscpy_pad() size arguments ==================== Link: https://patch.msgid.link/20250523075611.3723340-1-steffen.klassert@secunet.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-26net: mctp: use nlmsg_payload() for netlink message data extractionJeremy Kerr
Jakub suggests: > I have a different request :) Matt, once this ends up in net-next > (end of this week) could you refactor it to use nlmsg_payload() ? > It doesn't exist in net but this is exactly why it was added. This refactors the additions to both mctp_dump_addrinfo(), and mctp_rtm_getneigh() - two cases where we're calling nlh_data() on an an incoming netlink message, without a prior nlmsg_parse(). For the neigh.c case, we cannot hit the failure where the nlh does not contain a full ndmsg at present, as the core handler (net/core/neighbour.c, neigh_get()) has already validated the size through neigh_valid_req_get(), and would have failed the get operation before the MCTP hander is called. However, relying on that is a bit fragile, so apply the nlmsg_payload refector here too. Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au> Link: https://patch.msgid.link/20250521-mctp-nlmsg-payload-v2-1-e85df160c405@codeconstruct.com.au Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-26Merge tag 'vfs-6.16-rc1.async.dir' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs directory lookup updates from Christian Brauner: "This contains cleanups for the lookup_one*() family of helpers. We expose a set of functions with names containing "lookup_one_len" and others without the "_len". This difference has nothing to do with "len". It's rater a historical accident that can be confusing. The functions without "_len" take a "mnt_idmap" pointer. This is found in the "vfsmount" and that is an important question when choosing which to use: do you have a vfsmount, or are you "inside" the filesystem. A related question is "is permission checking relevant here?". nfsd and cachefiles *do* have a vfsmount but *don't* use the non-_len functions. They pass nop_mnt_idmap and refuse to work on filesystems which have any other idmap. This work changes nfsd and cachefile to use the lookup_one family of functions and to explictily pass &nop_mnt_idmap which is consistent with all other vfs interfaces used where &nop_mnt_idmap is explicitly passed. The remaining uses of the "_one" functions do not require permission checks so these are renamed to be "_noperm" and the permission checking is removed. This series also changes these lookup function to take a qstr instead of separate name and len. In many cases this simplifies the call" * tag 'vfs-6.16-rc1.async.dir' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: VFS: change lookup_one_common and lookup_noperm_common to take a qstr Use try_lookup_noperm() instead of d_hash_and_lookup() outside of VFS VFS: rename lookup_one_len family to lookup_noperm and remove permission check cachefiles: Use lookup_one() rather than lookup_one_len() nfsd: Use lookup_one() rather than lookup_one_len() VFS: improve interface for lookup_one functions
2025-05-26net: neigh: use kfree_skb_reason() in neigh_resolve_output() and ↵Qiu Yutan
neigh_connected_output() Replace kfree_skb() used in neigh_resolve_output() and neigh_connected_output() with kfree_skb_reason(). Following new skb drop reason is added: /* failed to fill the device hard header */ SKB_DROP_REASON_NEIGH_HH_FILLFAIL Signed-off-by: Qiu Yutan <qiu.yutan@zte.com.cn> Signed-off-by: Jiang Kun <jiang.kun2@zte.com.cn> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Xu Xin <xu.xin16@zte.com.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-05-26net: devmem: support single IOV with sendmsgStanislav Fomichev
sendmsg() with a single iov becomes ITER_UBUF, sendmsg() with multiple iovs becomes ITER_IOVEC. iter_iov_len does not return correct value for UBUF, so teach to treat UBUF differently. Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Pavel Begunkov <asml.silence@gmail.com> Cc: Mina Almasry <almasrymina@google.com> Fixes: bd61848900bf ("net: devmem: Implement TX path") Signed-off-by: Stanislav Fomichev <stfomichev@gmail.com> Acked-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-05-23netfilter: nf_tables: Add notifications for hook changesPhil Sutter
Notify user space if netdev hooks are updated due to netdev add/remove events. Send minimal notification messages by introducing NFT_MSG_NEWDEV/DELDEV message types describing a single device only. Upon NETDEV_CHANGENAME, the callback has no information about the interface's old name. To provide a clear message to user space, include the hook's stored interface name in the notification. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23netfilter: nf_tables: Support wildcard netdev hook specsPhil Sutter
User space may pass non-nul-terminated NFTA_DEVICE_NAME attribute values to indicate a suffix wildcard. Expect for multiple devices to match the given prefix in nft_netdev_hook_alloc() and populate 'ops_list' with them all. When checking for duplicate hooks, compare the shortest prefix so a device may never match more than a single hook spec. Finally respect the stored prefix length when hooking into new devices from event handlers. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23netfilter: nf_tables: Sort labels in nft_netdev_hook_alloc()Phil Sutter
No point in having err_hook_alloc, just call return directly. Also rename err_hook_dev - it's not about the hook's device but freeing the hook itself. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23netfilter: nf_tables: Handle NETDEV_CHANGENAME eventsPhil Sutter
For the sake of simplicity, treat them like consecutive NETDEV_REGISTER and NETDEV_UNREGISTER events. If the new name matches a hook spec and registration fails, escalate the error and keep things as they are. To avoid unregistering the newly registered hook again during the following fake NETDEV_UNREGISTER event, leave hooks alone if their interface spec matches the new name. Note how this patch also skips for NETDEV_REGISTER if the device is already registered. This is not yet possible as the new name would have to match the old one. This will change with wildcard interface specs, though. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23netfilter: nf_tables: Wrap netdev notifiersPhil Sutter
Handling NETDEV_CHANGENAME events has to traverse all chains/flowtables twice, prepare for this. No functional change intended. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23netfilter: nf_tables: Respect NETDEV_REGISTER eventsPhil Sutter
Hook into new devices if their name matches the hook spec. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23netfilter: nf_tables: Prepare for handling NETDEV_REGISTER eventsPhil Sutter
Put NETDEV_UNREGISTER handling code into a switch, no functional change intended as the function is only called for that event yet. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23netfilter: nf_tables: Have a list of nf_hook_ops in nft_hookPhil Sutter
Supporting a 1:n relationship between nft_hook and nf_hook_ops is convenient since a chain's or flowtable's nft_hooks may remain in place despite matching interfaces disappearing. This stabilizes ruleset dumps in that regard and opens the possibility to claim newly added interfaces which match the spec. Also it prepares for wildcard interface specs since these will potentially match multiple interfaces. All spots dealing with hook registration are updated to handle a list of multiple nf_hook_ops, but nft_netdev_hook_alloc() only adds a single item for now to retain the old behaviour. The only expected functional change here is how vanishing interfaces are handled: Instead of dropping the respective nft_hook, only the matching nf_hook_ops are dropped. To safely remove individual ops from the list in netdev handlers, an rcu_head is added to struct nf_hook_ops so kfree_rcu() may be used. There is at least nft_flowtable_find_dev() which may be iterating through the list at the same time. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23netfilter: nf_tables: Pass nf_hook_ops to nft_unregister_flowtable_hook()Phil Sutter
The function accesses only the hook's ops field, pass it directly. This prepares for nft_hooks holding a list of nf_hook_ops in future. While at it, make use of the function in __nft_unregister_flowtable_net_hooks() as well. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23netfilter: nf_tables: Introduce nft_register_flowtable_ops()Phil Sutter
Facilitate binding and registering of a flowtable hook via a single function call. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23netfilter: nf_tables: Introduce nft_hook_find_ops{,_rcu}()Phil Sutter
Also a pretty dull wrapper around the hook->ops.dev comparison for now. Will search the embedded nf_hook_ops list in future. The ugly cast to eliminate the const qualifier will vanish then, too. Since this future list will be RCU-protected, also introduce an _rcu() variant here. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23netfilter: nf_tables: Introduce functions freeing nft_hook objectsPhil Sutter
Pointless wrappers around kfree() for now, prep work for an embedded list of nf_hook_ops. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23netfilter: nf_tables: add packets conntrack state to debug trace infoFlorian Westphal
Add the minimal relevant info needed for userspace ("nftables monitor trace") to provide the conntrack view of the packet: - state (new, related, established) - direction (original, reply) - status (e.g., if connection is subject to dnat) - id (allows to query ctnetlink for remaining conntrack state info) Example: trace id a62 inet filter PRE_RAW packet: iif "enp0s3" ether [..] [..] trace id a62 inet filter PRE_MANGLE conntrack: ct direction original ct state new ct id 32 trace id a62 inet filter PRE_MANGLE packet: [..] [..] trace id a62 inet filter IN conntrack: ct direction original ct state new ct status dnat-done ct id 32 [..] In this case one can see that while NAT is active, the new connection isn't subject to a translation. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23netfilter: conntrack: make nf_conntrack_id callable without a module dependencyFlorian Westphal
While nf_conntrack_id() doesn't need any functionaliy from conntrack, it does reside in nf_conntrack_core.c -- callers add a module dependency on conntrack. Followup patch will need to compute the conntrack id from nf_tables_trace.c to include it in nf_trace messages emitted to userspace via netlink. I don't want to introduce a module dependency between nf_tables and conntrack for this. Since trace is slowpath, the added indirection is ok. One alternative is to move nf_conntrack_id to the netfilter/core.c, but I don't see a compelling reason so far. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23netfilter: nf_dup_netdev: Move the recursion counter struct netdev_xmitSebastian Andrzej Siewior
nf_dup_skb_recursion is a per-CPU variable and relies on disabled BH for its locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT this data structure requires explicit locking. Move nf_dup_skb_recursion to struct netdev_xmit, provide wrappers. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>