Age | Commit message (Collapse) | Author |
|
The seqcount xt_recseq is used to synchronize the replacement of
xt_table::private in xt_replace_table() against all readers such as
ipt_do_table()
To ensure that there is only one writer, the writing side disables
bottom halves. The sequence counter can be acquired recursively. Only the
first invocation modifies the sequence counter (signaling that a writer
is in progress) while the following (recursive) writer does not modify
the counter.
The lack of a proper locking mechanism for the sequence counter can lead
to live lock on PREEMPT_RT if the high prior reader preempts the
writer. Additionally if the per-CPU lock on PREEMPT_RT is removed from
local_bh_disable() then there is no synchronisation for the per-CPU
sequence counter.
The affected code is "just" the legacy netfilter code which is replaced
by "netfilter tables". That code can be disabled without sacrificing
functionality because everything is provided by the newer
implementation. This will only requires the usage of the "-nft" tools
instead of the "-legacy" ones.
The long term plan is to remove the legacy code so lets accelerate the
progress.
Relax dependencies on iptables legacy, replace select with depends on,
this should cause no harm to existing kernel configs and users can still
toggle IP{6}_NF_IPTABLES_LEGACY in any case.
Make EBTABLES_LEGACY, IPTABLES_LEGACY and ARPTABLES depend on
NETFILTER_XTABLES_LEGACY. Hide xt_recseq and its users,
xt_register_table() and xt_percpu_counter_alloc() behind
NETFILTER_XTABLES_LEGACY. Let NETFILTER_XTABLES_LEGACY depend on
!PREEMPT_RT.
This will break selftest expecing the legacy options enabled and will be
addressed in a following patch.
Co-developed-by: Florian Westphal <fw@strlen.de>
Co-developed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Cross-merge networking fixes after downstream PR (net-6.16-rc8).
Conflicts:
drivers/net/ethernet/microsoft/mana/gdma_main.c
9669ddda18fb ("net: mana: Fix warnings for missing export.h header inclusion")
755391121038 ("net: mana: Allocate MSI-X vectors dynamically")
https://lore.kernel.org/20250711130752.23023d98@canb.auug.org.au
Adjacent changes:
drivers/net/ethernet/ti/icssg/icssg_prueth.h
6e86fb73de0f ("net: ti: icssg-prueth: Fix buffer allocation for ICSSG")
ffe8a4909176 ("net: ti: icssg-prueth: Read firmware-names from device tree")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
Steffen Klassert says:
====================
pull request (net): ipsec 2025-07-23
1) Premption fixes for xfrm_state_find.
From Sabrina Dubroca.
2) Initialize offload path also for SW IPsec GRO. This fixes a
performance regression on SW IPsec offload.
From Leon Romanovsky.
3) Fix IPsec UDP GRO for IKE packets.
From Tobias Brunner,
4) Fix transport header setting for IPcomp after decompressing.
From Fernando Fernandez Mancera.
5) Fix use-after-free when xfrmi_changelink tries to change
collect_md for a xfrm interface.
From Eyal Birger .
6) Delete the special IPcomp x->tunnel state along with the state x
to avoid refcount problems.
From Sabrina Dubroca.
Please pull or let me know if there are problems.
* tag 'ipsec-2025-07-23' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec:
Revert "xfrm: destroy xfrm_state synchronously on net exit path"
xfrm: delete x->tunnel as we delete x
xfrm: interface: fix use-after-free after changing collect_md xfrm interface
xfrm: ipcomp: adjust transport header after decompressing
xfrm: Set transport header to fix UDP GRO handling
xfrm: always initialize offload path
xfrm: state: use a consistent pcpu_id in xfrm_state_find
xfrm: state: initialize state_ptrs earlier in xfrm_state_find
====================
Link: https://patch.msgid.link/20250723075417.3432644-1-steffen.klassert@secunet.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
The mentioned MIB is currently incremented even when a packet
with an old sequence number (i.e. a zero window probe) is received,
which is IMHO misleading.
Explicitly restrict such MIB increment at the relevant events.
Fixes: 6c758062c64d ("tcp: add LINUX_MIB_BEYOND_WINDOW")
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Link: https://patch.msgid.link/20d147292eb4b13b6535e0ad6f56be64d9c330d3.1753118029.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The nipa CI is reporting frequent failures in the mptcp_connect
self-tests.
In the failing scenarios (TCP -> MPTCP) the involved sockets are
actually plain TCP ones, as fallback for passive socket at 2whs
time cause the MPTCP listener to actually create a TCP socket.
The transfer is stuck due to the receiver buffer being zero.
With the stronger check in place, tcp_clamp_window() can be invoked
while the TCP socket has sk_rmem_alloc == 0, and the receive buffer
will be zeroed, too.
Check for the critical condition in tcp_prune_queue() and just
drop the packet without shrinking the receiver buffer.
Fixes: 1d2fbaad7cd8 ("tcp: stronger sk_rcvbuf checks")
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20c18165d3f848e1c5c1b782d88c1a5ab38b3f70.1753118029.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Background
==========
When TCP retransmits a packet due to missing ACKs, the
retransmission may fail for various reasons (e.g., packets
stuck in driver queues, receiver zero windows, or routing issues).
The original tcp_retransmit_skb tracepoint:
'commit e086101b150a ("tcp: add a tracepoint for tcp retransmission")'
lacks visibility into these failure causes, making production
diagnostics difficult.
Solution
========
Adds the retval("err") to the tcp_retransmit_skb tracepoint.
Enables users to know why some tcp retransmission failed and
users can filter retransmission failures by retval.
Compatibility description
=========================
This patch extends the tcp_retransmit_skb tracepoint
by adding a new "err" field at the end of its
existing structure (within TP_STRUCT__entry). The
compatibility implications are detailed as follows:
1) Structural compatibility for legacy user-space tools
Legacy tools/BPF programs accessing existing fields
(by offset or name) can still work without modification
or recompilation.The new field is appended to the end,
preserving original memory layout.
2) Note: semantic changes
The original tracepoint primarily only focused on
successfully retransmitted packets. With this patch,
the tracepoint now can figure out packets that may
terminate early due to specific reasons. For accurate
statistics, users should filter using "err" to
distinguish outcomes.
Before patched:
field:const void * skbaddr; offset:8; size:8; signed:0;
field:const void * skaddr; offset:16; size:8; signed:0;
field:int state; offset:24; size:4; signed:1;
field:__u16 sport; offset:28; size:2; signed:0;
field:__u16 dport; offset:30; size:2; signed:0;
field:__u16 family; offset:32; size:2; signed:0;
field:__u8 saddr[4]; offset:34; size:4; signed:0;
field:__u8 daddr[4]; offset:38; size:4; signed:0;
field:__u8 saddr_v6[16]; offset:42; size:16; signed:0;
field:__u8 daddr_v6[16]; offset:58; size:16; signed:0;
print fmt: "skbaddr=%p skaddr=%p family=%s sport=%hu dport=%hu saddr=%pI4 daddr=%pI4 saddrv6=%pI6c daddrv6=%pI6c state=%s"
After patched:
field:const void * skbaddr; offset:8; size:8; signed:0;
field:const void * skaddr; offset:16; size:8; signed:0;
field:int state; offset:24; size:4; signed:1;
field:__u16 sport; offset:28; size:2; signed:0;
field:__u16 dport; offset:30; size:2; signed:0;
field:__u16 family; offset:32; size:2; signed:0;
field:__u8 saddr[4]; offset:34; size:4; signed:0;
field:__u8 daddr[4]; offset:38; size:4; signed:0;
field:__u8 saddr_v6[16]; offset:42; size:16; signed:0;
field:__u8 daddr_v6[16]; offset:58; size:16; signed:0;
field:int err; offset:76; size:4; signed:1;
print fmt: "skbaddr=%p skaddr=%p family=%s sport=%hu dport=%hu saddr=%pI4 daddr=%pI4 saddrv6=%pI6c daddrv6=%pI6c state=%s err=%d"
Co-developed-by: xu xin <xu.xin16@zte.com.cn>
Signed-off-by: xu xin <xu.xin16@zte.com.cn>
Signed-off-by: Fan Yu <fan.yu9@zte.com.cn>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250721111607626_BDnIJB0ywk6FghN63bor@zte.com.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add a helper tcp_sock_set_maxseg() to directly set the TCP_MAXSEG
sockopt from kernel space.
This new helper will be used in the following patch from MPTCP.
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250719-net-next-mptcp-tcp_maxseg-v2-2-8c910fbc5307@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Commit cc34acd577f1 ("docs: net: document new locking reality")
introduced netif_ vs dev_ function semantics: the former expects locked
netdev, the latter takes care of the locking. We don't strictly
follow this semantics on either side, but there are more dev_xxx handlers
now that don't fit. Rename them to netif_xxx where appropriate.
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250717172333.1288349-6-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Commit cc34acd577f1 ("docs: net: document new locking reality")
introduced netif_ vs dev_ function semantics: the former expects locked
netdev, the latter takes care of the locking. We don't strictly
follow this semantics on either side, but there are more dev_xxx handlers
now that don't fit. Rename them to netif_xxx where appropriate.
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250717172333.1288349-2-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add a new SKB drop reason (SKB_DROP_REASON_PFMEMALLOC) to track packets
dropped due to memory pressure. In production environments, we've observed
memory exhaustion reported by memory layer stack traces, but these drops
were not properly tracked in the SKB drop reason infrastructure.
While most network code paths now properly report pfmemalloc drops, some
protocol-specific socket implementations still use sk_filter() without
drop reason tracking:
- Bluetooth L2CAP sockets
- CAIF sockets
- IUCV sockets
- Netlink sockets
- SCTP sockets
- Unix domain sockets
These remaining cases represent less common paths and could be converted
in a follow-up patch if needed. The current implementation provides
significantly improved observability into memory pressure events in the
network stack, especially for key protocols like TCP and UDP, helping to
diagnose problems in production environments.
Reported-by: Matt Fleming <mfleming@cloudflare.com>
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Link: https://patch.msgid.link/175268316579.2407873.11634752355644843509.stgit@firesoul
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Martin KaFai Lau says:
====================
pull-request: bpf-next 2025-07-17
We've added 13 non-merge commits during the last 20 day(s) which contain
a total of 4 files changed, 712 insertions(+), 84 deletions(-).
The main changes are:
1) Avoid skipping or repeating a sk when using a TCP bpf_iter,
from Jordan Rife.
2) Clarify the driver requirement on using the XDP metadata,
from Song Yoong Siang
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next:
doc: xdp: Clarify driver implementation for XDP Rx metadata
selftests/bpf: Add tests for bucket resume logic in established sockets
selftests/bpf: Create iter_tcp_destroy test program
selftests/bpf: Create established sockets in socket iterator tests
selftests/bpf: Make ehash buckets configurable in socket iterator tests
selftests/bpf: Allow for iteration over multiple states
selftests/bpf: Allow for iteration over multiple ports
selftests/bpf: Add tests for bucket resume logic in listening sockets
bpf: tcp: Avoid socket skips and repeats during iteration
bpf: tcp: Use bpf_tcp_iter_batch_item for bpf_tcp_iter_state batch items
bpf: tcp: Get rid of st_bucket_done
bpf: tcp: Make sure iter->batch always contains a full bucket snapshot
bpf: tcp: Make mem flags configurable through bpf_iter_tcp_realloc_batch
====================
Link: https://patch.msgid.link/20250717191731.4142326-1-martin.lau@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
neigh_add() updates pneigh_entry() found or created by pneigh_create().
This update is serialised by RTNL, but we will remove it.
Let's move the update part to pneigh_create() and make it return errno
instead of a pointer of pneigh_entry.
Now, the pneigh code is RTNL free.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-16-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
pneigh_lookup() has ASSERT_RTNL() in the middle of the function, which
is confusing.
When called with the last argument, creat, 0, pneigh_lookup() literally
looks up a proxy neighbour entry. This is the case of the reader path
as the fast path and RTM_GETNEIGH.
pneigh_lookup(), however, creates a pneigh_entry when called with creat 1
from RTM_NEWNEIGH and SIOCSARP, which require RTNL.
Let's split pneigh_lookup() into two functions.
We will convert all the reader paths to RCU, and read_lock_bh(&tbl->lock)
in the new pneigh_lookup() will be dropped.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250716221221.442239-6-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Cross-merge networking fixes after downstream PR (net-6.16-rc7).
Conflicts:
Documentation/netlink/specs/ovpn.yaml
880d43ca9aa4 ("netlink: specs: clean up spaces in brackets")
af52020fc599 ("ovpn: reject unexpected netlink attributes")
drivers/net/phy/phy_device.c
a44312d58e78 ("net: phy: Don't register LEDs for genphy")
f0f2b992d818 ("net: phy: Don't register LEDs for genphy")
https://lore.kernel.org/20250710114926.7ec3a64f@kernel.org
drivers/net/wireless/intel/iwlwifi/fw/regulatory.c
drivers/net/wireless/intel/iwlwifi/mld/regulatory.c
5fde0fcbd760 ("wifi: iwlwifi: mask reserved bits in chan_state_active_bitmap")
ea045a0de3b9 ("wifi: iwlwifi: add support for accepting raw DSM tables by firmware")
net/ipv6/mcast.c
ae3264a25a46 ("ipv6: mcast: Delay put pmc->idev in mld_del_delrec()")
a8594c956cc9 ("ipv6: mcast: Avoid a duplicate pointer check in mld_del_delrec()")
https://lore.kernel.org/8cc52891-3653-4b03-a45e-05464fe495cf@kernel.org
No adjacent changes.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Since "net: gro: use cb instead of skb->network_header", the skb network
header is no longer set in the GRO path.
This breaks fraglist segmentation, which relies on ip_hdr()/tcp_hdr()
to check for address/port changes.
Fix this regression by selectively setting the network header for merged
segment skbs.
Fixes: 186b1ea73ad8 ("net: gro: use cb instead of skb->network_header")
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250705150622.10699-1-nbd@nbd.name
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
The CI reported a UaF in tcp_prune_ofo_queue():
BUG: KASAN: slab-use-after-free in tcp_prune_ofo_queue+0x55d/0x660
Read of size 4 at addr ffff8880134729d8 by task socat/20348
CPU: 0 UID: 0 PID: 20348 Comm: socat Not tainted 6.16.0-rc5-virtme #1 PREEMPT(full)
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
Call Trace:
<TASK>
dump_stack_lvl+0x82/0xd0
print_address_description.constprop.0+0x2c/0x400
print_report+0xb4/0x270
kasan_report+0xca/0x100
tcp_prune_ofo_queue+0x55d/0x660
tcp_try_rmem_schedule+0x855/0x12e0
tcp_data_queue+0x4dd/0x2260
tcp_rcv_established+0x5e8/0x2370
tcp_v4_do_rcv+0x4ba/0x8c0
__release_sock+0x27a/0x390
release_sock+0x53/0x1d0
tcp_sendmsg+0x37/0x50
sock_write_iter+0x3c1/0x520
vfs_write+0xc09/0x1210
ksys_write+0x183/0x1d0
do_syscall_64+0xc1/0x380
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fcf73ef2337
Code: 0f 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
RSP: 002b:00007ffd4f924708 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fcf73ef2337
RDX: 0000000000002000 RSI: 0000555f11d1a000 RDI: 0000000000000008
RBP: 0000555f11d1a000 R08: 0000000000002000 R09: 0000000000000000
R10: 0000000000000040 R11: 0000000000000246 R12: 0000000000000008
R13: 0000000000002000 R14: 0000555ee1a44570 R15: 0000000000002000
</TASK>
Allocated by task 20348:
kasan_save_stack+0x24/0x50
kasan_save_track+0x14/0x30
__kasan_slab_alloc+0x59/0x70
kmem_cache_alloc_node_noprof+0x110/0x340
__alloc_skb+0x213/0x2e0
tcp_collapse+0x43f/0xff0
tcp_try_rmem_schedule+0x6b9/0x12e0
tcp_data_queue+0x4dd/0x2260
tcp_rcv_established+0x5e8/0x2370
tcp_v4_do_rcv+0x4ba/0x8c0
__release_sock+0x27a/0x390
release_sock+0x53/0x1d0
tcp_sendmsg+0x37/0x50
sock_write_iter+0x3c1/0x520
vfs_write+0xc09/0x1210
ksys_write+0x183/0x1d0
do_syscall_64+0xc1/0x380
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Freed by task 20348:
kasan_save_stack+0x24/0x50
kasan_save_track+0x14/0x30
kasan_save_free_info+0x3b/0x60
__kasan_slab_free+0x38/0x50
kmem_cache_free+0x149/0x330
tcp_prune_ofo_queue+0x211/0x660
tcp_try_rmem_schedule+0x855/0x12e0
tcp_data_queue+0x4dd/0x2260
tcp_rcv_established+0x5e8/0x2370
tcp_v4_do_rcv+0x4ba/0x8c0
__release_sock+0x27a/0x390
release_sock+0x53/0x1d0
tcp_sendmsg+0x37/0x50
sock_write_iter+0x3c1/0x520
vfs_write+0xc09/0x1210
ksys_write+0x183/0x1d0
do_syscall_64+0xc1/0x380
entry_SYSCALL_64_after_hwframe+0x77/0x7f
The buggy address belongs to the object at ffff888013472900
which belongs to the cache skbuff_head_cache of size 232
The buggy address is located 216 bytes inside of
freed 232-byte region [ffff888013472900, ffff8880134729e8)
The buggy address belongs to the physical page:
page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x13472
head: order:1 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
flags: 0x80000000000040(head|node=0|zone=1)
page_type: f5(slab)
raw: 0080000000000040 ffff88800198fb40 ffffea0000347b10 ffffea00004f5290
raw: 0000000000000000 0000000000120012 00000000f5000000 0000000000000000
head: 0080000000000040 ffff88800198fb40 ffffea0000347b10 ffffea00004f5290
head: 0000000000000000 0000000000120012 00000000f5000000 0000000000000000
head: 0080000000000001 ffffea00004d1c81 00000000ffffffff 00000000ffffffff
head: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff888013472880: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff888013472900: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff888013472980: fb fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc
^
ffff888013472a00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff888013472a80: fc fc fc fc fc fc fc fc fa fb fb fb fb fb fb fb
Indeed tcp_prune_ofo_queue() is reusing the skb dropped a few lines
above. The caller wants to enqueue 'in_skb', lets check space vs the
latter.
Fixes: 1d2fbaad7cd8 ("tcp: stronger sk_rcvbuf checks")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Tested-by: syzbot+865aca08c0533171bf6a@syzkaller.appspotmail.com
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/b78d2d9bdccca29021eed9a0e7097dd8dc00f485.1752567053.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Currently, TCP stack accepts incoming packet if sizes of receive queues
are below sk->sk_rcvbuf limit.
This can cause memory overshoot if the packet is big, like an 1/2 MB
BIG TCP one.
Refine the check to take into account the incoming skb truesize.
Note that we still accept the packet if the receive queue is empty,
to not completely freeze TCP flows in pathological conditions.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250711114006.480026-8-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
These functions to not modify the skb, add a const qualifier.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250711114006.480026-7-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
tcp_measure_rcv_mss() is used to update icsk->icsk_ack.rcv_mss
(tcpi_rcv_mss in tcp_info) and tp->scaling_ratio.
Calling it from tcp_data_queue_ofo() makes sure these
fields are updated, and permits a better tuning
of sk->sk_rcvbuf, in the case a new flow receives many ooo
packets.
Fixes: dfa2f0483360 ("tcp: get rid of sysctl_tcp_adv_win_scale")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250711114006.480026-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add a new SNMP MIB : LINUX_MIB_BEYOND_WINDOW
Incremented when an incoming packet is received beyond the
receiver window.
nstat -az | grep TcpExtBeyondWindow
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250711114006.480026-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Currently, TCP accepts incoming packets which might go beyond the
offered RWIN.
Add to tcp_sequence() the validation of packet end sequence.
Add the corresponding check in the fast path.
We relax this new constraint if the receive queue is empty,
to not freeze flows from buggy peers.
Add a new drop reason : SKB_DROP_REASON_TCP_INVALID_END_SEQUENCE.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250711114006.480026-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Currently, __mkroute_output overrules the MTU value configured for
broadcast routes.
This buggy behaviour can be reproduced with:
ip link set dev eth1 mtu 9000
ip route del broadcast 192.168.0.255 dev eth1 proto kernel scope link src 192.168.0.2
ip route add broadcast 192.168.0.255 dev eth1 proto kernel scope link src 192.168.0.2 mtu 1500
The maximum packet size should be 1500, but it is actually 8000:
ping -b 192.168.0.255 -s 8000
Fix __mkroute_output to allow MTU values to be configured for
for broadcast routes (to support a mixed-MTU local-area-network).
Signed-off-by: Oscar Maes <oscmaes92@gmail.com>
Link: https://patch.msgid.link/20250710142714.12986-1-oscmaes92@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Replace the offset-based approach for tracking progress through a bucket
in the TCP table with one based on socket cookies. Remember the cookies
of unprocessed sockets from the last batch and use this list to
pick up where we left off or, in the case that the next socket
disappears between reads, find the first socket after that point that
still exists in the bucket and resume from there.
This approach guarantees that all sockets that existed when iteration
began and continue to exist throughout will be visited exactly once.
Sockets that are added to the table during iteration may or may not be
seen, but if they are they will be seen exactly once.
Signed-off-by: Jordan Rife <jordan@jrife.io>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
|
|
Prepare for the next patch that tracks cookies between iterations by
converting struct sock **batch to union bpf_tcp_iter_batch_item *batch
inside struct bpf_tcp_iter_state.
Signed-off-by: Jordan Rife <jordan@jrife.io>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
|
|
Get rid of the st_bucket_done field to simplify TCP iterator state and
logic. Before, st_bucket_done could be false if bpf_iter_tcp_batch
returned a partial batch; however, with the last patch ("bpf: tcp: Make
sure iter->batch always contains a full bucket snapshot"),
st_bucket_done == true is equivalent to iter->cur_sk == iter->end_sk.
Signed-off-by: Jordan Rife <jordan@jrife.io>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
|
|
Require that iter->batch always contains a full bucket snapshot. This
invariant is important to avoid skipping or repeating sockets during
iteration when combined with the next few patches. Before, there were
two cases where a call to bpf_iter_tcp_batch may only capture part of a
bucket:
1. When bpf_iter_tcp_realloc_batch() returns -ENOMEM.
2. When more sockets are added to the bucket while calling
bpf_iter_tcp_realloc_batch(), making the updated batch size
insufficient.
In cases where the batch size only covers part of a bucket, it is
possible to forget which sockets were already visited, especially if we
have to process a bucket in more than two batches. This forces us to
choose between repeating or skipping sockets, so don't allow this:
1. Stop iteration and propagate -ENOMEM up to userspace if reallocation
fails instead of continuing with a partial batch.
2. Try bpf_iter_tcp_realloc_batch() with GFP_USER just as before, but if
we still aren't able to capture the full bucket, call
bpf_iter_tcp_realloc_batch() again while holding the bucket lock to
guarantee the bucket does not change. On the second attempt use
GFP_NOWAIT since we hold onto the spin lock.
I did some manual testing to exercise the code paths where GFP_NOWAIT is
used and where ERR_PTR(err) is returned. I used the realloc test cases
included later in this series to trigger a scenario where a realloc
happens inside bpf_iter_tcp_batch and made a small code tweak to force
the first realloc attempt to allocate a too-small batch, thus requiring
another attempt with GFP_NOWAIT. Some printks showed both reallocs with
the tests passing:
Jun 27 00:00:53 crow kernel: again GFP_USER
Jun 27 00:00:53 crow kernel: again GFP_NOWAIT
Jun 27 00:00:53 crow kernel: again GFP_USER
Jun 27 00:00:53 crow kernel: again GFP_NOWAIT
With this setup, I also forced each of the bpf_iter_tcp_realloc_batch
calls to return -ENOMEM to ensure that iteration ends and that the
read() in userspace fails.
Signed-off-by: Jordan Rife <jordan@jrife.io>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
|
|
Prepare for the next patch which needs to be able to choose either
GFP_USER or GFP_NOWAIT for calls to bpf_iter_tcp_realloc_batch.
Signed-off-by: Jordan Rife <jordan@jrife.io>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
|
|
Cross-merge networking fixes after downstream PR (net-6.16-rc6).
No conflicts.
Adjacent changes:
Documentation/devicetree/bindings/net/allwinner,sun8i-a83t-emac.yaml
0a12c435a1d6 ("dt-bindings: net: sun8i-emac: Add A100 EMAC compatible")
b3603c0466a8 ("dt-bindings: net: sun8i-emac: Rename A523 EMAC0 to GMAC0")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Commit b35108a51cf7 ("jiffies: Define secs_to_jiffies()") introduced
secs_to_jiffies(). As the value here is a multiple of 1000, use
secs_to_jiffies() instead of msecs_to_jiffies to avoid the multiplication.
This is converted using scripts/coccinelle/misc/secs_to_jiffies.cocci with
the following Coccinelle rules:
@depends on patch@
expression E;
@@
-msecs_to_jiffies(E * 1000)
+secs_to_jiffies(E)
-msecs_to_jiffies(E * MSEC_PER_SEC)
+secs_to_jiffies(E)
While here, manually convert a couple timeouts denominated in seconds
Signed-off-by: Easwar Hariharan <eahariha@linux.microsoft.com>
Link: https://patch.msgid.link/20250707-netdev-secs-to-jiffies-part-2-v2-2-b7817036342f@linux.microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
When a passive flow has not been accepted yet, it is
not wise to increase sk_rcvbuf when receiving ooo packets.
A very busy server might tune down tcp_rmem[1] to better
control how much memory can be used by sockets waiting
in its listeners accept queues.
Fixes: 63ad7dfedfae ("tcp: adjust rcvbuf in presence of reorders")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250707213900.1543248-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Use DEFINE_MUTEX() to initialize udp_tunnel_gro_type_lock.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250707091634.311974-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
As RACK-TLP was published as a standards-track RFC8985,
so the outdated ref draft-ietf-tcpm-rack need to be updated.
Signed-off-by: Xin Guo <guoxin0309@gmail.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Link: https://patch.msgid.link/20250705163647.301231-1-guoxin0309@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Since its introduction in commit 2e910b95329c ("net: Add a function to
splice pages into an skbuff for MSG_SPLICE_PAGES"), skb_splice_from_iter()
never used the @gfp argument. Remove it and adapt callers.
No functional change intended.
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Link: https://patch.msgid.link/20250702-splice-drop-unused-v3-2-55f68b60d2b7@rbox.co
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Syzkaller reported a bug [1] where sk->sk_forward_alloc can overflow.
When we send data, if an skb exists at the tail of the write queue, the
kernel will attempt to append the new data to that skb. However, the code
that checks for available space in the skb is flawed:
'''
copy = size_goal - skb->len
'''
The types of the variables involved are:
'''
copy: ssize_t (s64 on 64-bit systems)
size_goal: int
skb->len: unsigned int
'''
Due to C's type promotion rules, the signed size_goal is converted to an
unsigned int to match skb->len before the subtraction. The result is an
unsigned int.
When this unsigned int result is then assigned to the s64 copy variable,
it is zero-extended, preserving its non-negative value. Consequently, copy
is always >= 0.
Assume we are sending 2GB of data and size_goal has been adjusted to a
value smaller than skb->len. The subtraction will result in copy holding a
very large positive integer. In the subsequent logic, this large value is
used to update sk->sk_forward_alloc, which can easily cause it to overflow.
The syzkaller reproducer uses TCP_REPAIR to reliably create this
condition. However, this can also occur in real-world scenarios. The
tcp_bound_to_half_wnd() function can also reduce size_goal to a small
value. This would cause the subsequent tcp_wmem_schedule() to set
sk->sk_forward_alloc to a value close to INT_MAX. Further memory
allocation requests would then cause sk_forward_alloc to wrap around and
become negative.
[1]: https://syzkaller.appspot.com/bug?extid=de6565462ab540f50e47
Reported-by: syzbot+de6565462ab540f50e47@syzkaller.appspotmail.com
Fixes: 270a1c3de47e ("tcp: Support MSG_SPLICE_PAGES")
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Howells <dhowells@redhat.com>
Link: https://patch.msgid.link/20250707054112.101081-1-jiayuan.chen@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The ipcomp fallback tunnels currently get deleted (from the various
lists and hashtables) as the last user state that needed that fallback
is destroyed (not deleted). If a reference to that user state still
exists, the fallback state will remain on the hashtables/lists,
triggering the WARN in xfrm_state_fini. Because of those remaining
references, the fix in commit f75a2804da39 ("xfrm: destroy xfrm_state
synchronously on net exit path") is not complete.
We recently fixed one such situation in TCP due to defered freeing of
skbs (commit 9b6412e6979f ("tcp: drop secpath at the same time as we
currently drop dst")). This can also happen due to IP reassembly: skbs
with a secpath remain on the reassembly queue until netns
destruction. If we can't guarantee that the queues are flushed by the
time xfrm_state_fini runs, there may still be references to a (user)
xfrm_state, preventing the timely deletion of the corresponding
fallback state.
Instead of chasing each instance of skbs holding a secpath one by one,
this patch fixes the issue directly within xfrm, by deleting the
fallback state as soon as the last user state depending on it has been
deleted. Destruction will still happen when the final reference is
dropped.
A separate lockdep class for the fallback state is required since
we're going to lock x->tunnel while x is locked.
Fixes: 9d4139c76905 ("netns xfrm: per-netns xfrm_state_all list")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
|
Cross-merge networking fixes after downstream PR (net-6.16-rc5).
No conflicts.
No adjacent changes.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
MSG_ZEROCOPY data must be copied before data is queued to local
sockets, to avoid indefinite timeout until memory release.
This test is performed by skb_orphan_frags_rx, which is called when
looping an egress skb to packet sockets, error queue or ingress path.
To preserve zerocopy for skbs that are looped to ingress but are then
forwarded to an egress device rather than delivered locally, defer
this last check until an skb enters the local IP receive path.
This is analogous to existing behavior of skb_clear_delivery_time.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250630194312.1571410-2-willemdebruijn.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
udp_v4_early_demux now returns drop reasons as it either returns 0 or
ip_mc_validate_source, which returns itself a drop reason. However its
use was not converted in ip_rcv_finish_core and the drop reason is
ignored, leading to potentially skipping increasing LINUX_MIB_IPRPFILTER
if the drop reason is SKB_DROP_REASON_IP_RPFILTER.
This is a fix and we're not converting udp_v4_early_demux to explicitly
return a drop reason to ease backports; this can be done as a follow-up.
Fixes: d46f827016d8 ("net: ip: make ip_mc_validate_source() return drop reason")
Cc: Menglong Dong <menglong8.dong@gmail.com>
Reported-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20250701074935.144134-1-atenart@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Use the new helpers as a first step to deal with
potential dst->dev races.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250630121934.3399505-8-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
dst_dev_put() can overwrite dst->output while other
cpus might read this field (for instance from dst_output())
Add READ_ONCE()/WRITE_ONCE() annotations to suppress
potential issues.
We will likely need RCU protection in the future.
Fixes: 4a6ce2b6f2ec ("net: introduce a new function dst_dev_put()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250630121934.3399505-6-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
dst_dev_put() can overwrite dst->input while other
cpus might read this field (for instance from dst_input())
Add READ_ONCE()/WRITE_ONCE() annotations to suppress
potential issues.
We will likely need full RCU protection later.
Fixes: 4a6ce2b6f2ec ("net: introduce a new function dst_dev_put()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250630121934.3399505-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
(dst_entry)->expires is read and written locklessly,
add corresponding annotations.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250630121934.3399505-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
(dst_entry)->obsolete is read locklessly, add corresponding
annotations.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250630121934.3399505-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
____cacheline_aligned_in_smp attribute only makes sure to align
a field to a cache line. It does not prevent the linker to use
the remaining of the cache line for other variables, causing
potential false sharing.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250630093540.3052835-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
____cacheline_aligned_in_smp attribute only makes sure to align
a field to a cache line. It does not prevent the linker to use
the remaining of the cache line for other variables, causing
potential false sharing.
Move tcp_memory_allocated into a dedicated cache line.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250630093540.3052835-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The referenced commit replaced a call to __xfrm4|6_udp_encap_rcv() with
a custom check for non-ESP markers. But what the called function also
did was setting the transport header to the ESP header. The function
that follows, esp4|6_gro_receive(), relies on that being set when it calls
xfrm_parse_spi(). We have to set the full offset as the skb's head was
not moved yet so adding just the UDP header length won't work.
Fixes: e3fd05777685 ("xfrm: Fix UDP GRO handling for some corner cases")
Signed-off-by: Tobias Brunner <tobias@strongswan.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
|
syzbot found at least one path leads to an ip_mr_output()
without RCU being held.
Add guard(rcu)() to fix this in a concise way.
WARNING: CPU: 0 PID: 0 at net/ipv4/ipmr.c:2302 ip_mr_output+0xbb1/0xe70 net/ipv4/ipmr.c:2302
Call Trace:
<IRQ>
igmp_send_report+0x89e/0xdb0 net/ipv4/igmp.c:799
igmp_timer_expire+0x204/0x510 net/ipv4/igmp.c:-1
call_timer_fn+0x17e/0x5f0 kernel/time/timer.c:1747
expire_timers kernel/time/timer.c:1798 [inline]
__run_timers kernel/time/timer.c:2372 [inline]
__run_timer_base+0x61a/0x860 kernel/time/timer.c:2384
run_timer_base kernel/time/timer.c:2393 [inline]
run_timer_softirq+0xb7/0x180 kernel/time/timer.c:2403
handle_softirqs+0x286/0x870 kernel/softirq.c:579
__do_softirq kernel/softirq.c:613 [inline]
invoke_softirq kernel/softirq.c:453 [inline]
__irq_exit_rcu+0xca/0x1f0 kernel/softirq.c:680
irq_exit_rcu+0x9/0x30 kernel/softirq.c:696
instr_sysvec_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1050 [inline]
sysvec_apic_timer_interrupt+0xa6/0xc0 arch/x86/kernel/apic/apic.c:1050
Fixes: 35bec72a24ac ("net: ipv4: Add ip_mr_output()")
Reported-by: syzbot+f02fb9e43bd85c6c66ae@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/685e841a.a00a0220.129264.0002.GAE@google.com/T/#u
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Petr Machata <petrm@nvidia.com>
Cc: Roopa Prabhu <roopa@nvidia.com>
Cc: Nikolay Aleksandrov <razor@blackwall.org>
Cc: Benjamin Poirier <bpoirier@nvidia.com>
Cc: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
If the new coming segment covers more than one skbs in the ofo queue,
and which seq is equal to rcv_nxt, then the sequence range
that is duplicated will be sent as DUP SACK, the detail as below,
in step6, the {501,2001} range is clearly including too much
DUP SACK range, in violation of RFC 2883 rules.
1. client > server: Flags [.], seq 501:1001, ack 1325288529, win 20000, length 500
2. server > client: Flags [.], ack 1, [nop,nop,sack 1 {501:1001}], length 0
3. client > server: Flags [.], seq 1501:2001, ack 1325288529, win 20000, length 500
4. server > client: Flags [.], ack 1, [nop,nop,sack 2 {1501:2001} {501:1001}], length 0
5. client > server: Flags [.], seq 1:2001, ack 1325288529, win 20000, length 2000
6. server > client: Flags [.], ack 2001, [nop,nop,sack 1 {501:2001}], length 0
After this fix, the final ACK is as below:
6. server > client: Flags [.], ack 2001, options [nop,nop,sack 1 {501:1001}], length 0
[edumazet] added a new packetdrill test in the following patch.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: xin.guo <guoxin0309@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250626123420.1933835-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
inet_rtx_syn_ack() is a simple wrapper around tcp_rtx_synack(),
if we move req->num_retrans update.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250626153017.2156274-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Now inet_rtx_syn_ack() is only used by TCP, it can directly
call tcp_rtx_synack() instead of using an indirect call
to req->rsk_ops->rtx_syn_ack().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250626153017.2156274-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|