summaryrefslogtreecommitdiff
path: root/net/ipv4/ip_output.c
AgeCommit message (Collapse)Author
2024-05-02Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR. Conflicts: include/linux/filter.h kernel/bpf/core.c 66e13b615a0c ("bpf: verifier: prevent userspace memory access") d503a04f8bc0 ("bpf: Add support for certain atomics in bpf_arena to x86 JIT") https://lore.kernel.org/all/20240429114939.210328b0@canb.auug.org.au/ No adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-02ipv4: Fix uninit-value access in __ip_make_skb()Shigeru Yoshida
KMSAN reported uninit-value access in __ip_make_skb() [1]. __ip_make_skb() tests HDRINCL to know if the skb has icmphdr. However, HDRINCL can cause a race condition. If calling setsockopt(2) with IP_HDRINCL changes HDRINCL while __ip_make_skb() is running, the function will access icmphdr in the skb even if it is not included. This causes the issue reported by KMSAN. Check FLOWI_FLAG_KNOWN_NH on fl4->flowi4_flags instead of testing HDRINCL on the socket. Also, fl4->fl4_icmp_type and fl4->fl4_icmp_code are not initialized. These are union in struct flowi4 and are implicitly initialized by flowi4_init_output(), but we should not rely on specific union layout. Initialize these explicitly in raw_sendmsg(). [1] BUG: KMSAN: uninit-value in __ip_make_skb+0x2b74/0x2d20 net/ipv4/ip_output.c:1481 __ip_make_skb+0x2b74/0x2d20 net/ipv4/ip_output.c:1481 ip_finish_skb include/net/ip.h:243 [inline] ip_push_pending_frames+0x4c/0x5c0 net/ipv4/ip_output.c:1508 raw_sendmsg+0x2381/0x2690 net/ipv4/raw.c:654 inet_sendmsg+0x27b/0x2a0 net/ipv4/af_inet.c:851 sock_sendmsg_nosec net/socket.c:730 [inline] __sock_sendmsg+0x274/0x3c0 net/socket.c:745 __sys_sendto+0x62c/0x7b0 net/socket.c:2191 __do_sys_sendto net/socket.c:2203 [inline] __se_sys_sendto net/socket.c:2199 [inline] __x64_sys_sendto+0x130/0x200 net/socket.c:2199 do_syscall_64+0xd8/0x1f0 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x6d/0x75 Uninit was created at: slab_post_alloc_hook mm/slub.c:3804 [inline] slab_alloc_node mm/slub.c:3845 [inline] kmem_cache_alloc_node+0x5f6/0xc50 mm/slub.c:3888 kmalloc_reserve+0x13c/0x4a0 net/core/skbuff.c:577 __alloc_skb+0x35a/0x7c0 net/core/skbuff.c:668 alloc_skb include/linux/skbuff.h:1318 [inline] __ip_append_data+0x49ab/0x68c0 net/ipv4/ip_output.c:1128 ip_append_data+0x1e7/0x260 net/ipv4/ip_output.c:1365 raw_sendmsg+0x22b1/0x2690 net/ipv4/raw.c:648 inet_sendmsg+0x27b/0x2a0 net/ipv4/af_inet.c:851 sock_sendmsg_nosec net/socket.c:730 [inline] __sock_sendmsg+0x274/0x3c0 net/socket.c:745 __sys_sendto+0x62c/0x7b0 net/socket.c:2191 __do_sys_sendto net/socket.c:2203 [inline] __se_sys_sendto net/socket.c:2199 [inline] __x64_sys_sendto+0x130/0x200 net/socket.c:2199 do_syscall_64+0xd8/0x1f0 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x6d/0x75 CPU: 1 PID: 15709 Comm: syz-executor.7 Not tainted 6.8.0-11567-gb3603fcb79b1 #25 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-1.fc39 04/01/2014 Fixes: 99e5acae193e ("ipv4: Fix potential uninit variable access bug in __ip_make_skb()") Reported-by: syzkaller <syzkaller@googlegroups.com> Signed-off-by: Shigeru Yoshida <syoshida@redhat.com> Link: https://lore.kernel.org/r/20240430123945.2057348-1-syoshida@redhat.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-04-30inet: introduce dst_rtable() helperEric Dumazet
I added dst_rt6_info() in commit e8dfd42c17fa ("ipv6: introduce dst_rt6_info() helper") This patch does a similar change for IPv4. Instead of (struct rtable *)dst casts, we can use : #define dst_rtable(_ptr) \ container_of_const(_ptr, struct rtable, dst) Patch is smaller than IPv6 one, because IPv4 has skb_rtable() helper. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://lore.kernel.org/r/20240429133009.1227754-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-18Revert "net: Re-use and set mono_delivery_time bit for userspace tstamp packets"Abhishek Chauhan
This reverts commit 885c36e59f46375c138de18ff1692f18eff67b7f. The patch currently broke the bpf selftest test_tc_dtime because uapi field __sk_buff->tstamp_type depends on skb->mono_delivery_time which does not necessarily mean mono with the original fix as the bit was re-used for userspace timestamp as well to avoid tstamp reset in the forwarding path. To solve this we need to keep mono_delivery_time as is and introduce another bit called user_delivery_time and fall back to the initial proposal of setting the user_delivery_time bit based on sk_clockid set from userspace. Fixes: 885c36e59f46 ("net: Re-use and set mono_delivery_time bit for userspace tstamp packets") Link: https://lore.kernel.org/netdev/bc037db4-58bb-4861-ac31-a361a93841d3@linux.dev/ Signed-off-by: Abhishek Chauhan <quic_abchauha@quicinc.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-03-05net: Re-use and set mono_delivery_time bit for userspace tstamp packetsAbhishek Chauhan
Bridge driver today has no support to forward the userspace timestamp packets and ends up resetting the timestamp. ETF qdisc checks the packet coming from userspace and encounters to be 0 thereby dropping time sensitive packets. These changes will allow userspace timestamps packets to be forwarded from the bridge to NIC drivers. Setting the same bit (mono_delivery_time) to avoid dropping of userspace tstamp packets in the forwarding path. Existing functionality of mono_delivery_time remains unaltered here, instead just extended with userspace tstamp support for bridge forwarding path. Signed-off-by: Abhishek Chauhan <quic_abchauha@quicinc.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/20240301201348.2815102-1-quic_abchauha@quicinc.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-02-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR. No conflicts. Adjacent changes: net/core/dev.c 9f30831390ed ("net: add rcu safety to rtnl_prop_list_size()") 723de3ebef03 ("net: free altname using an RCU callback") net/unix/garbage.c 11498715f266 ("af_unix: Remove io_uring code for GC.") 25236c91b5ab ("af_unix: Fix task hung while purging oob_skb in GC.") drivers/net/ethernet/renesas/ravb_main.c ed4adc07207d ("net: ravb: Count packets instead of descriptors in GbEth RX path" ) c2da9408579d ("ravb: Add Rx checksum offload support for GbEth") net/mptcp/protocol.c bdd70eb68913 ("mptcp: drop the push_pending field") 28e5c1380506 ("mptcp: annotate lockless accesses around read-mostly fields") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-15net-timestamp: make sk_tskey more predictable in error pathVadim Fedorenko
When SOF_TIMESTAMPING_OPT_ID is used to ambiguate timestamped datagrams, the sk_tskey can become unpredictable in case of any error happened during sendmsg(). Move increment later in the code and make decrement of sk_tskey in error path. This solution is still racy in case of multiple threads doing snedmsg() over the very same socket in parallel, but still makes error path much more predictable. Fixes: 09c2d251b707 ("net-timestamp: add key to disambiguate concurrent datagrams") Reported-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Vadim Fedorenko <vadfed@meta.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/20240213110428.1681540-1-vadfed@meta.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-02-12ipv4: Set the routing scope properly in ip_route_output_ports().Guillaume Nault
Set scope automatically in ip_route_output_ports() (using the socket SOCK_LOCALROUTE flag). This way, callers don't have to overload the tos with the RTO_ONLINK flag, like RT_CONN_FLAGS() does. For callers that don't pass a struct sock, this doesn't change anything as the scope is still set to RT_SCOPE_UNIVERSE when sk is NULL. Callers that passed a struct sock and used RT_CONN_FLAGS(sk) or RT_CONN_FLAGS_TOS(sk, tos) for the tos are modified to use ip_sock_tos(sk) and RT_TOS(tos) respectively, as overloading tos with the RTO_ONLINK flag now becomes unnecessary. In drivers/net/amt.c, all ip_route_output_ports() calls use a 0 tos parameter, ignoring the SOCK_LOCALROUTE flag of the socket. But the sk parameter is a kernel socket, which doesn't have any configuration path for setting SOCK_LOCALROUTE anyway. Therefore, ip_route_output_ports() will continue to initialise scope with RT_SCOPE_UNIVERSE and amt.c doesn't need to be modified. Also, remove RT_CONN_FLAGS() and RT_CONN_FLAGS_TOS() from route.h as these macros are now unused. The objective is to eventually remove RTO_ONLINK entirely to allow converting ->flowi4_tos to dscp_t. This will ensure proper isolation between the DSCP and ECN bits, thus minimising the risk of introducing bugs where TOS values interfere with ECN. Signed-off-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/dacfd2ab40685e20959ab7b53c427595ba229e7d.1707496938.git.gnault@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-31net: ipv4: fix a memleak in ip_setup_corkZhipeng Lu
When inetdev_valid_mtu fails, cork->opt should be freed if it is allocated in ip_setup_cork. Otherwise there could be a memleak. Fixes: 501a90c94510 ("inet: protect against too small mtu values.") Signed-off-by: Zhipeng Lu <alexious@zju.edu.cn> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20240129091017.2938835-1-alexious@zju.edu.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-20net: fix IPSTATS_MIB_OUTPKGS increment in OutForwDatagrams.Heng Guo
Reproduce environment: network with 3 VM linuxs is connected as below: VM1<---->VM2(latest kernel 6.5.0-rc7)<---->VM3 VM1: eth0 ip: 192.168.122.207 MTU 1500 VM2: eth0 ip: 192.168.122.208, eth1 ip: 192.168.123.224 MTU 1500 VM3: eth0 ip: 192.168.123.240 MTU 1500 Reproduce: VM1 send 1400 bytes UDP data to VM3 using tools scapy with flags=0. scapy command: send(IP(dst="192.168.123.240",flags=0)/UDP()/str('0'*1400),count=1, inter=1.000000) Result: Before IP data is sent. ---------------------------------------------------------------------- root@qemux86-64:~# cat /proc/net/snmp Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors ForwDatagrams InUnknownProtos InDiscards InDelivers OutRequests OutDiscards OutNoRoutes ReasmTimeout ReasmReqds ReasmOKs ReasmFails FragOKs FragFails FragCreates Ip: 1 64 11 0 3 4 0 0 4 7 0 0 0 0 0 0 0 0 0 ...... ---------------------------------------------------------------------- After IP data is sent. ---------------------------------------------------------------------- root@qemux86-64:~# cat /proc/net/snmp Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors ForwDatagrams InUnknownProtos InDiscards InDelivers OutRequests OutDiscards OutNoRoutes ReasmTimeout ReasmReqds ReasmOKs ReasmFails FragOKs FragFails FragCreates Ip: 1 64 12 0 3 5 0 0 4 8 0 0 0 0 0 0 0 0 0 ...... ---------------------------------------------------------------------- "ForwDatagrams" increase from 4 to 5 and "OutRequests" also increase from 7 to 8. Issue description and patch: IPSTATS_MIB_OUTPKTS("OutRequests") is counted with IPSTATS_MIB_OUTOCTETS ("OutOctets") in ip_finish_output2(). According to RFC 4293, it is "OutOctets" counted with "OutTransmits" but not "OutRequests". "OutRequests" does not include any datagrams counted in "ForwDatagrams". ipSystemStatsOutOctets OBJECT-TYPE DESCRIPTION "The total number of octets in IP datagrams delivered to the lower layers for transmission. Octets from datagrams counted in ipIfStatsOutTransmits MUST be counted here. ipSystemStatsOutRequests OBJECT-TYPE DESCRIPTION "The total number of IP datagrams that local IP user- protocols (including ICMP) supplied to IP in requests for transmission. Note that this counter does not include any datagrams counted in ipSystemStatsOutForwDatagrams. So do patch to define IPSTATS_MIB_OUTPKTS to "OutTransmits" and add IPSTATS_MIB_OUTREQUESTS for "OutRequests". Add IPSTATS_MIB_OUTREQUESTS counter in __ip_local_out() for ipv4 and add IPSTATS_MIB_OUT counter in ip6_finish_output2() for ipv6. Test result with patch: Before IP data is sent. ---------------------------------------------------------------------- root@qemux86-64:~# cat /proc/net/snmp Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors ForwDatagrams InUnknownProtos InDiscards InDelivers OutRequests OutDiscards OutNoRoutes ReasmTimeout ReasmReqds ReasmOKs ReasmFails FragOKs FragFails FragCreates OutTransmits Ip: 1 64 9 0 5 1 0 0 3 3 0 0 0 0 0 0 0 0 0 4 ...... root@qemux86-64:~# cat /proc/net/netstat ...... IpExt: InNoRoutes InTruncatedPkts InMcastPkts OutMcastPkts InBcastPkts OutBcastPkts InOctets OutOctets InMcastOctets OutMcastOctets InBcastOctets OutBcastOctets InCsumErrors InNoECTPkts InECT1Pkts InECT0Pkts InCEPkts ReasmOverlaps IpExt: 0 0 0 0 0 0 2976 1896 0 0 0 0 0 9 0 0 0 0 ---------------------------------------------------------------------- After IP data is sent. ---------------------------------------------------------------------- root@qemux86-64:~# cat /proc/net/snmp Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors ForwDatagrams InUnknownProtos InDiscards InDelivers OutRequests OutDiscards OutNoRoutes ReasmTimeout ReasmReqds ReasmOKs ReasmFails FragOKs FragFails FragCreates OutTransmits Ip: 1 64 10 0 5 2 0 0 3 3 0 0 0 0 0 0 0 0 0 5 ...... root@qemux86-64:~# cat /proc/net/netstat ...... IpExt: InNoRoutes InTruncatedPkts InMcastPkts OutMcastPkts InBcastPkts OutBcastPkts InOctets OutOctets InMcastOctets OutMcastOctets InBcastOctets OutBcastOctets InCsumErrors InNoECTPkts InECT1Pkts InECT0Pkts InCEPkts ReasmOverlaps IpExt: 0 0 0 0 0 0 4404 3324 0 0 0 0 0 10 0 0 0 0 ---------------------------------------------------------------------- "ForwDatagrams" increase from 1 to 2 and "OutRequests" is keeping 3. "OutTransmits" increase from 4 to 5 and "OutOctets" increase 1428. Signed-off-by: Heng Guo <heng.guo@windriver.com> Reviewed-by: Kun Song <Kun.Song@windriver.com> Reviewed-by: Filip Pudak <filip.pudak@windriver.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-01inet: implement lockless IP_TOSEric Dumazet
Some reads of inet->tos are racy. Add needed READ_ONCE() annotations and convert IP_TOS option lockless. v2: missing changes in include/net/route.h (David Ahern) Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-01inet: implement lockless IP_MTU_DISCOVEREric Dumazet
inet->pmtudisc can be read locklessly. Implement proper lockless reads and writes to inet->pmtudisc ip_sock_set_mtu_discover() can now be called from arbitrary contexts. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-01inet: implement lockless IP_MULTICAST_TTLEric Dumazet
inet->mc_ttl can be read locklessly. Implement proper lockless reads and writes to inet->mc_ttl Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-01net: implement lockless SO_PRIORITYEric Dumazet
This is a followup of 8bf43be799d4 ("net: annotate data-races around sk->sk_priority"). sk->sk_priority can be read and written without holding the socket lock. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-09-01net: annotate data-races around sk->sk_tsflagsEric Dumazet
sk->sk_tsflags can be read locklessly, add corresponding annotations. Fixes: b9f40e21ef42 ("net-timestamp: move timestamp flags out of sk_flags") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-30net: ipv4, ipv6: fix IPSTATS_MIB_OUTOCTETS increment duplicatedHeng Guo
commit edf391ff1723 ("snmp: add missing counters for RFC 4293") had already added OutOctets for RFC 4293. In commit 2d8dbb04c63e ("snmp: fix OutOctets counter to include forwarded datagrams"), OutOctets was counted again, but not removed from ip_output(). According to RFC 4293 "3.2.3. IP Statistics Tables", ipipIfStatsOutTransmits is not equal to ipIfStatsOutForwDatagrams. So "IPSTATS_MIB_OUTOCTETS must be incremented when incrementing" is not accurate. And IPSTATS_MIB_OUTOCTETS should be counted after fragment. This patch reverts commit 2d8dbb04c63e ("snmp: fix OutOctets counter to include forwarded datagrams") and move IPSTATS_MIB_OUTOCTETS to ip_finish_output2 for ipv4. Reviewed-by: Filip Pudak <filip.pudak@windriver.com> Signed-off-by: Heng Guo <heng.guo@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-18lwt: Check LWTUNNEL_XMIT_CONTINUE strictlyYan Zhai
LWTUNNEL_XMIT_CONTINUE is implicitly assumed in ip(6)_finish_output2, such that any positive return value from a xmit hook could cause unexpected continue behavior, despite that related skb may have been freed. This could be error-prone for future xmit hook ops. One of the possible errors is to return statuses of dst_output directly. To make the code safer, redefine LWTUNNEL_XMIT_CONTINUE value to distinguish from dst_output statuses and check the continue condition explicitly. Fixes: 3a0af8fd61f9 ("bpf: BPF for lightweight tunnel infrastructure") Suggested-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Yan Zhai <yan@cloudflare.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/96b939b85eda00e8df4f7c080f770970a4c5f698.1692326837.git.yan@cloudflare.com
2023-08-16inet: implement lockless IP_TTLEric Dumazet
ip_select_ttl() is racy, because it reads inet->uc_ttl without proper locking. Add READ_ONCE()/WRITE_ONCE() annotations while allowing IP_TTL socket option to be set/read without holding the socket lock. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16inet: move inet->hdrincl to inet->inet_flagsEric Dumazet
IP_HDRINCL socket option can now be set/read without locking the socket. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-08net/ipv4: return the real errno instead of -EINVALxu xin
For now, No matter what error pointer ip_neigh_for_gw() returns, ip_finish_output2() always return -EINVAL, which may mislead the upper users. For exemple, an application uses sendto to send an UDP packet, but when the neighbor table overflows, sendto() will get a value of -EINVAL, and it will cause users to waste a lot of time checking parameters for errors. Return the real errno instead of -EINVAL. Signed-off-by: xu xin <xu.xin16@zte.com.cn> Reviewed-by: Yang Yang <yang.yang29@zte.com.cn> Cc: Si Hao <si.hao@zte.com.cn> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Link: https://lore.kernel.org/r/20230807015408.248237-1-xu.xin16@zte.com.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-02udp: Fix __ip_append_data()'s handling of MSG_SPLICE_PAGESDavid Howells
__ip_append_data() can get into an infinite loop when asked to splice into a partially-built UDP message that has more than the frag-limit data and up to the MTU limit. Something like: pipe(pfd); sfd = socket(AF_INET, SOCK_DGRAM, 0); connect(sfd, ...); send(sfd, buffer, 8161, MSG_CONFIRM|MSG_MORE); write(pfd[1], buffer, 8); splice(pfd[0], 0, sfd, 0, 0x4ffe0ul, 0); where the amount of data given to send() is dependent on the MTU size (in this instance an interface with an MTU of 8192). The problem is that the calculation of the amount to copy in __ip_append_data() goes negative in two places, and, in the second place, this gets subtracted from the length remaining, thereby increasing it. This happens when pagedlen > 0 (which happens for MSG_ZEROCOPY and MSG_SPLICE_PAGES), because the terms in: copy = datalen - transhdrlen - fraggap - pagedlen; then mostly cancel when pagedlen is substituted for, leaving just -fraggap. This causes: length -= copy + transhdrlen; to increase the length to more than the amount of data in msg->msg_iter, which causes skb_splice_from_iter() to be unable to fill the request and it returns less than 'copied' - which means that length never gets to 0 and we never exit the loop. Fix this by: (1) Insert a note about the dodgy calculation of 'copy'. (2) If MSG_SPLICE_PAGES, clear copy if it is negative from the above equation, so that 'offset' isn't regressed and 'length' isn't increased, which will mean that length and thus copy should match the amount left in the iterator. (3) When handling MSG_SPLICE_PAGES, give a warning and return -EIO if we're asked to splice more than is in the iterator. It might be better to not give the warning or even just give a 'short' write. [!] Note that this ought to also affect MSG_ZEROCOPY, but MSG_ZEROCOPY avoids the problem by simply assuming that everything asked for got copied, not just the amount that was in the iterator. This is a potential bug for the future. Fixes: 7ac7c987850c ("udp: Convert udp_sendpage() to use MSG_SPLICE_PAGES") Reported-by: syzbot+f527b971b4bdc8e79f9e@syzkaller.appspotmail.com Link: https://lore.kernel.org/r/000000000000881d0606004541d1@google.com/ Signed-off-by: David Howells <dhowells@redhat.com> cc: David Ahern <dsahern@kernel.org> cc: Jens Axboe <axboe@kernel.dk> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/1420063.1690904933@warthog.procyon.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-29net: annotate data-races around sk->sk_priorityEric Dumazet
sk_getsockopt() runs locklessly. This means sk->sk_priority can be read while other threads are changing its value. Other reads also happen without socket lock being held. Add missing annotations where needed. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-29net: annotate data-races around sk->sk_markEric Dumazet
sk->sk_mark is often read while another thread could change the value. Fixes: 4a19ec5800fc ("[NET]: Introducing socket mark socket option.") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-16ip, ip6: Fix splice to raw and ping socketsDavid Howells
Splicing to SOCK_RAW sockets may set MSG_SPLICE_PAGES, but in such a case, __ip_append_data() will call skb_splice_from_iter() to access the 'from' data, assuming it to point to a msghdr struct with an iter, instead of using the provided getfrag function to access it. In the case of raw_sendmsg(), however, this is not the case and 'from' will point to a raw_frag_vec struct and raw_getfrag() will be the frag-getting function. A similar issue may occur with rawv6_sendmsg(). Fix this by ignoring MSG_SPLICE_PAGES if getfrag != ip_generic_getfrag as ip_generic_getfrag() expects "from" to be a msghdr*, but the other getfrags don't. Note that this will prevent MSG_SPLICE_PAGES from being effective for udplite. This likely affects ping sockets too. udplite looks like it should be okay as it expects "from" to be a msghdr. Signed-off-by: David Howells <dhowells@redhat.com> Reported-by: syzbot+d8486855ef44506fd675@syzkaller.appspotmail.com Link: https://lore.kernel.org/r/000000000000ae4cbf05fdeb8349@google.com/ Fixes: 2dc334f1a63a ("splice, net: Use sendmsg(MSG_SPLICE_PAGES) rather than ->sendpage()") Tested-by: syzbot+d8486855ef44506fd675@syzkaller.appspotmail.com cc: David Ahern <dsahern@kernel.org> cc: Jens Axboe <axboe@kernel.dk> cc: Matthew Wilcox <willy@infradead.org> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/1410156.1686729856@warthog.procyon.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-10net: move gso declarations and functions to their own filesEric Dumazet
Move declarations into include/net/gso.h and code into net/core/gso.c Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Stanislav Fomichev <sdf@google.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20230608191738.3947077-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-25net: ipv4: use consistent txhash in TIME_WAIT and SYN_RECVAntoine Tenart
When using IPv4/TCP, skb->hash comes from sk->sk_txhash except in TIME_WAIT and SYN_RECV where it's not set in the reply skb from ip_send_unicast_reply. Those packets will have a mismatched hash with others from the same flow as their hashes will be 0. IPv6 does not have the same issue as the hash is set from the socket txhash in those cases. This commits sets the hash in the reply skb from ip_send_unicast_reply, which makes the IPv4 code behaving like IPv6. Signed-off-by: Antoine Tenart <atenart@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-05-23ip: Remove ip_append_page()David Howells
ip_append_page() is no longer used with the removal of udp_sendpage(), so remove it. Signed-off-by: David Howells <dhowells@redhat.com> cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com> cc: David Ahern <dsahern@kernel.org> cc: Jens Axboe <axboe@kernel.dk> cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-23ip, udp: Support MSG_SPLICE_PAGESDavid Howells
Make IP/UDP sendmsg() support MSG_SPLICE_PAGES. This causes pages to be spliced from the source iterator. This allows ->sendpage() to be replaced by something that can handle multiple multipage folios in a single transaction. Signed-off-by: David Howells <dhowells@redhat.com> cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com> cc: David Ahern <dsahern@kernel.org> cc: Jens Axboe <axboe@kernel.dk> cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-23net: Pass max frags into skb_append_pagefrags()David Howells
Pass the maximum number of fragments into skb_append_pagefrags() rather than using MAX_SKB_FRAGS so that it can be used from code that wants to specify sysctl_max_skb_frags. Signed-off-by: David Howells <dhowells@redhat.com> cc: David Ahern <dsahern@kernel.org> cc: Jens Axboe <axboe@kernel.dk> cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netPaolo Abeni
No conflicts. Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-04-22ipv4: Fix potential uninit variable access bug in __ip_make_skb()Ziyang Xuan
Like commit ea30388baebc ("ipv6: Fix an uninit variable access bug in __ip6_make_skb()"). icmphdr does not in skb linear region under the scenario of SOCK_RAW socket. Access icmp_hdr(skb)->type directly will trigger the uninit variable access bug. Use a local variable icmp_type to carry the correct value in different scenarios. Fixes: 96793b482540 ("[IPV4]: Add ICMPMsgStats MIB (RFC 4293)") Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-21neighbour: switch to standard rcu, instead of rcu_bhEric Dumazet
rcu_bh is no longer a win, especially for objects freed with standard call_rcu(). Switch neighbour code to no longer disable BH when not necessary. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-17inet: preserve const qualifier in inet_sk()Eric Dumazet
We can change inet_sk() to propagate const qualifier of its argument. This should avoid some potential errors caused by accidental (const -> not_const) promotion. Other helpers like tcp_sk(), udp_sk(), raw_sk() will be handled in separate patch series. v2: use container_of_const() as advised by Jakub and Linus Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/netdev/20230315142841.3a2ac99a@kernel.org/ Link: https://lore.kernel.org/netdev/CAHk-=wiOf12nrYEF2vJMcucKjWPN-Ns_SW9fA7LwST_2Dzp7rw@mail.gmail.com/ Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-08net-timestamp: extend SOF_TIMESTAMPING_OPT_ID to HW timestampsVadim Fedorenko
When the feature was added it was enabled for SW timestamps only but with current hardware the same out-of-order timestamps can be seen. Let's expand the area for the feature to all types of timestamps. Signed-off-by: Vadim Fedorenko <vadfed@meta.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-01net: add support for ipv4 big tcpXin Long
Similar to Eric's IPv6 BIG TCP, this patch is to enable IPv4 BIG TCP. Firstly, allow sk->sk_gso_max_size to be set to a value greater than GSO_LEGACY_MAX_SIZE by not trimming gso_max_size in sk_trim_gso_size() for IPv4 TCP sockets. Then on TX path, set IP header tot_len to 0 when skb->len > IP_MAX_MTU in __ip_local_out() to allow to send BIG TCP packets, and this implies that skb->len is the length of a IPv4 packet; On RX path, use skb->len as the length of the IPv4 packet when the IP header tot_len is 0 and skb->len > IP_MAX_MTU in ip_rcv_core(). As the API iph_set_totlen() and skb_ip_totlen() are used in __ip_local_out() and ip_rcv_core(), we only need to update these APIs. Also in GRO receive, add the check for ETH_P_IP/IPPROTO_TCP, and allows the merged packet size >= GRO_LEGACY_MAX_SIZE in skb_gro_receive(). In GRO complete, set IP header tot_len to 0 when the merged packet size greater than IP_MAX_MTU in iph_set_totlen() so that it can be processed on RX path. Note that by checking skb_is_gso_tcp() in API iph_totlen(), it makes this implementation safe to use iph->len == 0 indicates IPv4 BIG TCP packets. Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-11treewide: use get_random_{u8,u16}() when possible, part 1Jason A. Donenfeld
Rather than truncate a 32-bit value to a 16-bit value or an 8-bit value, simply use the get_random_{u8,u16}() functions, which are faster than wasting the additional bytes from a 32-bit value. This was done mechanically with this coccinelle script: @@ expression E; identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32"; typedef u16; typedef __be16; typedef __le16; typedef u8; @@ ( - (get_random_u32() & 0xffff) + get_random_u16() | - (get_random_u32() & 0xff) + get_random_u8() | - (get_random_u32() % 65536) + get_random_u16() | - (get_random_u32() % 256) + get_random_u8() | - (get_random_u32() >> 16) + get_random_u16() | - (get_random_u32() >> 24) + get_random_u8() | - (u16)get_random_u32() + get_random_u16() | - (u8)get_random_u32() + get_random_u8() | - (__be16)get_random_u32() + (__be16)get_random_u16() | - (__le16)get_random_u32() + (__le16)get_random_u16() | - prandom_u32_max(65536) + get_random_u16() | - prandom_u32_max(256) + get_random_u8() | - E->inet_id = get_random_u32() + E->inet_id = get_random_u16() ) @@ identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32"; typedef u16; identifier v; @@ - u16 v = get_random_u32(); + u16 v = get_random_u16(); @@ identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32"; typedef u8; identifier v; @@ - u8 v = get_random_u32(); + u8 v = get_random_u8(); @@ identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32"; typedef u16; u16 v; @@ - v = get_random_u32(); + v = get_random_u16(); @@ identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32"; typedef u8; u8 v; @@ - v = get_random_u32(); + v = get_random_u8(); // Find a potential literal @literal_mask@ expression LITERAL; type T; identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32"; position p; @@ ((T)get_random_u32()@p & (LITERAL)) // Examine limits @script:python add_one@ literal << literal_mask.LITERAL; RESULT; @@ value = None if literal.startswith('0x'): value = int(literal, 16) elif literal[0] in '123456789': value = int(literal, 10) if value is None: print("I don't know how to handle %s" % (literal)) cocci.include_match(False) elif value < 256: coccinelle.RESULT = cocci.make_ident("get_random_u8") elif value < 65536: coccinelle.RESULT = cocci.make_ident("get_random_u16") else: print("Skipping large mask of %s" % (literal)) cocci.include_match(False) // Replace the literal mask with the calculated result. @plus_one@ expression literal_mask.LITERAL; position literal_mask.p; identifier add_one.RESULT; identifier FUNC; @@ - (FUNC()@p & (LITERAL)) + (RESULT() & LITERAL) Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Yury Norov <yury.norov@gmail.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Toke Høiland-Jørgensen <toke@toke.dk> # for sch_cake Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-09-28net: shrink struct ubuf_infoPavel Begunkov
We can benefit from a smaller struct ubuf_info, so leave only mandatory fields and let users to decide how they want to extend it. Convert MSG_ZEROCOPY to struct ubuf_info_msgzc and remove duplicated fields. This reduces the size from 48 bytes to just 16. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-08-30net: unify alloclen calculation for paged requestsPavel Begunkov
Consolidate alloclen and pagedlen calculation for zerocopy and normal paged requests. The current non-zerocopy paged version can a bit overallocate and unnecessary copy a small chunk of data into the linear part. Cc: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/netdev/CA+FuTSf0+cJ9_N_xrHmCGX_KoVCWcE0YQBdtgEkzGvcLMSv7Qw@mail.gmail.com/ Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/b0e4edb7b91f171c7119891d3c61040b8c56596e.1661428921.git.asml.silence@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-08-24net: Fix data-races around sysctl_[rw]mem_(max|default).Kuniyuki Iwashima
While reading sysctl_[rw]mem_(max|default), they can be changed concurrently. Thus, we need to add READ_ONCE() to its readers. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-19Merge branch 'io_uring-zerocopy-send' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/kuba/linux Pavel Begunkov says: ==================== io_uring zerocopy send The patchset implements io_uring zerocopy send. It works with both registered and normal buffers, mixing is allowed but not recommended. Apart from usual request completions, just as with MSG_ZEROCOPY, io_uring separately notifies the userspace when buffers are freed and can be reused (see API design below), which is delivered into io_uring's Completion Queue. Those "buffer-free" notifications are not necessarily per request, but the userspace has control over it and should explicitly attaching a number of requests to a single notification. The series also adds some internal optimisations when used with registered buffers like removing page referencing. From the kernel networking perspective there are two main changes. The first one is passing ubuf_info into the network layer from io_uring (inside of an in kernel struct msghdr). This allows extra optimisations, e.g. ubuf_info caching on the io_uring side, but also helps to avoid cross-referencing and synchronisation problems. The second part is an optional optimisation removing page referencing for requests with registered buffers. Benchmarking UDP with an optimised version of the selftest (see [1]), which sends a bunch of requests, waits for completions and repeats. "+ flush" column posts one additional "buffer-free" notification per request, and just "zc" doesn't post buffer notifications at all. NIC (requests / second): IO size | non-zc | zc | zc + flush 4000 | 495134 | 606420 (+22%) | 558971 (+12%) 1500 | 551808 | 577116 (+4.5%) | 565803 (+2.5%) 1000 | 584677 | 592088 (+1.2%) | 560885 (-4%) 600 | 596292 | 598550 (+0.4%) | 555366 (-6.7%) dummy (requests / second): IO size | non-zc | zc | zc + flush 8000 | 1299916 | 2396600 (+84%) | 2224219 (+71%) 4000 | 1869230 | 2344146 (+25%) | 2170069 (+16%) 1200 | 2071617 | 2361960 (+14%) | 2203052 (+6%) 600 | 2106794 | 2381527 (+13%) | 2195295 (+4%) Previously it also brought a massive performance speedup compared to the msg_zerocopy tool (see [3]), which is probably not super interesting. There is also an additional bunch of refcounting optimisations that was omitted from the series for simplicity and as they don't change the picture drastically, they will be sent as follow up, as well as flushing optimisations closing the performance gap b/w two last columns. For TCP on localhost (with hacks enabling localhost zerocopy) and including additional overhead for receive: IO size | non-zc | zc 1200 | 4174 | 4148 4096 | 7597 | 11228 Using a real NIC 1200 bytes, zc is worse than non-zc ~5-10%, maybe the omitted optimisations will somewhat help, should look better for 4000, but couldn't test properly because of setup problems. Links: liburing (benchmark + tests): [1] https://github.com/isilence/liburing/tree/zc_v4 kernel repo: [2] https://github.com/isilence/linux/tree/zc_v4 RFC v1: [3] https://lore.kernel.org/io-uring/cover.1638282789.git.asml.silence@gmail.com/ RFC v2: https://lore.kernel.org/io-uring/cover.1640029579.git.asml.silence@gmail.com/ Net patches based: git@github.com:isilence/linux.git zc_v4-net-base or https://github.com/isilence/linux/tree/zc_v4-net-base API design overview: The series introduces an io_uring concept of notifactors. From the userspace perspective it's an entity to which it can bind one or more requests and then requesting to flush it. Flushing a notifier makes it impossible to attach new requests to it, and instructs the notifier to post a completion once all requests attached to it are completed and the kernel doesn't need the buffers anymore. Notifications are stored in notification slots, which should be registered as an array in io_uring. Each slot stores only one notifier at any particular moment. Flushing removes it from the slot and the slot automatically replaces it with a new notifier. All operations with notifiers are done by specifying an index of a slot it's currently in. When registering a notification the userspace specifies a u64 tag for each slot, which will be copied in notification completion entries as cqe::user_data. cqe::res is 0 and cqe::flags is equal to wrap around u32 sequence number counting notifiers of a slot. ==================== Link: https://lore.kernel.org/r/cover.1657643355.git.asml.silence@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-19ipv4/udp: support externally provided ubufsPavel Begunkov
Teach ipv4/udp how to use external ubuf_info provided in msghdr and also prepare it for managed frags by sprinkling skb_zcopy_downgrade_managed() when it could mix managed and not managed frags. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-18ipv4: avoid partial copy for zcPavel Begunkov
Even when zerocopy transmission is requested and possible, __ip_append_data() will still copy a small chunk of data just because it allocated some extra linear space (e.g. 148 bytes). It wastes CPU cycles on copy and iter manipulations and also misalignes potentially aligned data. Avoid such copies. And as a bonus we can allocate smaller skb. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-11net: Find dst with sk's xfrm policy not ctl_sksewookseo
If we set XFRM security policy by calling setsockopt with option IPV6_XFRM_POLICY, the policy will be stored in 'sock_policy' in 'sock' struct. However tcp_v6_send_response doesn't look up dst_entry with the actual socket but looks up with tcp control socket. This may cause a problem that a RST packet is sent without ESP encryption & peer's TCP socket can't receive it. This patch will make the function look up dest_entry with actual socket, if the socket has XFRM policy(sock_policy), so that the TCP response packet via this function can be encrypted, & aligned on the encrypted TCP socket. Tested: We encountered this problem when a TCP socket which is encrypted in ESP transport mode encryption, receives challenge ACK at SYN_SENT state. After receiving challenge ACK, TCP needs to send RST to establish the socket at next SYN try. But the RST was not encrypted & peer TCP socket still remains on ESTABLISHED state. So we verified this with test step as below. [Test step] 1. Making a TCP state mismatch between client(IDLE) & server(ESTABLISHED). 2. Client tries a new connection on the same TCP ports(src & dst). 3. Server will return challenge ACK instead of SYN,ACK. 4. Client will send RST to server to clear the SOCKET. 5. Client will retransmit SYN to server on the same TCP ports. [Expected result] The TCP connection should be established. Cc: Maciej Żenczykowski <maze@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: Sehee Lee <seheele@google.com> Signed-off-by: Sewook Seo <sewookseo@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-06-24net: helper function skb_len_addRichard Gobert
Move the len fields manipulation in the skbs to a helper function. There is a comment specifically requesting this and there are several other areas in the code displaying the same pattern which can be refactored. This improves code readability. Signed-off-by: Richard Gobert <richardbgobert@gmail.com> Link: https://lore.kernel.org/r/20220622160853.GA6478@debian Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-03-03net: Set skb->mono_delivery_time and clear it after sch_handle_ingress()Martin KaFai Lau
The previous patches handled the delivery_time before sch_handle_ingress(). This patch can now set the skb->mono_delivery_time to flag the skb->tstamp is used as the mono delivery_time (EDT) instead of the (rcv) timestamp and also clear it with skb_clear_delivery_time() after sch_handle_ingress(). This will make the bpf_redirect_*() to keep the mono delivery_time and used by a qdisc (fq) of the egress-ing interface. A latter patch will postpone the skb_clear_delivery_time() until the stack learns that the skb is being delivered locally and that will make other kernel forwarding paths (ip[6]_forward) able to keep the delivery_time also. Thus, like the previous patches on using the skb->mono_delivery_time bit, calling skb_clear_delivery_time() is not limited within the CONFIG_NET_INGRESS to avoid too many code churns among this set. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-03-03net: Add skb->mono_delivery_time to distinguish mono delivery_time from ↵Martin KaFai Lau
(rcv) timestamp skb->tstamp was first used as the (rcv) timestamp. The major usage is to report it to the user (e.g. SO_TIMESTAMP). Later, skb->tstamp is also set as the (future) delivery_time (e.g. EDT in TCP) during egress and used by the qdisc (e.g. sch_fq) to make decision on when the skb can be passed to the dev. Currently, there is no way to tell skb->tstamp having the (rcv) timestamp or the delivery_time, so it is always reset to 0 whenever forwarded between egress and ingress. While it makes sense to always clear the (rcv) timestamp in skb->tstamp to avoid confusing sch_fq that expects the delivery_time, it is a performance issue [0] to clear the delivery_time if the skb finally egress to a fq@phy-dev. For example, when forwarding from egress to ingress and then finally back to egress: tcp-sender => veth@netns => veth@hostns => fq@eth0@hostns ^ ^ reset rest This patch adds one bit skb->mono_delivery_time to flag the skb->tstamp is storing the mono delivery_time (EDT) instead of the (rcv) timestamp. The current use case is to keep the TCP mono delivery_time (EDT) and to be used with sch_fq. A latter patch will also allow tc-bpf@ingress to read and change the mono delivery_time. In the future, another bit (e.g. skb->user_delivery_time) can be added for the SCM_TXTIME where the clock base is tracked by sk->sk_clockid. [ This patch is a prep work. The following patches will get the other parts of the stack ready first. Then another patch after that will finally set the skb->mono_delivery_time. ] skb_set_delivery_time() function is added. It is used by the tcp_output.c and during ip[6] fragmentation to assign the delivery_time to the skb->tstamp and also set the skb->mono_delivery_time. A note on the change in ip_send_unicast_reply() in ip_output.c. It is only used by TCP to send reset/ack out of a ctl_sk. Like the new skb_set_delivery_time(), this patch sets the skb->mono_delivery_time to 0 for now as a place holder. It will be enabled in a latter patch. A similar case in tcp_ipv6 can be done with skb_set_delivery_time() in tcp_v6_send_response(). [0] (slide 22): https://linuxplumbersconf.org/event/11/contributions/953/attachments/867/1658/LPC_2021_BPF_Datapath_Extensions.pdf Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-26net: ip: add skb drop reasons for ip egress pathMenglong Dong
Replace kfree_skb() which is used in the packet egress path of IP layer with kfree_skb_reason(). Functions that are involved include: __ip_queue_xmit() ip_finish_output() ip_mc_finish_output() ip6_output() ip6_finish_output() ip6_finish_output2() Following new drop reasons are introduced: SKB_DROP_REASON_IP_OUTNOROUTES SKB_DROP_REASON_BPF_CGROUP_EGRESS SKB_DROP_REASON_IPV6DISABLED SKB_DROP_REASON_NEIGH_CREATEFAIL Reviewed-by: Mengen Sun <mengensun@tencent.com> Reviewed-by: Hao Peng <flyingpeng@tencent.com> Signed-off-by: Menglong Dong <imagedong@tencent.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-24Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
tools/testing/selftests/net/mptcp/mptcp_join.sh 34aa6e3bccd8 ("selftests: mptcp: add ip mptcp wrappers") 857898eb4b28 ("selftests: mptcp: add missing join check") 6ef84b1517e0 ("selftests: mptcp: more robust signal race test") https://lore.kernel.org/all/20220221131842.468893-1-broonie@kernel.org/ drivers/net/ethernet/mellanox/mlx5/core/en/tc/act/act.h drivers/net/ethernet/mellanox/mlx5/core/en/tc/act/ct.c fb7e76ea3f3b6 ("net/mlx5e: TC, Skip redundant ct clear actions") c63741b426e11 ("net/mlx5e: Fix MPLSoUDP encap to use MPLS action information") 09bf97923224f ("net/mlx5e: TC, Move pedit_headers_action to parse_attr") 84ba8062e383 ("net/mlx5e: Test CT and SAMPLE on flow attr") efe6f961cd2e ("net/mlx5e: CT, Don't set flow flag CT for ct clear flow") 3b49a7edec1d ("net/mlx5e: TC, Reject rules with multiple CT actions") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-18net-timestamp: convert sk->sk_tskey to atomic_tEric Dumazet
UDP sendmsg() can be lockless, this is causing all kinds of data races. This patch converts sk->sk_tskey to remove one of these races. BUG: KCSAN: data-race in __ip_append_data / __ip_append_data read to 0xffff8881035d4b6c of 4 bytes by task 8877 on cpu 1: __ip_append_data+0x1c1/0x1de0 net/ipv4/ip_output.c:994 ip_make_skb+0x13f/0x2d0 net/ipv4/ip_output.c:1636 udp_sendmsg+0x12bd/0x14c0 net/ipv4/udp.c:1249 inet_sendmsg+0x5f/0x80 net/ipv4/af_inet.c:819 sock_sendmsg_nosec net/socket.c:705 [inline] sock_sendmsg net/socket.c:725 [inline] ____sys_sendmsg+0x39a/0x510 net/socket.c:2413 ___sys_sendmsg net/socket.c:2467 [inline] __sys_sendmmsg+0x267/0x4c0 net/socket.c:2553 __do_sys_sendmmsg net/socket.c:2582 [inline] __se_sys_sendmmsg net/socket.c:2579 [inline] __x64_sys_sendmmsg+0x53/0x60 net/socket.c:2579 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae write to 0xffff8881035d4b6c of 4 bytes by task 8880 on cpu 0: __ip_append_data+0x1d8/0x1de0 net/ipv4/ip_output.c:994 ip_make_skb+0x13f/0x2d0 net/ipv4/ip_output.c:1636 udp_sendmsg+0x12bd/0x14c0 net/ipv4/udp.c:1249 inet_sendmsg+0x5f/0x80 net/ipv4/af_inet.c:819 sock_sendmsg_nosec net/socket.c:705 [inline] sock_sendmsg net/socket.c:725 [inline] ____sys_sendmsg+0x39a/0x510 net/socket.c:2413 ___sys_sendmsg net/socket.c:2467 [inline] __sys_sendmmsg+0x267/0x4c0 net/socket.c:2553 __do_sys_sendmmsg net/socket.c:2582 [inline] __se_sys_sendmmsg net/socket.c:2579 [inline] __x64_sys_sendmmsg+0x53/0x60 net/socket.c:2579 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae value changed: 0x0000054d -> 0x0000054e Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 8880 Comm: syz-executor.5 Not tainted 5.17.0-rc2-syzkaller-00167-gdcb85f85fa6f-dirty #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Fixes: 09c2d251b707 ("net-timestamp: add key to disambiguate concurrent datagrams") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-29ipv4: drop fragmentation code from ip_options_build()Jakub Kicinski
Since v2.5.44 and addition of ip_options_fragment() ip_options_build() does not render headers for fragments directly. @is_frag is always 0. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>