summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2022-04-25tcp: fix potential xmit stalls caused by TCP_NOTSENT_LOWATEric Dumazet
I had this bug sitting for too long in my pile, it is time to fix it. Thanks to Doug Porter for reminding me of it! We had various attempts in the past, including commit 0cbe6a8f089e ("tcp: remove SOCK_QUEUE_SHRUNK"), but the issue is that TCP stack currently only generates EPOLLOUT from input path, when tp->snd_una has advanced and skb(s) cleaned from rtx queue. If a flow has a big RTT, and/or receives SACKs, it is possible that the notsent part (tp->write_seq - tp->snd_nxt) reaches 0 and no more data can be sent until tp->snd_una finally advances. What is needed is to also check if POLLOUT needs to be generated whenever tp->snd_nxt is advanced, from output path. This bug triggers more often after an idle period, as we do not receive ACK for at least one RTT. tcp_notsent_lowat could be a fraction of what CWND and pacing rate would allow to send during this RTT. In a followup patch, I will remove the bogus call to tcp_chrono_stop(sk, TCP_CHRONO_SNDBUF_LIMITED) from tcp_check_space(). Fact that we have decided to generate an EPOLLOUT does not mean the application has immediately refilled the transmit queue. This optimistic call might have been the reason the bug seemed not too serious. Tested: 200 ms rtt, 1% packet loss, 32 MB tcp_rmem[2] and tcp_wmem[2] $ echo 500000 >/proc/sys/net/ipv4/tcp_notsent_lowat $ cat bench_rr.sh SUM=0 for i in {1..10} do V=`netperf -H remote_host -l30 -t TCP_RR -- -r 10000000,10000 -o LOCAL_BYTES_SENT | egrep -v "MIGRATED|Bytes"` echo $V SUM=$(($SUM + $V)) done echo SUM=$SUM Before patch: $ bench_rr.sh 130000000 80000000 140000000 140000000 140000000 140000000 130000000 40000000 90000000 110000000 SUM=1140000000 After patch: $ bench_rr.sh 430000000 590000000 530000000 450000000 450000000 350000000 450000000 490000000 480000000 460000000 SUM=4680000000 # This is 410 % of the value before patch. Fixes: c9bee3b7fdec ("tcp: TCP_NOTSENT_LOWAT socket option") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Doug Porter <dsp@fb.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Cc: Neal Cardwell <ncardwell@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-25arp: fix unused variable warnning when CONFIG_PROC_FS=nYajun Deng
net/ipv4/arp.c:1412:36: warning: unused variable 'arp_seq_ops' [-Wunused-const-variable] Add #ifdef CONFIG_PROC_FS for 'arp_seq_ops'. Fixes: e968b1b3e9b8 ("arp: Remove #ifdef CONFIG_PROC_FS") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Yajun Deng <yajun.deng@linux.dev> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-25net: dsa: flood multicast to CPU when slave has IFF_PROMISCVladimir Oltean
Certain DSA switches can eliminate flooding to the CPU when none of the ports have the IFF_ALLMULTI or IFF_PROMISC flags set. This is done by synthesizing a call to dsa_port_bridge_flags() for the CPU port, a call which normally comes from the bridge driver via switchdev. The bridge port flags and IFF_PROMISC|IFF_ALLMULTI have slightly different semantics, and due to inattention/lack of proper testing, the IFF_PROMISC flag allows unknown unicast to be flooded to the CPU, but not unknown multicast. This must be fixed by setting both BR_FLOOD (unicast) and BR_MCAST_FLOOD in the synthesized dsa_port_bridge_flags() call, since IFF_PROMISC means that packets should not be filtered regardless of their MAC DA. Fixes: 7569459a52c9 ("net: dsa: manage flooding on the CPU ports") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-25ip_gre, ip6_gre: Fix race condition on o_seqno in collect_md modePeilin Ye
As pointed out by Jakub Kicinski, currently using TUNNEL_SEQ in collect_md mode is racy for [IP6]GRE[TAP] devices. Consider the following sequence of events: 1. An [IP6]GRE[TAP] device is created in collect_md mode using "ip link add ... external". "ip" ignores "[o]seq" if "external" is specified, so TUNNEL_SEQ is off, and the device is marked as NETIF_F_LLTX (i.e. it uses lockless TX); 2. Someone sets TUNNEL_SEQ on outgoing skb's, using e.g. bpf_skb_set_tunnel_key() in an eBPF program attached to this device; 3. gre_fb_xmit() or __gre6_xmit() processes these skb's: gre_build_header(skb, tun_hlen, flags, protocol, tunnel_id_to_key32(tun_info->key.tun_id), (flags & TUNNEL_SEQ) ? htonl(tunnel->o_seqno++) : 0); ^^^^^^^^^^^^^^^^^ Since we are not using the TX lock (&txq->_xmit_lock), multiple CPUs may try to do this tunnel->o_seqno++ in parallel, which is racy. Fix it by making o_seqno atomic_t. As mentioned by Eric Dumazet in commit b790e01aee74 ("ip_gre: lockless xmit"), making o_seqno atomic_t increases "chance for packets being out of order at receiver" when NETIF_F_LLTX is on. Maybe a better fix would be: 1. Do not ignore "oseq" in external mode. Users MUST specify "oseq" if they want the kernel to allow sequencing of outgoing packets; 2. Reject all outgoing TUNNEL_SEQ packets if the device was not created with "oseq". Unfortunately, that would break userspace. We could now make [IP6]GRE[TAP] devices always NETIF_F_LLTX, but let us do it in separate patches to keep this fix minimal. Suggested-by: Jakub Kicinski <kuba@kernel.org> Fixes: 77a5196a804e ("gre: add sequence number for collect md mode.") Signed-off-by: Peilin Ye <peilin.ye@bytedance.com> Acked-by: William Tu <u9012063@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-25ip6_gre: Make o_seqno start from 0 in native modePeilin Ye
For IP6GRE and IP6GRETAP devices, currently o_seqno starts from 1 in native mode. According to RFC 2890 2.2., "The first datagram is sent with a sequence number of 0." Fix it. It is worth mentioning that o_seqno already starts from 0 in collect_md mode, see the "if (tunnel->parms.collect_md)" clause in __gre6_xmit(), where tunnel->o_seqno is passed to gre_build_header() before getting incremented. Fixes: c12b395a4664 ("gre: Support GRE over IPv6") Signed-off-by: Peilin Ye <peilin.ye@bytedance.com> Acked-by: William Tu <u9012063@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-25ip_gre: Make o_seqno start from 0 in native modePeilin Ye
For GRE and GRETAP devices, currently o_seqno starts from 1 in native mode. According to RFC 2890 2.2., "The first datagram is sent with a sequence number of 0." Fix it. It is worth mentioning that o_seqno already starts from 0 in collect_md mode, see gre_fb_xmit(), where tunnel->o_seqno is passed to gre_build_header() before getting incremented. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Peilin Ye <peilin.ye@bytedance.com> Acked-by: William Tu <u9012063@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-25net/smc: sync err code when tcp connection was refusedliuyacan
In the current implementation, when TCP initiates a connection to an unavailable [ip,port], ECONNREFUSED will be stored in the TCP socket, but SMC will not. However, some apps (like curl) use getsockopt(,,SO_ERROR,,) to get the error information, which makes them miss the error message and behave strangely. Fixes: 50717a37db03 ("net/smc: nonblocking connect rework") Signed-off-by: liuyacan <liuyacan@corp.netease.com> Reviewed-by: Tony Lu <tonylu@linux.alibaba.com> Acked-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-25Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nfDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for net: 1) Fix incorrect printing of memory size of IPVS connection hash table, from Pengcheng Yang. 2) Fix spurious EEXIST errors in nft_set_rbtree. 3) Remove leftover empty flowtable file, from Rongguang Wei. 4) Fix ip6_route_me_harder() with vrf driver, from Martin Willi. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-25devlink: introduce line card device info infrastructureJiri Pirko
Extend the line card info message with information (e.g., FW version) about devices found on the line card. Example: $ devlink lc info pci/0000:01:00.0 lc 8 pci/0000:01:00.0: lc 8 versions: fixed: hw.revision 0 running: ini.version 4 devices: device 0 versions: running: fw 19.2010.1310 device 1 versions: running: fw 19.2010.1310 device 2 versions: running: fw 19.2010.1310 device 3 versions: running: fw 19.2010.1310 Signed-off-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-25devlink: introduce line card info get messageJiri Pirko
Allow the driver to provide per line card info get op to fill-up info, similar to the "devlink dev info". Example: $ devlink lc info pci/0000:01:00.0 lc 8 pci/0000:01:00.0: lc 8 versions: fixed: hw.revision 0 running: ini.version 4 Signed-off-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-25devlink: introduce line card devices supportJiri Pirko
Line card can contain one or more devices that makes sense to make visible to the user. For example, this can be a gearbox with flash memory, which could be updated. Provide the driver possibility to attach such devices to a line card and expose those to user. Example: $ devlink lc show pci/0000:01:00.0 lc 8 pci/0000:01:00.0: lc 8 state active type 16x100G supported_types: 16x100G devices: device 0 device 1 device 2 device 3 Signed-off-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-25netfilter: Update ip6_route_me_harder to consider L3 domainMartin Willi
The commit referenced below fixed packet re-routing if Netfilter mangles a routing key property of a packet and the packet is routed in a VRF L3 domain. The fix, however, addressed IPv4 re-routing, only. This commit applies the same behavior for IPv6. While at it, untangle the nested ternary operator to make the code more readable. Fixes: 6d8b49c3a3a3 ("netfilter: Update ip_route_me_harder to consider L3 domain") Cc: stable@vger.kernel.org Signed-off-by: Martin Willi <martin@strongswan.org> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-04-25libceph: disambiguate cluster/pool full log messageIlya Dryomov
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-04-25netfilter: flowtable: Remove the empty fileRongguang Wei
CONFIG_NF_FLOW_TABLE_IPV4 is already removed and the real user is also removed(nf_flow_table_ipv4.c is empty). Fixes: c42ba4290b2147aa ("netfilter: flowtable: remove ipv4/ipv6 modules") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-04-24net: add __sys_socket_file()Jens Axboe
This works like __sys_socket(), except instead of allocating and returning a socket fd, it just returns the file associated with the socket. No fd is installed into the process file table. This is similar to do_accept(), and allows io_uring to use this without instantiating a file descriptor in the process file table. Signed-off-by: Jens Axboe <axboe@kernel.dk> Acked-by: David S. Miller <davem@davemloft.net> Link: https://lore.kernel.org/r/20220412202240.234207-2-axboe@kernel.dk
2022-04-23sctp: check asoc strreset_chunk in sctp_generate_reconf_eventXin Long
A null pointer reference issue can be triggered when the response of a stream reconf request arrives after the timer is triggered, such as: send Incoming SSN Reset Request ---> CPU0: reconf timer is triggered, go to the handler code before hold sk lock <--- reply with Outgoing SSN Reset Request CPU1: process Outgoing SSN Reset Request, and set asoc->strreset_chunk to NULL CPU0: continue the handler code, hold sk lock, and try to hold asoc->strreset_chunk, crash! In Ying Xu's testing, the call trace is: [ ] BUG: kernel NULL pointer dereference, address: 0000000000000010 [ ] RIP: 0010:sctp_chunk_hold+0xe/0x40 [sctp] [ ] Call Trace: [ ] <IRQ> [ ] sctp_sf_send_reconf+0x2c/0x100 [sctp] [ ] sctp_do_sm+0xa4/0x220 [sctp] [ ] sctp_generate_reconf_event+0xbd/0xe0 [sctp] [ ] call_timer_fn+0x26/0x130 This patch is to fix it by returning from the timer handler if asoc strreset_chunk is already set to NULL. Fixes: 7b9438de0cd4 ("sctp: add stream reconf timer") Reported-by: Ying Xu <yinxu@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-23mptcp: add mib for infinite map sendingGeliang Tang
This patch adds a new mib named MPTCP_MIB_INFINITEMAPTX, increase it when a infinite mapping has been sent out. Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-23mptcp: infinite mapping receivingGeliang Tang
This patch adds the infinite mapping receiving logic. When the infinite mapping is received, set the map_data_len of the subflow to 0. In subflow_check_data_avail(), only reset the subflow when the map_data_len of the subflow is non-zero. Suggested-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-23mptcp: infinite mapping sendingGeliang Tang
This patch adds the infinite mapping sending logic. Add a new flag send_infinite_map in struct mptcp_subflow_context. Set it true when a single contiguous subflow is in use and the allow_infinite_fallback flag is true in mptcp_pm_mp_fail_received(). In mptcp_sendmsg_frag(), if this flag is true, call the new function mptcp_update_infinite_map() to set the infinite mapping. Add a new flag infinite_map in struct mptcp_ext, set it true in mptcp_update_infinite_map(), and check this flag in a new helper mptcp_check_infinite_map(). In mptcp_update_infinite_map(), set data_len to 0, and clear the send_infinite_map flag, then do fallback. In mptcp_established_options(), use the helper mptcp_check_infinite_map() to let the infinite mapping DSS can be sent out in the fallback mode. Suggested-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-23mptcp: track and update contiguous data statusGeliang Tang
This patch adds a new member allow_infinite_fallback in mptcp_sock, which is initialized to 'true' when the connection begins and is set to 'false' on any retransmit or successful MP_JOIN. Only do infinite mapping fallback if there is a single subflow AND there have been no retransmissions AND there have never been any MP_JOINs. Suggested-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-23mptcp: add the fallback checkGeliang Tang
This patch adds the fallback check in subflow_check_data_avail(). Only do the fallback when the msk hasn't fallen back yet. Suggested-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-23mptcp: don't send RST for single subflowGeliang Tang
When a bad checksum is detected and a single subflow is in use, don't send RST + MP_FAIL, send data_ack + MP_FAIL instead. So invoke tcp_send_active_reset() only when mptcp_has_another_subflow() is true. Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-22tcp: ensure to use the most recently sent skb when filling the rate samplePengcheng Yang
If an ACK (s)acks multiple skbs, we favor the information from the most recently sent skb by choosing the skb with the highest prior_delivered count. But in the interval between receiving ACKs, we send multiple skbs with the same prior_delivered, because the tp->delivered only changes when we receive an ACK. We used RACK's solution, copying tcp_rack_sent_after() as tcp_skb_sent_after() helper to determine "which packet was sent last?". Later, we will use tcp_skb_sent_after() instead in RACK. Fixes: b9f64820fb22 ("tcp: track data delivery rate for a TCP connection") Signed-off-by: Pengcheng Yang <yangpc@wangsu.com> Cc: Paolo Abeni <pabeni@redhat.com> Acked-by: Neal Cardwell <ncardwell@google.com> Tested-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/1650422081-22153-1-git-send-email-yangpc@wangsu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-04-22net: bridge: switchdev: check br_vlan_group() return valueClément Léger
br_vlan_group() can return NULL and thus return value must be checked to avoid dereferencing a NULL pointer. Fixes: 6284c723d9b9 ("net: bridge: mst: Notify switchdev drivers of VLAN MSTI migrations") Signed-off-by: Clément Léger <clement.leger@bootlin.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://lore.kernel.org/r/20220421101247.121896-1-clement.leger@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-04-22tcp: md5: incorrect tcp_header_len for incoming connectionsFrancesco Ruggeri
In tcp_create_openreq_child we adjust tcp_header_len for md5 using the remote address in newsk. But that address is still 0 in newsk at this point, and it is only set later by the callers (tcp_v[46]_syn_recv_sock). Use the address from the request socket instead. Fixes: cfb6eeb4c860 ("[TCP]: MD5 Signature Option (RFC2385) support.") Signed-off-by: Francesco Ruggeri <fruggeri@arista.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20220421005026.686A45EC01F2@us226.sjc.aristanetworks.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-04-22bpf, lwt: Fix crash when using bpf_skb_set_tunnel_key() from bpf_xmit lwt hookEyal Birger
xmit_check_hhlen() observes the dst for getting the device hard header length to make sure a modified packet can fit. When a helper which changes the dst - such as bpf_skb_set_tunnel_key() - is called as part of the xmit program the accessed dst is no longer valid. This leads to the following splat: BUG: kernel NULL pointer dereference, address: 00000000000000de #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT SMP PTI CPU: 0 PID: 798 Comm: ping Not tainted 5.18.0-rc2+ #103 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014 RIP: 0010:bpf_xmit+0xfb/0x17f Code: c6 c0 4d cd 8e 48 c7 c7 7d 33 f0 8e e8 42 09 fb ff 48 8b 45 58 48 8b 95 c8 00 00 00 48 2b 95 c0 00 00 00 48 83 e0 fe 48 8b 00 <0f> b7 80 de 00 00 00 39 c2 73 22 29 d0 b9 20 0a 00 00 31 d2 48 89 RSP: 0018:ffffb148c0bc7b98 EFLAGS: 00010282 RAX: 0000000000000000 RBX: 0000000000240008 RCX: 0000000000000000 RDX: 0000000000000010 RSI: 00000000ffffffea RDI: 00000000ffffffff RBP: ffff922a828a4e00 R08: ffffffff8f1350e8 R09: 00000000ffffdfff R10: ffffffff8f055100 R11: ffffffff8f105100 R12: 0000000000000000 R13: ffff922a828a4e00 R14: 0000000000000040 R15: 0000000000000000 FS: 00007f414e8f0080(0000) GS:ffff922afdc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000000000de CR3: 0000000002d80006 CR4: 0000000000370ef0 Call Trace: <TASK> lwtunnel_xmit.cold+0x71/0xc8 ip_finish_output2+0x279/0x520 ? __ip_finish_output.part.0+0x21/0x130 Fix by fetching the device hard header length before running the BPF code. Fixes: 3a0af8fd61f9 ("bpf: BPF for lightweight tunnel infrastructure") Signed-off-by: Eyal Birger <eyal.birger@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220420165219.1755407-1-eyal.birger@gmail.com
2022-04-22SUNRPC release the transport of a relocated task with an assigned transportOlga Kornievskaia
A relocated task must release its previous transport. Fixes: 82ee41b85cef1 ("SUNRPC don't resend a task on an offlined transport") Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-04-22netfilter: nft_set_rbtree: overlap detection with element re-addition after ↵Pablo Neira Ayuso
deletion This patch fixes spurious EEXIST errors. Extend d2df92e98a34 ("netfilter: nft_set_rbtree: handle element re-addition after deletion") to deal with elements with same end flags in the same transation. Reset the overlap flag as described by 7c84d41416d8 ("netfilter: nft_set_rbtree: Detect partial overlaps on insertion"). Fixes: 7c84d41416d8 ("netfilter: nft_set_rbtree: Detect partial overlaps on insertion") Fixes: d2df92e98a34 ("netfilter: nft_set_rbtree: handle element re-addition after deletion") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-04-22net: dsa: Add missing of_node_put() in dsa_port_link_register_ofMiaoqian Lin
The device_node pointer is returned by of_parse_phandle() with refcount incremented. We should use of_node_put() on it when done. of_node_put() will check for NULL value. Fixes: a20f997010c4 ("net: dsa: Don't instantiate phylink for CPU/DSA ports unless needed") Signed-off-by: Miaoqian Lin <linmq006@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-22ipv4: Initialise ->flowi4_scope properly in ICMP handlers.Guillaume Nault
All the *_redirect() and *_update_pmtu() functions initialise their struct flowi4 variable with either __build_flow_key() or build_sk_flow_key(). When sk is provided, these functions use RT_CONN_FLAGS() to set ->flowi4_tos and always use RT_SCOPE_UNIVERSE for ->flowi4_scope. Then they rely on ip_rt_fix_tos() to adjust the scope based on the RTO_ONLINK bit and to mask the tos with IPTOS_RT_MASK. This patch modifies __build_flow_key() and build_sk_flow_key() to properly initialise ->flowi4_tos and ->flowi4_scope, so that the ICMP redirects and PMTU handlers don't need an extra call to ip_rt_fix_tos() before doing a fib lookup. That is, we: * Drop RT_CONN_FLAGS(): use ip_sock_rt_tos() and ip_sock_rt_scope() instead, so that we don't have to rely on ip_rt_fix_tos() to adjust the scope anymore. * Apply IPTOS_RT_MASK to the tos, so that we don't need ip_rt_fix_tos() to do it for us. * Drop the ip_rt_fix_tos() calls that now become useless. The only remaining ip_rt_fix_tos() caller is ip_route_output_key_hash() which needs it as long as external callers still use the RTO_ONLINK flag. Note: This patch also drops some useless RT_TOS() calls as IPTOS_RT_MASK is a stronger mask. Signed-off-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-22ipv4: Avoid using RTO_ONLINK with ip_route_connect().Guillaume Nault
Now that ip_rt_fix_tos() doesn't reset ->flowi4_scope unconditionally, we don't have to rely on the RTO_ONLINK bit to properly set the scope of a flowi4 structure. We can just set ->flowi4_scope explicitly and avoid using RTO_ONLINK in ->flowi4_tos. This patch converts callers of ip_route_connect(). Instead of setting the tos parameter with RT_CONN_FLAGS(sk), as all callers do, we can: 1- Drop the tos parameter from ip_route_connect(): its value was entirely based on sk, which is also passed as parameter. 2- Set ->flowi4_scope depending on the SOCK_LOCALROUTE socket option instead of always initialising it with RT_SCOPE_UNIVERSE (let's define ip_sock_rt_scope() for this purpose). 3- Avoid overloading ->flowi4_tos with RTO_ONLINK: since the scope is now properly initialised, we don't need to tell ip_rt_fix_tos() to adjust ->flowi4_scope for us. So let's define ip_sock_rt_tos(), which is the same as RT_CONN_FLAGS() but without the RTO_ONLINK bit overload. Note: In the original ip_route_connect() code, __ip_route_output_key() might clear the RTO_ONLINK bit of fl4->flowi4_tos (because of ip_rt_fix_tos()). Therefore flowi4_update_output() had to reuse the original tos variable. Now that we don't set RTO_ONLINK any more, this is not a problem and we can use fl4->flowi4_tos in flowi4_update_output(). Signed-off-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-22ipv4: Don't reset ->flowi4_scope in ip_rt_fix_tos().Guillaume Nault
All callers already initialise ->flowi4_scope with RT_SCOPE_UNIVERSE, either by manual field assignment, memset(0) of the whole structure or implicit structure initialisation of on-stack variables (RT_SCOPE_UNIVERSE actually equals 0). Therefore, we don't need to always initialise ->flowi4_scope in ip_rt_fix_tos(). We only need to reduce the scope to RT_SCOPE_LINK when the special RTO_ONLINK flag is present in the tos. This will allow some code simplification, like removing ip_rt_fix_tos(). Also, the long term idea is to remove RTO_ONLINK entirely by properly initialising ->flowi4_scope, instead of overloading ->flowi4_tos with a special flag. Eventually, this will allow to convert ->flowi4_tos to dscp_t. Signed-off-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-22ipv6: Use ipv6_only_sock() helper in condition.Kuniyuki Iwashima
This patch replaces some sk_ipv6only tests with ipv6_only_sock(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-22ipv6: Remove __ipv6_only_sock().Kuniyuki Iwashima
Since commit 9fe516ba3fb2 ("inet: move ipv6only in sock_common"), ipv6_only_sock() and __ipv6_only_sock() are the same macro. Let's remove the one. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-22Revert "rtnetlink: return EINVAL when request cannot succeed"Florent Fourcot
This reverts commit b6177d3240a4 ip-link command is testing kernel capability by sending a RTM_NEWLINK request, without any argument. It accepts everything in reply, except EOPNOTSUPP and EINVAL (functions iplink_have_newlink / accept_msg) So we must keep compatiblity here, invalid empty message should not return EINVAL Signed-off-by: Florent Fourcot <florent.fourcot@wifirst.fr> Tested-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-22net/ipv6: Enforce limits for accept_unsolicited_na sysctlArun Ajith S
Fix mistake in the original patch where limits were specified but the handler didn't take care of the limits. Signed-off-by: Arun Ajith S <aajith@arista.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-22batman-adv: remove unnecessary type castingsYu Zhe
remove unnecessary void* type castings. Signed-off-by: Yu Zhe <yuzhe@nfschina.com> [sven@narfation.org: Fix missing const in batadv_choose* functions] Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2022-04-22batman-adv: Start new development cycleSimon Wunderlich
This version will contain all the (major or even only minor) changes for Linux 5.19. The version number isn't a semantic version number with major and minor information. It is just encoding the year of the expected publishing as Linux -rc1 and the number of published versions this year (starting at 0). Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2022-04-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netPaolo Abeni
drivers/net/ethernet/microchip/lan966x/lan966x_main.c d08ed852560e ("net: lan966x: Make sure to release ptp interrupt") c8349639324a ("net: lan966x: Add FDMA functionality") Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-04-20net: Change skb_ensure_writable()'s write_len param to unsigned int typeLiu Jian
Both pskb_may_pull() and skb_clone_writable()'s length parameters are of type unsigned int already. Therefore, change this function's write_len param to unsigned int type. Signed-off-by: Liu Jian <liujian56@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20220416105801.88708-3-liujian56@huawei.com
2022-04-20bpf: Enlarge offset check value to INT_MAX in bpf_skb_{load,store}_bytesLiu Jian
The data length of skb frags + frag_list may be greater than 0xffff, and skb_header_pointer can not handle negative offset. So, here INT_MAX is used to check the validity of offset. Add the same change to the related function skb_store_bytes. Fixes: 05c74e5e53f6 ("bpf: add bpf_skb_load_bytes helper") Signed-off-by: Liu Jian <liujian56@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20220416105801.88708-2-liujian56@huawei.com
2022-04-20net/sched: flower: Consider the number of tags for vlan filtersBoris Sukholitko
Before this patch the existence of vlan filters was conditional on the vlan protocol being matched in the tc rule. For example, the following rule: tc filter add dev eth1 ingress flower vlan_prio 5 was illegal because vlan protocol (e.g. 802.1q) does not appear in the rule. Remove the above restriction by looking at the num_of_vlans filter to allow further matching on vlan attributes. The following rule becomes legal as a result of this commit: tc filter add dev eth1 ingress flower num_of_vlans 1 vlan_prio 5 because having num_of_vlans==1 implies that the packet is single tagged. Change is_vlan_key helper to look at the number of vlans in addition to the vlan ethertype. The outcome of this change is that outer (e.g. vlan_prio) and inner (e.g. cvlan_prio) tag vlan filters require the number of vlan tags to be greater then 0 and 1 accordingly. As a result of is_vlan_key change, the ethertype may be set to 0 when matching on the number of vlans. Update fl_set_key_vlan to avoid setting key, mask vlan_tpid for the 0 ethertype. Signed-off-by: Boris Sukholitko <boris.sukholitko@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-20net/sched: flower: Add number of vlan tags filterBoris Sukholitko
These are bookkeeping parts of the new num_of_vlans filter. Defines, dump, load and set are being done here. Signed-off-by: Boris Sukholitko <boris.sukholitko@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-20flow_dissector: Add number of vlan tags dissectorBoris Sukholitko
Our customers in the fiber telecom world have network configurations where they would like to control their traffic according to the number of tags appearing in the packet. For example, TR247 GPON conformance test suite specification mostly talks about untagged, single, double tagged packets and gives lax guidelines on the vlan protocol vs. number of vlan tags. This is different from the common IT networks where 802.1Q and 802.1ad protocols are usually describe single and double tagged packet. GPON configurations that we work with have arbitrary mix the above protocols and number of vlan tags in the packet. The goal is to make the following TC commands possible: tc filter add dev eth1 ingress flower \ num_of_vlans 1 vlan_prio 5 action drop From our logs, we have redirect rules such that: tc filter add dev $GPON ingress flower num_of_vlans $N \ action mirred egress redirect dev $DEV where N can range from 0 to 3 and $DEV is the function of $N. Also there are rules setting skb mark based on the number of vlans: tc filter add dev $GPON ingress flower num_of_vlans $N vlan_prio \ $P action skbedit mark $M This new dissector allows extracting the number of vlan tags existing in the packet. Signed-off-by: Boris Sukholitko <boris.sukholitko@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-20net/sched: flower: Reduce identation after is_key_vlan refactoringBoris Sukholitko
Whitespace only. Signed-off-by: Boris Sukholitko <boris.sukholitko@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-20net/sched: flower: Helper function for vlan ethtype checksBoris Sukholitko
There are somewhat repetitive ethertype checks in fl_set_key. Refactor them into is_vlan_key helper function. To make the changes clearer, avoid touching identation levels. This is the job for the next patch in the series. Signed-off-by: Boris Sukholitko <boris.sukholitko@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-20net: dsa: don't emit targeted cross-chip notifiers for MTU changeVladimir Oltean
A cross-chip notifier with "targeted_match=true" is one that matches only the local port of the switch that emitted it. In other words, passing through the cross-chip notifier layer serves no purpose. Eliminate this concept by calling directly ds->ops->port_change_mtu instead of emitting a targeted cross-chip notifier. This leaves the DSA_NOTIFIER_MTU event being emitted only for MTU updates on the CPU port, which need to be reflected also across all DSA links. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-20net: dsa: drop dsa_slave_priv from dsa_slave_change_mtuVladimir Oltean
We can get a hold of the "ds" pointer directly from "dp", no need for the dsa_slave_priv. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-20net: dsa: avoid one dsa_to_port() in dsa_slave_change_mtuVladimir Oltean
We could retrieve the cpu_dp pointer directly from the "dp" we already have, no need to resort to dsa_to_port(ds, port). This change also removes the need for an "int port", so that is also deleted. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-20net: dsa: use dsa_tree_for_each_user_port in dsa_slave_change_mtuVladimir Oltean
Use the more conventional iterator over user ports instead of explicitly ignoring them, and use the more conventional name "other_dp" instead of "dp_iter", for readability. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>