summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2021-06-11net: dsa: sja1105: implement TX timestamping for SJA1110Vladimir Oltean
The TX timestamping procedure for SJA1105 is a bit unconventional because the transmit procedure itself is unconventional. Control packets (and therefore PTP as well) are transmitted to a specific port in SJA1105 using "management routes" which must be written over SPI to the switch. These are one-shot rules that match by destination MAC address on traffic coming from the CPU port, and select the precise destination port for that packet. So to transmit a packet from NET_TX softirq context, we actually need to defer to a process context so that we can perform that SPI write before we send the packet. The DSA master dev_queue_xmit() runs in process context, and we poll until the switch confirms it took the TX timestamp, then we annotate the skb clone with that TX timestamp. This is why the sja1105 driver does not need an skb queue for TX timestamping. But the SJA1110 is a bit (not much!) more conventional, and you can request 2-step TX timestamping through the DSA header, as well as give the switch a cookie (timestamp ID) which it will give back to you when it has the timestamp. So now we do need a queue for keeping the skb clones until their TX timestamps become available. The interesting part is that the metadata frames from SJA1105 haven't disappeared completely. On SJA1105 they were used as follow-ups which contained RX timestamps, but on SJA1110 they are actually TX completion packets, which contain a variable (up to 32) array of timestamps. Why an array? Because: - not only is the TX timestamp on the egress port being communicated, but also the RX timestamp on the CPU port. Nice, but we don't care about that, so we ignore it. - because a packet could be multicast to multiple egress ports, each port takes its own timestamp, and the TX completion packet contains the individual timestamps on each port. This is unconventional because switches typically have a timestamping FIFO and raise an interrupt, but this one doesn't. So the tagger needs to detect and parse meta frames, and call into the main switch driver, which pairs the timestamps with the skbs in the TX timestamping queue which are waiting for one. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-11net: dsa: add support for the SJA1110 native tagging protocolVladimir Oltean
The SJA1110 has improved a few things compared to SJA1105: - To send a control packet from the host port with SJA1105, one needed to program a one-shot "management route" over SPI. This is no longer true with SJA1110, you can actually send "in-band control extensions" in the packets sent by DSA, these are in fact DSA tags which contain the destination port and switch ID. - When receiving a control packet from the switch with SJA1105, the source port and switch ID were written in bytes 3 and 4 of the destination MAC address of the frame (which was a very poor shot at a DSA header). If the control packet also had an RX timestamp, that timestamp was sent in an actual follow-up packet, so there were reordering concerns on multi-core/multi-queue DSA masters, where the metadata frame with the RX timestamp might get processed before the actual packet to which that timestamp belonged (there is no way to pair a packet to its timestamp other than the order in which they were received). On SJA1110, this is no longer true, control packets have the source port, switch ID and timestamp all in the DSA tags. - Timestamps from the switch were partial: to get a 64-bit timestamp as required by PTP stacks, one would need to take the partial 24-bit or 32-bit timestamp from the packet, then read the current PTP time very quickly, and then patch in the high bits of the current PTP time into the captured partial timestamp, to reconstruct what the full 64-bit timestamp must have been. That is awful because packet processing is done in NAPI context, but reading the current PTP time is done over SPI and therefore needs sleepable context. But it also aggravated a few things: - Not only is there a DSA header in SJA1110, but there is a DSA trailer in fact, too. So DSA needs to be extended to support taggers which have both a header and a trailer. Very unconventional - my understanding is that the trailer exists because the timestamps couldn't be prepared in time for putting them in the header area. - Like SJA1105, not all packets sent to the CPU have the DSA tag added to them, only control packets do: * the ones which match the destination MAC filters/traps in MAC_FLTRES1 and MAC_FLTRES0 * the ones which match FDB entries which have TRAP or TAKETS bits set So we could in theory hack something up to request the switch to take timestamps for all packets that reach the CPU, and those would be DSA-tagged and contain the source port / switch ID by virtue of the fact that there needs to be a timestamp trailer provided. BUT: - The SJA1110 does not parse its own DSA tags in a way that is useful for routing in cross-chip topologies, a la Marvell. And the sja1105 driver already supports cross-chip bridging from the SJA1105 days. It does that by automatically setting up the DSA links as VLAN trunks which contain all the necessary tag_8021q RX VLANs that must be communicated between the switches that span the same bridge. So when using tag_8021q on sja1105, it is possible to have 2 switches with ports sw0p0, sw0p1, sw1p0, sw1p1, and 2 VLAN-unaware bridges br0 and br1, and br0 can take sw0p0 and sw1p0, and br1 can take sw0p1 and sw1p1, and forwarding will happen according to the expected rules of the Linux bridge. We like that, and we don't want that to go away, so as a matter of fact, the SJA1110 tagger still needs to support tag_8021q. So the sja1110 tagger is a hybrid between tag_8021q for data packets, and the native hardware support for control packets. On RX, packets have a 13-byte trailer if they contain an RX timestamp. That trailer is padded in such a way that its byte 8 (the start of the "residence time" field - not parsed by Linux because we don't care) is aligned on a 16 byte boundary. So the padding has a variable length between 0 and 15 bytes. The DSA header contains the offset of the beginning of the padding relative to the beginning of the frame (and the end of the padding is obviously the end of the packet minus 13 bytes, the length of the trailer). So we discard it. Packets which don't have a trailer contain the source port and switch ID information in the header (they are "trap-to-host" packets). Packets which have a trailer contain the source port and switch ID in the trailer. On TX, the destination port mask and switch ID is always in the trailer, so we always need to say in the header that a trailer is present. The header needs a custom EtherType and this was chosen as 0xdadc, after 0xdada which is for Marvell and 0xdadb which is for VLANs in VLAN-unaware mode on SJA1105 (and SJA1110 in fact too). Because we use tag_8021q in concert with the native tagging protocol, control packets will have 2 DSA tags. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-11net: dsa: sja1105: make SJA1105_SKB_CB fit a full timestampVladimir Oltean
In SJA1105, RX timestamps for packets sent to the CPU are transmitted in separate follow-up packets (metadata frames). These contain partial timestamps (24 or 32 bits) which are kept in SJA1105_SKB_CB(skb)->meta_tstamp. Thankfully, SJA1110 improved that, and the RX timestamps are now transmitted in-band with the actual packet, in the timestamp trailer. The RX timestamps are now full-width 64 bits. Because we process the RX DSA tags in the rcv() method in the tagger, but we would like to preserve the DSA code structure in that we populate the skb timestamp in the port_rxtstamp() call which only happens later, the implication is that we must somehow pass the 64-bit timestamp from the rcv() method all the way to port_rxtstamp(). We can use the skb->cb for that. Rename the meta_tstamp from struct sja1105_skb_cb from "meta_tstamp" to "tstamp", and increase its size to 64 bits. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-11net: dsa: tag_8021q: refactor RX VLAN parsing into a dedicated functionVladimir Oltean
The added value of this function is that it can deal with both the case where the VLAN header is in the skb head, as well as in the offload field. This is something I was not able to do using other functions in the network stack. Since both ocelot-8021q and sja1105 need to do the same stuff, let's make it a common service provided by tag_8021q. This is done as refactoring for the new SJA1110 tagger, which partly uses tag_8021q as well (just like SJA1105), and will be the third caller. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-11net: dsa: tag_sja1105: stop resetting network and transport headersVladimir Oltean
This makes no sense and is not needed, it is probably a debugging leftover. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-11net: dsa: generalize overhead for taggers that use both headers and trailersVladimir Oltean
Some really really weird switches just couldn't decide whether to use a normal or a tail tagger, so they just did both. This creates problems for DSA, because we only have the concept of an 'overhead' which can be applied to the headroom or to the tailroom of the skb (like for example during the central TX reallocation procedure), depending on the value of bool tail_tag, but not to both. We need to generalize DSA to cater for these odd switches by transforming the 'overhead / tail_tag' pair into 'needed_headroom / needed_tailroom'. The DSA master's MTU is increased to account for both. The flow dissector code is modified such that it only calls the DSA adjustment callback if the tagger has a non-zero header length. Taggers are trivially modified to declare either needed_headroom or needed_tailroom, based on the tail_tag value that they currently declare. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-11xfrm: merge dstopt and routing hdroff functionsFlorian Westphal
Both functions are very similar, so merge them into one. The nexthdr is passed as argument to break the loop in the ROUTING case, this is the only header type where slightly different rules apply. While at it, the merged function is realigned with ip6_find_1stfragopt(). That function received bug fixes for an infinite loop, but neither dstopt nor rh parsing functions (copy-pasted from ip6_find_1stfragopt) were changed. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2021-06-11xfrm: remove hdr_offset indirectionFlorian Westphal
After previous patches all remaining users set the function pointer to the same function: xfrm6_find_1stfragopt. So remove this function pointer and call ip6_find_1stfragopt directly. Reduces size of xfrm_type to 64 bytes on 64bit platforms. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2021-06-11xfrm: ipv6: move mip6_rthdr_offset into xfrm coreFlorian Westphal
Place the call into the xfrm core. After this all remaining users set the hdr_offset function pointer to the same function which opens the possiblity to remove the indirection. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2021-06-11xfrm: ipv6: move mip6_destopt_offset into xfrm coreFlorian Westphal
This helper is relatively small, just move this to the xfrm core and call it directly. Next patch does the same for the ROUTING type. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2021-06-11xfrm: ipv6: add xfrm6_hdr_offset helperFlorian Westphal
This moves the ->hdr_offset indirect call to a new helper. A followup patch can then modify the new function to replace the indirect call by direct calls to the required hdr_offset helper. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2021-06-11xfrm: policy: fix a spelling mistakegushengxian
Fix a spelling mistake. Signed-off-by: gushengxian <gushengxian@yulong.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2021-06-10mptcp: fix soft lookup in subflow_error_report()Paolo Abeni
Maxim reported a soft lookup in subflow_error_report(): watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:0] RIP: 0010:native_queued_spin_lock_slowpath RSP: 0018:ffffa859c0003bc0 EFLAGS: 00000202 RAX: 0000000000000101 RBX: 0000000000000001 RCX: 0000000000000000 RDX: ffff9195c2772d88 RSI: 0000000000000000 RDI: ffff9195c2772d88 RBP: ffff9195c2772d00 R08: 00000000000067b0 R09: c6e31da9eb1e44f4 R10: ffff9195ef379700 R11: ffff9195edb50710 R12: ffff9195c2772d88 R13: ffff9195f500e3d0 R14: ffff9195ef379700 R15: ffff9195ef379700 FS: 0000000000000000(0000) GS:ffff91961f400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000000c000407000 CR3: 0000000002988000 CR4: 00000000000006f0 Call Trace: <IRQ> _raw_spin_lock_bh subflow_error_report mptcp_subflow_data_available __mptcp_move_skbs_from_subflow mptcp_data_ready tcp_data_queue tcp_rcv_established tcp_v4_do_rcv tcp_v4_rcv ip_protocol_deliver_rcu ip_local_deliver_finish __netif_receive_skb_one_core netif_receive_skb rtl8139_poll 8139too __napi_poll net_rx_action __do_softirq __irq_exit_rcu common_interrupt </IRQ> The calling function - mptcp_subflow_data_available() - can be invoked from different contexts: - plain ssk socket lock - ssk socket lock + mptcp_data_lock - ssk socket lock + mptcp_data_lock + msk socket lock. Since subflow_error_report() tries to acquire the mptcp_data_lock, the latter two call chains will cause soft lookup. This change addresses the issue moving the error reporting call to outer functions, where the held locks list is known and the we can acquire only the needed one. Reported-by: Maxim Galaganov <max@internet.ru> Fixes: 15cc10453398 ("mptcp: deliver ssk errors to msk") Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/199 Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-10mptcp: do not warn on bad input from the networkPaolo Abeni
warn_bad_map() produces a kernel WARN on bad input coming from the network. Use pr_debug() to avoid spamming the system log. Additionally, when the right bound check fails, warn_bad_map() reports the wrong ssn value, let's fix it. Fixes: 648ef4b88673 ("mptcp: Implement MPTCP receive path") Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/107 Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-10mptcp: wake-up readers only for in sequence dataPaolo Abeni
Currently we rely on the subflow->data_avail field, which is subject to races: ssk1 skb len = 500 DSS(seq=1, len=1000, off=0) # data_avail == MPTCP_SUBFLOW_DATA_AVAIL ssk2 skb len = 500 DSS(seq = 501, len=1000) # data_avail == MPTCP_SUBFLOW_DATA_AVAIL ssk1 skb len = 500 DSS(seq = 1, len=1000, off =500) # still data_avail == MPTCP_SUBFLOW_DATA_AVAIL, # as the skb is covered by a pre-existing map, # which was in-sequence at reception time. Instead we can explicitly check if some has been received in-sequence, propagating the info from __mptcp_move_skbs_from_subflow(). Additionally add the 'ONCE' annotation to the 'data_avail' memory access, as msk will read it outside the subflow socket lock. Fixes: 648ef4b88673 ("mptcp: Implement MPTCP receive path") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-10mptcp: try harder to borrow memory from subflow under pressurePaolo Abeni
If the host is under sever memory pressure, and RX forward memory allocation for the msk fails, we try to borrow the required memory from the ingress subflow. The current attempt is a bit flaky: if skb->truesize is less than SK_MEM_QUANTUM, the ssk will not release any memory, and the next schedule will fail again. Instead, directly move the required amount of pages from the ssk to the msk, if available Fixes: 9c3f94e1681b ("mptcp: add missing memory scheduling in the rx path") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-10Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for net: 1) Fix a crash when stateful expression with its own gc callback is used in a set definition. 2) Skip IPv6 packets from any link-local address in IPv6 fib expression. Add a selftest for this scenario, from Florian Westphal. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-10sch_cake: Fix out of bounds when parsing TCP options and headerMaxim Mikityanskiy
The TCP option parser in cake qdisc (cake_get_tcpopt and cake_tcph_may_drop) could read one byte out of bounds. When the length is 1, the execution flow gets into the loop, reads one byte of the opcode, and if the opcode is neither TCPOPT_EOL nor TCPOPT_NOP, it reads one more byte, which exceeds the length of 1. This fix is inspired by commit 9609dad263f8 ("ipv4: tcp_input: fix stack out of bounds when parsing TCP options."). v2 changes: Added doff validation in cake_get_tcphdr to avoid parsing garbage as TCP header. Although it wasn't strictly an out-of-bounds access (memory was allocated), garbage values could be read where CAKE expected the TCP header if doff was smaller than 5. Cc: Young Xiao <92siuyang@gmail.com> Fixes: 8b7138814f29 ("sch_cake: Add optional ACK filter") Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Acked-by: Toke Høiland-Jørgensen <toke@toke.dk> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-10mptcp: Fix out of bounds when parsing TCP optionsMaxim Mikityanskiy
The TCP option parser in mptcp (mptcp_get_options) could read one byte out of bounds. When the length is 1, the execution flow gets into the loop, reads one byte of the opcode, and if the opcode is neither TCPOPT_EOL nor TCPOPT_NOP, it reads one more byte, which exceeds the length of 1. This fix is inspired by commit 9609dad263f8 ("ipv4: tcp_input: fix stack out of bounds when parsing TCP options."). Cc: Young Xiao <92siuyang@gmail.com> Fixes: cec37a6e41aa ("mptcp: Handle MP_CAPABLE options for outgoing connections") Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-10netfilter: synproxy: Fix out of bounds when parsing TCP optionsMaxim Mikityanskiy
The TCP option parser in synproxy (synproxy_parse_options) could read one byte out of bounds. When the length is 1, the execution flow gets into the loop, reads one byte of the opcode, and if the opcode is neither TCPOPT_EOL nor TCPOPT_NOP, it reads one more byte, which exceeds the length of 1. This fix is inspired by commit 9609dad263f8 ("ipv4: tcp_input: fix stack out of bounds when parsing TCP options."). v2 changes: Added an early return when length < 0 to avoid calling skb_header_pointer with negative length. Cc: Young Xiao <92siuyang@gmail.com> Fixes: 48b1de4c110a ("netfilter: add SYNPROXY core/target") Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-10net/packet: annotate data race in packet_sendmsg()Eric Dumazet
There is a known race in packet_sendmsg(), addressed in commit 32d3182cd2cd ("net/packet: fix race in tpacket_snd()") Now we have data_race(), we can use it to avoid a future KCSAN warning, as syzbot loves stressing af_packet sockets :) Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-10dccp: tfrc: fix doc warnings in tfrc_equation.cBaokun Li
Add description for `tfrc_invert_loss_event_rate` to fix the W=1 warnings: net/dccp/ccids/lib/tfrc_equation.c:695: warning: Function parameter or member 'loss_event_rate' not described in 'tfrc_invert_loss_event_rate' Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Richard Sailer <richard_siegfried@systemli.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-10atm: Use list_for_each_entry() to simplify code in resources.cWang Hai
Convert list_for_each() to list_for_each_entry() where applicable. This simplifies the code. Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Wang Hai <wanghai38@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-10net: x25: Use list_for_each_entry() to simplify code in x25_route.cWang Hai
Convert list_for_each() to list_for_each_entry() where applicable. This simplifies the code. Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Wang Hai <wanghai38@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-10net: bridge: fix vlan tunnel dst refcnt when egressingNikolay Aleksandrov
The egress tunnel code uses dst_clone() and directly sets the result which is wrong because the entry might have 0 refcnt or be already deleted, causing number of problems. It also triggers the WARN_ON() in dst_hold()[1] when a refcnt couldn't be taken. Fix it by using dst_hold_safe() and checking if a reference was actually taken before setting the dst. [1] dmesg WARN_ON log and following refcnt errors WARNING: CPU: 5 PID: 38 at include/net/dst.h:230 br_handle_egress_vlan_tunnel+0x10b/0x134 [bridge] Modules linked in: 8021q garp mrp bridge stp llc bonding ipv6 virtio_net CPU: 5 PID: 38 Comm: ksoftirqd/5 Kdump: loaded Tainted: G W 5.13.0-rc3+ #360 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-1.fc33 04/01/2014 RIP: 0010:br_handle_egress_vlan_tunnel+0x10b/0x134 [bridge] Code: e8 85 bc 01 e1 45 84 f6 74 90 45 31 f6 85 db 48 c7 c7 a0 02 19 a0 41 0f 94 c6 31 c9 31 d2 44 89 f6 e8 64 bc 01 e1 85 db 75 02 <0f> 0b 31 c9 31 d2 44 89 f6 48 c7 c7 70 02 19 a0 e8 4b bc 01 e1 49 RSP: 0018:ffff8881003d39e8 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffffa01902a0 RBP: ffff8881040c6700 R08: 0000000000000000 R09: 0000000000000001 R10: 2ce93d0054fe0d00 R11: 54fe0d00000e0000 R12: ffff888109515000 R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000401 FS: 0000000000000000(0000) GS:ffff88822bf40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f42ba70f030 CR3: 0000000109926000 CR4: 00000000000006e0 Call Trace: br_handle_vlan+0xbc/0xca [bridge] __br_forward+0x23/0x164 [bridge] deliver_clone+0x41/0x48 [bridge] br_handle_frame_finish+0x36f/0x3aa [bridge] ? skb_dst+0x2e/0x38 [bridge] ? br_handle_ingress_vlan_tunnel+0x3e/0x1c8 [bridge] ? br_handle_frame_finish+0x3aa/0x3aa [bridge] br_handle_frame+0x2c3/0x377 [bridge] ? __skb_pull+0x33/0x51 ? vlan_do_receive+0x4f/0x36a ? br_handle_frame_finish+0x3aa/0x3aa [bridge] __netif_receive_skb_core+0x539/0x7c6 ? __list_del_entry_valid+0x16e/0x1c2 __netif_receive_skb_list_core+0x6d/0xd6 netif_receive_skb_list_internal+0x1d9/0x1fa gro_normal_list+0x22/0x3e dev_gro_receive+0x55b/0x600 ? detach_buf_split+0x58/0x140 napi_gro_receive+0x94/0x12e virtnet_poll+0x15d/0x315 [virtio_net] __napi_poll+0x2c/0x1c9 net_rx_action+0xe6/0x1fb __do_softirq+0x115/0x2d8 run_ksoftirqd+0x18/0x20 smpboot_thread_fn+0x183/0x19c ? smpboot_unregister_percpu_thread+0x66/0x66 kthread+0x10a/0x10f ? kthread_mod_delayed_work+0xb6/0xb6 ret_from_fork+0x22/0x30 ---[ end trace 49f61b07f775fd2b ]--- dst_release: dst:00000000c02d677a refcnt:-1 dst_release underflow Cc: stable@vger.kernel.org Fixes: 11538d039ac6 ("bridge: vlan dst_metadata hooks in ingress and egress paths") Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-10net: bridge: fix vlan tunnel dst null pointer dereferenceNikolay Aleksandrov
This patch fixes a tunnel_dst null pointer dereference due to lockless access in the tunnel egress path. When deleting a vlan tunnel the tunnel_dst pointer is set to NULL without waiting a grace period (i.e. while it's still usable) and packets egressing are dereferencing it without checking. Use READ/WRITE_ONCE to annotate the lockless use of tunnel_id, use RCU for accessing tunnel_dst and make sure it is read only once and checked in the egress path. The dst is already properly RCU protected so we don't need to do anything fancy than to make sure tunnel_id and tunnel_dst are read only once and checked in the egress path. Cc: stable@vger.kernel.org Fixes: 11538d039ac6 ("bridge: vlan dst_metadata hooks in ingress and egress paths") Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-10tipc:subscr.c: fix a spelling mistakegushengxian
Fix a spelling mistake. Signed-off-by: gushengxian <gushengxian@yulong.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-10tipc: socket.c: fix the use of copular verbgushengxian
Fix the use of copular verb. Signed-off-by: gushengxian <gushengxian@yulong.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-10node.c: fix the use of indefinite articlegushengxian
Fix the use of indefinite article. Signed-off-by: gushengxian <gushengxian@yulong.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-10af_unix: remove the repeated word "and"gushengxian
Remove the repeated word "and". Signed-off-by: gushengxian <gushengxian@yulong.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-10ping: Check return value of function 'ping_queue_rcv_skb'Zheng Yongjun
Function 'ping_queue_rcv_skb' not always return success, which will also return fail. If not check the wrong return value of it, lead to function `ping_rcv` return success. Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-10vsock/vmci: remove the repeated word "be"gushengxian
Remove the repeated word "be". Signed-off-by: gushengxian <gushengxian@yulong.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-10skbuff: fix incorrect msg_zerocopy copy notificationsWillem de Bruijn
msg_zerocopy signals if a send operation required copying with a flag in serr->ee.ee_code. This field can be incorrect as of the below commit, as a result of both structs uarg and serr pointing into the same skb->cb[]. uarg->zerocopy must be read before skb->cb[] is reinitialized to hold serr. Similar to other fields len, hi and lo, use a local variable to temporarily hold the value. This was not a problem before, when the value was passed as a function argument. Fixes: 75518851a2a0 ("skbuff: Push status and refcounts into sock_zerocopy_callback") Reported-by: Talal Ahmad <talalahmad@google.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-09net/sched: act_ct: handle DNAT tuple collisionMarcelo Ricardo Leitner
This this the counterpart of 8aa7b526dc0b ("openvswitch: handle DNAT tuple collision") for act_ct. From that commit changelog: """ With multiple DNAT rules it's possible that after destination translation the resulting tuples collide. ... Netfilter handles this case by allocating a null binding for SNAT at egress by default. Perform the same operation in openvswitch for DNAT if no explicit SNAT is requested by the user and allocate a null binding for SNAT for packets in the "original" direction. """ Fixes: 95219afbb980 ("act_ct: support asymmetric conntrack") Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-09rtnetlink: Fix regression in bridge VLAN configurationIdo Schimmel
Cited commit started returning errors when notification info is not filled by the bridge driver, resulting in the following regression: # ip link add name br1 type bridge vlan_filtering 1 # bridge vlan add dev br1 vid 555 self pvid untagged RTNETLINK answers: Invalid argument As long as the bridge driver does not fill notification info for the bridge device itself, an empty notification should not be considered as an error. This is explained in commit 59ccaaaa49b5 ("bridge: dont send notification when skb->len == 0 in rtnl_bridge_notify"). Fix by removing the error and add a comment to avoid future bugs. Fixes: a8db57c1d285 ("rtnetlink: Fix missing error code in rtnl_bridge_notify()") Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-09Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for net-next: 1) Add nfgenmsg field to nfnetlink's struct nfnl_info and use it. 2) Remove nft_ctx_init_from_elemattr() and nft_ctx_init_from_setattr() helper functions. 3) Add the nf_ct_pernet() helper function to fetch the conntrack pernetns data area. 4) Expose TCP and UDP flowtable offload timeouts through sysctl, from Oz Shlomo. 5) Add nfnetlink_hook subsystem to fetch the netfilter hook pipeline configuration, from Florian Westphal. This also includes a new field to annotate the hook type as metadata. 6) Fix unsafe memory access to non-linear skbuff in the new SCTP chunk support for nft_exthdr, from Phil Sutter. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-09Merge tag 'mac80211-for-net-2021-06-09' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211 Johannes berg says: ==================== A fair number of fixes: * fix more fallout from RTNL locking changes * fixes for some of the bugs found by syzbot * drop multicast fragments in mac80211 to align with the spec and what drivers are doing now * fix NULL-ptr deref in radiotap injection ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-09devlink: Fix error message in devlink_rate_set_ops_supported()Dan Carpenter
The WARN_ON() macro takes a condition, it doesn't take a message. Use WARN() instead. Fixes: 1897db2ec310 ("devlink: Allow setting tx rate for devlink rate leaf objects") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-09udp: fix race between close() and udp_abort()Paolo Abeni
Kaustubh reported and diagnosed a panic in udp_lib_lookup(). The root cause is udp_abort() racing with close(). Both racing functions acquire the socket lock, but udp{v6}_destroy_sock() release it before performing destructive actions. We can't easily extend the socket lock scope to avoid the race, instead use the SOCK_DEAD flag to prevent udp_abort from doing any action when the critical race happens. Diagnosed-and-tested-by: Kaustubh Pandey <kapandey@codeaurora.org> Fixes: 5d77dca82839 ("net: diag: support SOCK_DESTROY for UDP sockets") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-09inet: annotate data race in inet_send_prepare() and inet_dgram_connect()Eric Dumazet
Both functions are known to be racy when reading inet_num as we do not want to grab locks for the common case the socket has been bound already. The race is resolved in inet_autobind() by reading again inet_num under the socket lock. syzbot reported: BUG: KCSAN: data-race in inet_send_prepare / udp_lib_get_port write to 0xffff88812cba150e of 2 bytes by task 24135 on cpu 0: udp_lib_get_port+0x4b2/0xe20 net/ipv4/udp.c:308 udp_v6_get_port+0x5e/0x70 net/ipv6/udp.c:89 inet_autobind net/ipv4/af_inet.c:183 [inline] inet_send_prepare+0xd0/0x210 net/ipv4/af_inet.c:807 inet6_sendmsg+0x29/0x80 net/ipv6/af_inet6.c:639 sock_sendmsg_nosec net/socket.c:654 [inline] sock_sendmsg net/socket.c:674 [inline] ____sys_sendmsg+0x360/0x4d0 net/socket.c:2350 ___sys_sendmsg net/socket.c:2404 [inline] __sys_sendmmsg+0x315/0x4b0 net/socket.c:2490 __do_sys_sendmmsg net/socket.c:2519 [inline] __se_sys_sendmmsg net/socket.c:2516 [inline] __x64_sys_sendmmsg+0x53/0x60 net/socket.c:2516 do_syscall_64+0x4a/0x90 arch/x86/entry/common.c:47 entry_SYSCALL_64_after_hwframe+0x44/0xae read to 0xffff88812cba150e of 2 bytes by task 24132 on cpu 1: inet_send_prepare+0x21/0x210 net/ipv4/af_inet.c:806 inet6_sendmsg+0x29/0x80 net/ipv6/af_inet6.c:639 sock_sendmsg_nosec net/socket.c:654 [inline] sock_sendmsg net/socket.c:674 [inline] ____sys_sendmsg+0x360/0x4d0 net/socket.c:2350 ___sys_sendmsg net/socket.c:2404 [inline] __sys_sendmmsg+0x315/0x4b0 net/socket.c:2490 __do_sys_sendmmsg net/socket.c:2519 [inline] __se_sys_sendmmsg net/socket.c:2516 [inline] __x64_sys_sendmmsg+0x53/0x60 net/socket.c:2516 do_syscall_64+0x4a/0x90 arch/x86/entry/common.c:47 entry_SYSCALL_64_after_hwframe+0x44/0xae value changed: 0x0000 -> 0x9db4 Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 24132 Comm: syz-executor.2 Not tainted 5.13.0-rc4-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-09net/x25: fix a mistake in grammargushengxian
Fix a mistake in grammar. Signed-off-by: gushengxian <gushengxian@yulong.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-09net: ethtool: clear heap allocations for ethtool functionAustin Kim
Several ethtool functions leave heap uncleared (potentially) by drivers. This will leave the unused portion of heap unchanged and might copy the full contents back to userspace. Signed-off-by: Austin Kim <austindh.kim@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-09netfilter: nf_tables: move base hook annotation to init helperFlorian Westphal
coverity scanner says: 2187 if (nft_is_base_chain(chain)) { vvv CID 1505166: Memory - corruptions (UNINIT) vvv Using uninitialized value "basechain". 2188 basechain->ops.hook_ops_type = NF_HOOK_OP_NF_TABLES; ... I don't see how nft_is_base_chain() can evaluate to true while basechain pointer is garbage. However, it seems better to place the NF_HOOK_OP_NF_TABLES annotation in nft_basechain_hook_init() instead. Reported-by: coverity-bot <keescook+coverity-bot@chromium.org> Addresses-Coverity-ID: 1505166 ("Memory - corruptions") Fixes: 65b8b7bfc5284f ("netfilter: annotate nf_tables base hook ops") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-06-09netfilter: nfnetlink_hook: add depends-on nftablesFlorian Westphal
nfnetlink_hook.c: In function 'nfnl_hook_put_nft_chain_info': nfnetlink_hook.c:76:7: error: implicit declaration of 'nft_is_active' This macro is only defined when NF_TABLES is enabled. While its possible to also add an ifdef-guard, the infrastructure is currently not useful without nf_tables. Reported-by: kernel test robot <lkp@intel.com> Fixes: 252956528caa ("netfilter: add new hook nfnl subsystem") Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-06-09netfilter: nfnetlink_hook: fix array index out-of-bounds errorColin Ian King
Currently the array net->nf.hooks_ipv6 is accessed by index hook before hook is sanity checked. Fix this by moving the sanity check to before the array access. Addresses-Coverity: ("Out-of-bounds access") Fixes: e2cf17d3774c ("netfilter: add new hook nfnl subsystem") Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-06-09netfilter: nft_fib_ipv6: skip ipv6 packets from any to link-localFlorian Westphal
The ip6tables rpfilter match has an extra check to skip packets with "::" source address. Extend this to ipv6 fib expression. Else ipv6 duplicate address detection packets will fail rpf route check -- lookup returns -ENETUNREACH. While at it, extend the prerouting check to also cover the ingress hook. Closes: https://bugzilla.netfilter.org/show_bug.cgi?id=1543 Fixes: f6d0cbcf09c5 ("netfilter: nf_tables: add fib expression") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-06-09netfilter: nf_tables: initialize set before expression setupPablo Neira Ayuso
nft_set_elem_expr_alloc() needs an initialized set if expression sets on the NFT_EXPR_GC flag. Move set fields initialization before expression setup. [4512935.019450] ================================================================== [4512935.019456] BUG: KASAN: null-ptr-deref in nft_set_elem_expr_alloc+0x84/0xd0 [nf_tables] [4512935.019487] Read of size 8 at addr 0000000000000070 by task nft/23532 [4512935.019494] CPU: 1 PID: 23532 Comm: nft Not tainted 5.12.0-rc4+ #48 [...] [4512935.019502] Call Trace: [4512935.019505] dump_stack+0x89/0xb4 [4512935.019512] ? nft_set_elem_expr_alloc+0x84/0xd0 [nf_tables] [4512935.019536] ? nft_set_elem_expr_alloc+0x84/0xd0 [nf_tables] [4512935.019560] kasan_report.cold.12+0x5f/0xd8 [4512935.019566] ? nft_set_elem_expr_alloc+0x84/0xd0 [nf_tables] [4512935.019590] nft_set_elem_expr_alloc+0x84/0xd0 [nf_tables] [4512935.019615] nf_tables_newset+0xc7f/0x1460 [nf_tables] Reported-by: syzbot+ce96ca2b1d0b37c6422d@syzkaller.appspotmail.com Fixes: 65038428b2c6 ("netfilter: nf_tables: allow to specify stateful expression in set definition") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-06-09mac80211: drop multicast fragmentsJohannes Berg
These are not permitted by the spec, just drop them. Link: https://lore.kernel.org/r/20210609161305.23def022b750.Ibd6dd3cdce573dae262fcdc47f8ac52b883a9c50@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-06-09mac80211: move interface shutdown out of wiphy lockJohannes Berg
When reconfiguration fails, we shut down everything, but we cannot call cfg80211_shutdown_all_interfaces() with the wiphy mutex held. Since cfg80211 now calls it on resume errors, we only need to do likewise for where we call reconfig (whether directly or indirectly), but not under the wiphy lock. Cc: stable@vger.kernel.org Fixes: 2fe8ef106238 ("cfg80211: change netdev registration/unregistration semantics") Link: https://lore.kernel.org/r/20210608113226.78233c80f548.Iecc104aceb89f0568f50e9670a9cb191a1c8887b@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-06-09cfg80211: shut down interfaces on failed resumeJohannes Berg
If resume fails, we should shut down all interfaces as the hardware is probably dead. This was/is already done now in mac80211, but we need to change that due to locking issues, so move it here and do it without the wiphy lock held. Cc: stable@vger.kernel.org Fixes: 2fe8ef106238 ("cfg80211: change netdev registration/unregistration semantics") Link: https://lore.kernel.org/r/20210608113226.d564ca69de7c.I2e3c3e5d410b72a4f63bade4fb075df041b3d92f@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>