summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2021-01-08net: make sure devices go through netdev_wait_all_refsJakub Kicinski
If register_netdevice() fails at the very last stage - the notifier call - some subsystems may have already seen it and grabbed a reference. struct net_device can't be freed right away without calling netdev_wait_all_refs(). Now that we have a clean interface in form of dev->needs_free_netdev and lenient free_netdev() we can undo what commit 93ee31f14f6f ("[NET]: Fix free_netdev on register_netdev failure.") has done and complete the unregistration path by bringing the net_set_todo() call back. After registration fails user is still expected to explicitly free the net_device, so make sure ->needs_free_netdev is cleared, otherwise rolling back the registration will cause the old double free for callers who release rtnl_lock before the free. This also solves the problem of priv_destructor not being called on notifier error. net_set_todo() will be moved back into unregister_netdevice_queue() in a follow up. Reported-by: Hulk Robot <hulkci@huawei.com> Reported-by: Yang Yingliang <yangyingliang@huawei.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-08net: make free_netdev() more lenient with unregistering devicesJakub Kicinski
There are two flavors of handling netdev registration: - ones called without holding rtnl_lock: register_netdev() and unregister_netdev(); and - those called with rtnl_lock held: register_netdevice() and unregister_netdevice(). While the semantics of the former are pretty clear, the same can't be said about the latter. The netdev_todo mechanism is utilized to perform some of the device unregistering tasks and it hooks into rtnl_unlock() so the locked variants can't actually finish the work. In general free_netdev() does not mix well with locked calls. Most drivers operating under rtnl_lock set dev->needs_free_netdev to true and expect core to make the free_netdev() call some time later. The part where this becomes most problematic is error paths. There is no way to unwind the state cleanly after a call to register_netdevice(), since unreg can't be performed fully without dropping locks. Make free_netdev() more lenient, and defer the freeing if device is being unregistered. This allows error paths to simply call free_netdev() both after register_netdevice() failed, and after a call to unregister_netdevice() but before dropping rtnl_lock. Simplify the error paths which are currently doing gymnastics around free_netdev() handling. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-08docs: net: explain struct net_device lifetimeJakub Kicinski
Explain the two basic flows of struct net_device's operation. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-08net: ip_tunnel: clean up endianness conversionsJulian Wiedmann
sparse complains about some harmless endianness issues: > net/ipv4/ip_tunnel_core.c:225:43: warning: cast to restricted __be16 > net/ipv4/ip_tunnel_core.c:225:43: warning: incorrect type in initializer (different base types) > net/ipv4/ip_tunnel_core.c:225:43: expected restricted __be16 [usertype] mtu > net/ipv4/ip_tunnel_core.c:225:43: got unsigned short [usertype] iptunnel_pmtud_build_icmp() uses the wrong flavour of byte-order conversion when storing the MTU into the ICMPv4 packet. Use htons(), just like iptunnel_pmtud_build_icmpv6() does. > net/ipv4/ip_tunnel_core.c:248:35: warning: cast from restricted __be16 > net/ipv4/ip_tunnel_core.c:248:35: warning: incorrect type in argument 3 (different base types) > net/ipv4/ip_tunnel_core.c:248:35: expected unsigned short type > net/ipv4/ip_tunnel_core.c:248:35: got restricted __be16 [usertype] > net/ipv4/ip_tunnel_core.c:341:35: warning: cast from restricted __be16 > net/ipv4/ip_tunnel_core.c:341:35: warning: incorrect type in argument 3 (different base types) > net/ipv4/ip_tunnel_core.c:341:35: expected unsigned short type > net/ipv4/ip_tunnel_core.c:341:35: got restricted __be16 [usertype] eth_header() wants the Ethertype in host-order, use the correct flavour of byte-order conversion. > net/ipv4/ip_tunnel_core.c:600:45: warning: restricted __be16 degrades to integer > net/ipv4/ip_tunnel_core.c:609:30: warning: incorrect type in assignment (different base types) > net/ipv4/ip_tunnel_core.c:609:30: expected int type > net/ipv4/ip_tunnel_core.c:609:30: got restricted __be16 [usertype] > net/ipv4/ip_tunnel_core.c:619:30: warning: incorrect type in assignment (different base types) > net/ipv4/ip_tunnel_core.c:619:30: expected int type > net/ipv4/ip_tunnel_core.c:619:30: got restricted __be16 [usertype] > net/ipv4/ip_tunnel_core.c:629:30: warning: incorrect type in assignment (different base types) > net/ipv4/ip_tunnel_core.c:629:30: expected int type > net/ipv4/ip_tunnel_core.c:629:30: got restricted __be16 [usertype] The TUNNEL_* types are big-endian, so adjust the type of the local variable in ip_tun_parse_opts(). Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Link: https://lore.kernel.org/r/20210107144008.25777-1-jwi@linux.ibm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-08udp: Prevent reuseport_select_sock from reading uninitialized socksBaptiste Lepers
reuse->socks[] is modified concurrently by reuseport_add_sock. To prevent reading values that have not been fully initialized, only read the array up until the last known safe index instead of incorrectly re-reading the last index of the array. Fixes: acdcecc61285f ("udp: correct reuseport selection with connected sockets") Signed-off-by: Baptiste Lepers <baptiste.lepers@gmail.com> Acked-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/20210107051110.12247-1-baptiste.lepers@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-08net: fix use-after-free when UDP GRO with shared fraglistDongseok Yi
skbs in fraglist could be shared by a BPF filter loaded at TC. If TC writes, it will call skb_ensure_writable -> pskb_expand_head to create a private linear section for the head_skb. And then call skb_clone_fraglist -> skb_get on each skb in the fraglist. skb_segment_list overwrites part of the skb linear section of each fragment itself. Even after skb_clone, the frag_skbs share their linear section with their clone in PF_PACKET. Both sk_receive_queue of PF_PACKET and PF_INET (or PF_INET6) can have a link for the same frag_skbs chain. If a new skb (not frags) is queued to one of the sk_receive_queue, multiple ptypes can see and release this. It causes use-after-free. [ 4443.426215] ------------[ cut here ]------------ [ 4443.426222] refcount_t: underflow; use-after-free. [ 4443.426291] WARNING: CPU: 7 PID: 28161 at lib/refcount.c:190 refcount_dec_and_test_checked+0xa4/0xc8 [ 4443.426726] pstate: 60400005 (nZCv daif +PAN -UAO) [ 4443.426732] pc : refcount_dec_and_test_checked+0xa4/0xc8 [ 4443.426737] lr : refcount_dec_and_test_checked+0xa0/0xc8 [ 4443.426808] Call trace: [ 4443.426813] refcount_dec_and_test_checked+0xa4/0xc8 [ 4443.426823] skb_release_data+0x144/0x264 [ 4443.426828] kfree_skb+0x58/0xc4 [ 4443.426832] skb_queue_purge+0x64/0x9c [ 4443.426844] packet_set_ring+0x5f0/0x820 [ 4443.426849] packet_setsockopt+0x5a4/0xcd0 [ 4443.426853] __sys_setsockopt+0x188/0x278 [ 4443.426858] __arm64_sys_setsockopt+0x28/0x38 [ 4443.426869] el0_svc_common+0xf0/0x1d0 [ 4443.426873] el0_svc_handler+0x74/0x98 [ 4443.426880] el0_svc+0x8/0xc Fixes: 3a1296a38d0c (net: Support GRO/GSO fraglist chaining.) Signed-off-by: Dongseok Yi <dseok.yi@samsung.com> Acked-by: Willem de Bruijn <willemb@google.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/1610072918-174177-1-git-send-email-dseok.yi@samsung.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-08net, xdp: Introduce xdp_prepare_buff utility routineLorenzo Bianconi
Introduce xdp_prepare_buff utility routine to initialize per-descriptor xdp_buff fields (e.g. xdp_buff pointers). Rely on xdp_prepare_buff() in all XDP capable drivers. Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Shay Agroskin <shayagr@amazon.com> Acked-by: Martin Habets <habetsm.xilinx@gmail.com> Acked-by: Camelia Groza <camelia.groza@nxp.com> Acked-by: Marcin Wojtas <mw@semihalf.com> Link: https://lore.kernel.org/bpf/45f46f12295972a97da8ca01990b3e71501e9d89.1608670965.git.lorenzo@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2021-01-08net, xdp: Introduce xdp_init_buff utility routineLorenzo Bianconi
Introduce xdp_init_buff utility routine to initialize xdp_buff fields const over NAPI iterations (e.g. frame_sz or rxq pointer). Rely on xdp_init_buff in all XDP capable drivers. Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Shay Agroskin <shayagr@amazon.com> Acked-by: Martin Habets <habetsm.xilinx@gmail.com> Acked-by: Camelia Groza <camelia.groza@nxp.com> Acked-by: Marcin Wojtas <mw@semihalf.com> Link: https://lore.kernel.org/bpf/7f8329b6da1434dc2b05a77f2e800b29628a8913.1608670965.git.lorenzo@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2021-01-08bpf: Replace fput with sockfd_put in sock mapZheng Yongjun
The function sockfd_lookup uses fget on the value that is stored in the file field of the returned structure, so fput should ultimately be applied to this value. This can be done directly, but it seems better to use the specific macro sockfd_put, which does the same thing. The cleanup was done using the following semantic patch: (http://www.emn.fr/x-info/coccinelle/) // <smpl> @@ expression s; @@ s = sockfd_lookup(...) ... + sockfd_put(s); ?- fput(s->file); // </smpl> Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20201229134834.22962-1-zhengyongjun3@huawei.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2021-01-08Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Trivial conflict in CAN on file rename. Conflicts: drivers/net/can/m_can/tcan4x5x-core.c Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-08cfg80211: Save the regulatory domain with a lockIlan Peer
Saving the regulatory domain while setting custom regulatory domain was done while accessing a RCU protected pointer but without any protection. Fix this by using RTNL while accessing the pointer. Signed-off-by: Ilan Peer <ilan.peer@intel.com> Reported-by: syzbot+27771d4abcd9b7a1f5d3@syzkaller.appspotmail.com Reported-by: syzbot+db4035751c56c0079282@syzkaller.appspotmail.com Reported-by: Hans de Goede <hdegoede@redhat.com> Fixes: beee24695157 ("cfg80211: Save the regulatory domain when setting custom regulatory") Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Link: https://lore.kernel.org/r/iwlwifi.20210105165657.613e9a876829.Ia38d27dbebea28bf9c56d70691d243186ede70e7@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-01-08xfrm: interface: enable TSO on xfrm interfacesEyal Birger
Underlying xfrm output supports gso packets. Declare support in hw_features and adapt the xmit MTU check to pass GSO packets. Signed-off-by: Eyal Birger <eyal.birger@gmail.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2021-01-07nexthop: Bounce NHA_GATEWAY in FDB nexthop groupsPetr Machata
The function nh_check_attr_group() is called to validate nexthop groups. The intention of that code seems to have been to bounce all attributes above NHA_GROUP_TYPE except for NHA_FDB. However instead it bounces all these attributes except when NHA_FDB attribute is present--then it accepts them. NHA_FDB validation that takes place before, in rtm_to_nh_config(), already bounces NHA_OIF, NHA_BLACKHOLE, NHA_ENCAP and NHA_ENCAP_TYPE. Yet further back, NHA_GROUPS and NHA_MASTER are bounced unconditionally. But that still leaves NHA_GATEWAY as an attribute that would be accepted in FDB nexthop groups (with no meaning), so long as it keeps the address family as unspecified: # ip nexthop add id 1 fdb via 127.0.0.1 # ip nexthop add id 10 fdb via default group 1 The nexthop code is still relatively new and likely not used very broadly, and the FDB bits are newer still. Even though there is a reproducer out there, it relies on an improbable gateway arguments "via default", "via all" or "via any". Given all this, I believe it is OK to reformulate the condition to do the right thing and bounce NHA_GATEWAY. Fixes: 38428d68719c ("nexthop: support for fdb ecmp nexthops") Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07nexthop: Unlink nexthop group entry in error pathIdo Schimmel
In case of error, remove the nexthop group entry from the list to which it was previously added. Fixes: 430a049190de ("nexthop: Add support for nexthop groups") Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07nexthop: Fix off-by-one error in error pathIdo Schimmel
A reference was not taken for the current nexthop entry, so do not try to put it in the error path. Fixes: 430a049190de ("nexthop: Add support for nexthop groups") Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07skbuff: Rename skb_zcopy_{get|put} to net_zcopy_{get|put}Jonathan Lemon
Unlike the rest of the skb_zcopy_ functions, these routines operate on a 'struct ubuf', not a skb. Remove the 'skb_' prefix from the naming to make things clearer. Suggested-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com> Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07skbuff: add flags to ubuf_info for ubuf setupJonathan Lemon
Currently, when an ubuf is attached to a new skb, the shared flags word is initialized to a fixed value. Instead of doing this, set the default flags in the ubuf, and have new skbs inherit from this default. This is needed when setting up different zerocopy types. Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07net: group skb_shinfo zerocopy related bits together.Jonathan Lemon
In preparation for expanded zerocopy (TX and RX), move the zerocopy related bits out of tx_flags into their own flag word. Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07skbuff: rename sock_zerocopy_* to msg_zerocopy_*Jonathan Lemon
At Willem's suggestion, rename the sock_zerocopy_* functions so that they match the MSG_ZEROCOPY flag, which makes it clear they are specific to this zerocopy implementation. Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07skbuff: Call skb_zcopy_clear() before unref'ing fragmentsJonathan Lemon
RX zerocopy fragment pages which are not allocated from the system page pool require special handling. Give the callback in skb_zcopy_clear() a chance to process them first. Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07skbuff: Call sock_zerocopy_put_abort from skb_zcopy_put_abortJonathan Lemon
The sock_zerocopy_put_abort function contains logic which is specific to the current zerocopy implementation. Add a wrapper which checks the callback and dispatches apppropriately. Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07skbuff: Add skb parameter to the ubuf zerocopy callbackJonathan Lemon
Add an optional skb parameter to the zerocopy callback parameter, which is passed down from skb_zcopy_clear(). This gives access to the original skb, which is needed for upcoming RX zero-copy error handling. Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07skbuff: replace sock_zerocopy_get with skb_zcopy_getJonathan Lemon
Rename the get routines for consistency. Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07skbuff: replace sock_zerocopy_put() with skb_zcopy_put()Jonathan Lemon
Replace sock_zerocopy_put with the generic skb_zcopy_put() function. Pass 'true' as the success argument, as this is identical to no change. Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07skbuff: Push status and refcounts into sock_zerocopy_callbackJonathan Lemon
Before this change, the caller of sock_zerocopy_callback would need to save the zerocopy status, decrement and check the refcount, and then call the callback function - the callback was only invoked when the refcount reached zero. Now, the caller just passes the status into the callback function, which saves the status and handles its own refcounts. This makes the behavior of the sock_zerocopy_callback identical to the tpacket and vhost callbacks. Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07skbuff: simplify sock_zerocopy_putJonathan Lemon
All 'struct ubuf_info' users should have a callback defined as of commit 0a4a060bb204 ("sock: fix zerocopy_success regression with msg_zerocopy"). Remove the dead code path to consume_skb(), which makes assumptions about how the structure was allocated. Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07net: dsa: remove the DSA specific notifiersVladimir Oltean
This effectively reverts commit 60724d4bae14 ("net: dsa: Add support for DSA specific notifiers"). The reason is that since commit 2f1e8ea726e9 ("net: dsa: link interfaces with the DSA master to get rid of lockdep warnings"), it appears that there is a generic way to achieve the same purpose. The only user thus far, the Broadcom SYSTEMPORT driver, was converted to use the generic notifiers. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07net: dsa: export dsa_slave_dev_checkVladimir Oltean
Using the NETDEV_CHANGEUPPER notifications, drivers can be aware when they are enslaved to e.g. a bridge by calling netif_is_bridge_master(). Export this helper from DSA to get the equivalent functionality of determining whether the upper interface of a CHANGEUPPER notifier is a DSA switch interface or not. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07net: dsa: move the Broadcom tag information in a separate header fileVladimir Oltean
It is a bit strange to see something as specific as Broadcom SYSTEMPORT bits in the main DSA include file. Move these away into a separate header, and have the tagger and the SYSTEMPORT driver include them. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07net: dsa: listen for SWITCHDEV_{FDB,DEL}_ADD_TO_DEVICE on foreign bridge ↵Vladimir Oltean
neighbors Some DSA switches (and not only) cannot learn source MAC addresses from packets injected from the CPU. They only perform hardware address learning from inbound traffic. This can be problematic when we have a bridge spanning some DSA switch ports and some non-DSA ports (which we'll call "foreign interfaces" from DSA's perspective). There are 2 classes of problems created by the lack of learning on CPU-injected traffic: - excessive flooding, due to the fact that DSA treats those addresses as unknown - the risk of stale routes, which can lead to temporary packet loss To illustrate the second class, consider the following situation, which is common in production equipment (wireless access points, where there is a WLAN interface and an Ethernet switch, and these form a single bridging domain). AP 1: +------------------------------------------------------------------------+ | br0 | +------------------------------------------------------------------------+ +------------+ +------------+ +------------+ +------------+ +------------+ | swp0 | | swp1 | | swp2 | | swp3 | | wlan0 | +------------+ +------------+ +------------+ +------------+ +------------+ | ^ ^ | | | | | | | Client A Client B | | | +------------+ +------------+ +------------+ +------------+ +------------+ | swp0 | | swp1 | | swp2 | | swp3 | | wlan0 | +------------+ +------------+ +------------+ +------------+ +------------+ +------------------------------------------------------------------------+ | br0 | +------------------------------------------------------------------------+ AP 2 - br0 of AP 1 will know that Clients A and B are reachable via wlan0 - the hardware fdb of a DSA switch driver today is not kept in sync with the software entries on other bridge ports, so it will not know that clients A and B are reachable via the CPU port UNLESS the hardware switch itself performs SA learning from traffic injected from the CPU. Nonetheless, a substantial number of switches don't. - the hardware fdb of the DSA switch on AP 2 may autonomously learn that Client A and B are reachable through swp0. Therefore, the software br0 of AP 2 also may or may not learn this. In the example we're illustrating, some Ethernet traffic has been going on, and br0 from AP 2 has indeed learnt that it can reach Client B through swp0. One of the wireless clients, say Client B, disconnects from AP 1 and roams to AP 2. The topology now looks like this: AP 1: +------------------------------------------------------------------------+ | br0 | +------------------------------------------------------------------------+ +------------+ +------------+ +------------+ +------------+ +------------+ | swp0 | | swp1 | | swp2 | | swp3 | | wlan0 | +------------+ +------------+ +------------+ +------------+ +------------+ | ^ | | | Client A | | | Client B | | | v +------------+ +------------+ +------------+ +------------+ +------------+ | swp0 | | swp1 | | swp2 | | swp3 | | wlan0 | +------------+ +------------+ +------------+ +------------+ +------------+ +------------------------------------------------------------------------+ | br0 | +------------------------------------------------------------------------+ AP 2 - br0 of AP 1 still knows that Client A is reachable via wlan0 (no change) - br0 of AP 1 will (possibly) know that Client B has left wlan0. There are cases where it might never find out though. Either way, DSA today does not process that notification in any way. - the hardware FDB of the DSA switch on AP 1 may learn autonomously that Client B can be reached via swp0, if it receives any packet with Client 1's source MAC address over Ethernet. - the hardware FDB of the DSA switch on AP 2 still thinks that Client B can be reached via swp0. It does not know that it has roamed to wlan0, because it doesn't perform SA learning from the CPU port. Now Client A contacts Client B. AP 1 routes the packet fine towards swp0 and delivers it on the Ethernet segment. AP 2 sees a frame on swp0 and its fdb says that the destination is swp0. Hairpinning is disabled => drop. This problem comes from the fact that these switches have a 'blind spot' for addresses coming from software bridging. The generic solution is not to assume that hardware learning can be enabled somehow, but to listen to more bridge learning events. It turns out that the bridge driver does learn in software from all inbound frames, in __br_handle_local_finish. A proper SWITCHDEV_FDB_ADD_TO_DEVICE notification is emitted for the addresses serviced by the bridge on 'foreign' interfaces. The software bridge also does the right thing on migration, by notifying that the old entry is deleted, so that does not need to be special-cased in DSA. When it is deleted, we just need to delete our static FDB entry towards the CPU too, and wait. The problem is that DSA currently only cares about SWITCHDEV_FDB_ADD_TO_DEVICE events received on its own interfaces, such as static FDB entries. Luckily we can change that, and DSA can listen to all switchdev FDB add/del events in the system and figure out if those events were emitted by a bridge that spans at least one of DSA's own ports. In case that is true, DSA will also offload that address towards its own CPU port, in the eventuality that there might be bridge clients attached to the DSA switch who want to talk to the station connected to the foreign interface. In terms of implementation, we need to keep the fdb_info->added_by_user check for the case where the switchdev event was targeted directly at a DSA switch port. But we don't need to look at that flag for snooped events. So the check is currently too late, we need to move it earlier. This also simplifies the code a bit, since we avoid uselessly allocating and freeing switchdev_work. We could probably do some improvements in the future. For example, multi-bridge support is rudimentary at the moment. If there are two bridges spanning a DSA switch's ports, and both of them need to service the same MAC address, then what will happen is that the migration of one of those stations will trigger the deletion of the FDB entry from the CPU port while it is still used by other bridge. That could be improved with reference counting but is left for another time. This behavior needs to be enabled at driver level by setting ds->assisted_learning_on_cpu_port = true. This is because we don't want to inflict a potential performance penalty (accesses through MDIO/I2C/SPI are expensive) to hardware that really doesn't need it because address learning on the CPU port works there. Reported-by: DENG Qingfang <dqfext@gmail.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07net: dsa: exit early in dsa_slave_switchdev_event if we can't program the FDBVladimir Oltean
Right now, the following would happen for a switch driver that does not implement .port_fdb_add or .port_fdb_del. dsa_slave_switchdev_event returns NOTIFY_OK and schedules: -> dsa_slave_switchdev_event_work -> dsa_port_fdb_add -> dsa_port_notify(DSA_NOTIFIER_FDB_ADD) -> dsa_switch_fdb_add -> if (!ds->ops->port_fdb_add) return -EOPNOTSUPP; -> an error is printed with dev_dbg, and dsa_fdb_offload_notify(switchdev_work) is not called. We can avoid scheduling the worker for nothing and say NOTIFY_DONE. Because we don't call dsa_fdb_offload_notify, the static FDB entry will remain just in the software bridge. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07net: dsa: move switchdev event implementation under the same switch/case ↵Vladimir Oltean
statement We'll need to start listening to SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE events even for interfaces where dsa_slave_dev_check returns false, so we need that check inside the switch-case statement for SWITCHDEV_FDB_*. This movement also avoids a useless allocation / free of switchdev_work on the untreated "default event" case. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07net: dsa: don't use switchdev_notifier_fdb_info in dsa_switchdev_event_workVladimir Oltean
Currently DSA doesn't add FDB entries on the CPU port, because it only does so through switchdev, which is associated with a net_device, and there are none of those for the CPU port. But actually FDB addresses on the CPU port have some use cases of their own, if the switchdev operations are initiated from within the DSA layer. There is just one problem with the existing code: it passes a structure in dsa_switchdev_event_work which was retrieved directly from switchdev, so it contains a net_device. We need to generalize the contents to something that covers the CPU port as well: the "ds, port" tuple is fine for that. Note that the new procedure for notifying the successful FDB offload is inspired from the rocker model. Also, nothing was being done if added_by_user was false. Let's check for that a lot earlier, and don't actually bother to schedule the worker for nothing. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07net: dsa: be louder when a non-legacy FDB operation failsVladimir Oltean
The dev_close() call was added in commit c9eb3e0f8701 ("net: dsa: Add support for learning FDB through notification") "to indicate inconsistent situation" when we could not delete an FDB entry from the port. bridge fdb del d8:58:d7:00:ca:6d dev swp0 self master It is a bit drastic and at the same time not helpful if the above fails to only print with netdev_dbg log level, but on the other hand to bring the interface down. So increase the verbosity of the error message, and drop dev_close(). Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07net: bridge: notify switchdev of disappearance of old FDB entry upon migrationVladimir Oltean
Currently the bridge emits atomic switchdev notifications for dynamically learnt FDB entries. Monitoring these notifications works wonders for switchdev drivers that want to keep their hardware FDB in sync with the bridge's FDB. For example station A wants to talk to station B in the diagram below, and we are concerned with the behavior of the bridge on the DUT device: DUT +-------------------------------------+ | br0 | | +------+ +------+ +------+ +------+ | | | | | | | | | | | | | swp0 | | swp1 | | swp2 | | eth0 | | +-------------------------------------+ | | | Station A | | | | +--+------+--+ +--+------+--+ | | | | | | | | | | swp0 | | | | swp0 | | Another | +------+ | | +------+ | Another switch | br0 | | br0 | switch | +------+ | | +------+ | | | | | | | | | | | swp1 | | | | swp1 | | +--+------+--+ +--+------+--+ | Station B Interfaces swp0, swp1, swp2 are handled by a switchdev driver that has the following property: frames injected from its control interface bypass the internal address analyzer logic, and therefore, this hardware does not learn from the source address of packets transmitted by the network stack through it. So, since bridging between eth0 (where Station B is attached) and swp0 (where Station A is attached) is done in software, the switchdev hardware will never learn the source address of Station B. So the traffic towards that destination will be treated as unknown, i.e. flooded. This is where the bridge notifications come in handy. When br0 on the DUT sees frames with Station B's MAC address on eth0, the switchdev driver gets these notifications and can install a rule to send frames towards Station B's address that are incoming from swp0, swp1, swp2, only towards the control interface. This is all switchdev driver private business, which the notification makes possible. All is fine until someone unplugs Station B's cable and moves it to the other switch: DUT +-------------------------------------+ | br0 | | +------+ +------+ +------+ +------+ | | | | | | | | | | | | | swp0 | | swp1 | | swp2 | | eth0 | | +-------------------------------------+ | | | Station A | | | | +--+------+--+ +--+------+--+ | | | | | | | | | | swp0 | | | | swp0 | | Another | +------+ | | +------+ | Another switch | br0 | | br0 | switch | +------+ | | +------+ | | | | | | | | | | | swp1 | | | | swp1 | | +--+------+--+ +--+------+--+ | Station B Luckily for the use cases we care about, Station B is noisy enough that the DUT hears it (on swp1 this time). swp1 receives the frames and delivers them to the bridge, who enters the unlikely path in br_fdb_update of updating an existing entry. It moves the entry in the software bridge to swp1 and emits an addition notification towards that. As far as the switchdev driver is concerned, all that it needs to ensure is that traffic between Station A and Station B is not forever broken. If it does nothing, then the stale rule to send frames for Station B towards the control interface remains in place. But Station B is no longer reachable via the control interface, but via a port that can offload the bridge port learning attribute. It's just that the port is prevented from learning this address, since the rule overrides FDB updates. So the rule needs to go. The question is via what mechanism. It sure would be possible for this switchdev driver to keep track of all addresses which are sent to the control interface, and then also listen for bridge notifier events on its own ports, searching for the ones that have a MAC address which was previously sent to the control interface. But this is cumbersome and inefficient. Instead, with one small change, the bridge could notify of the address deletion from the old port, in a symmetrical manner with how it did for the insertion. Then the switchdev driver would not be required to monitor learn/forget events for its own ports. It could just delete the rule towards the control interface upon bridge entry migration. This would make hardware address learning be possible again. Then it would take a few more packets until the hardware and software FDB would be in sync again. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07net: ip: always refragment ip defragmented packetsFlorian Westphal
Conntrack reassembly records the largest fragment size seen in IPCB. However, when this gets forwarded/transmitted, fragmentation will only be forced if one of the fragmented packets had the DF bit set. In that case, a flag in IPCB will force fragmentation even if the MTU is large enough. This should work fine, but this breaks with ip tunnels. Consider client that sends a UDP datagram of size X to another host. The client fragments the datagram, so two packets, of size y and z, are sent. DF bit is not set on any of these packets. Middlebox netfilter reassembles those packets back to single size-X packet, before routing decision. packet-size-vs-mtu checks in ip_forward are irrelevant, because DF bit isn't set. At output time, ip refragmentation is skipped as well because x is still smaller than the mtu of the output device. If ttransmit device is an ip tunnel, the packet size increases to x+overhead. Also, tunnel might be configured to force DF bit on outer header. In this case, packet will be dropped (exceeds MTU) and an ICMP error is generated back to sender. But sender already respects the announced MTU, all the packets that it sent did fit the announced mtu. Force refragmentation as per original sizes unconditionally so ip tunnel will encapsulate the fragments instead. The only other solution I see is to place ip refragmentation in the ip_tunnel code to handle this case. Fixes: d6b915e29f4ad ("ip_fragment: don't forward defragmented DF packet") Reported-by: Christian Perle <christian.perle@secunet.com> Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07net: fix pmtu check in nopmtudisc modeFlorian Westphal
For some reason ip_tunnel insist on setting the DF bit anyway when the inner header has the DF bit set, EVEN if the tunnel was configured with 'nopmtudisc'. This means that the script added in the previous commit cannot be made to work by adding the 'nopmtudisc' flag to the ip tunnel configuration. Doing so breaks connectivity even for the without-conntrack/netfilter scenario. When nopmtudisc is set, the tunnel will skip the mtu check, so no icmp error is sent to client. Then, because inner header has DF set, the outer header gets added with DF bit set as well. IP stack then sends an error to itself because the packet exceeds the device MTU. Fixes: 23a3647bc4f93 ("ip_tunnels: Use skb-len to PMTU check.") Cc: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07udp_tunnel: reshuffle NETIF_F_RX_UDP_TUNNEL_PORT checksJakub Kicinski
Move the NETIF_F_RX_UDP_TUNNEL_PORT feature check into udp_tunnel_nic_*_port() helpers, since they're always done right before the call. Add similar checks before calling the notifier. udp_tunnel_nic invokes the notifier without checking features which could result in some wasted cycles. Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07udp_tunnel: hard-wire NDOs to udp_tunnel_nic_*_port() helpersJakub Kicinski
All drivers use udp_tunnel_nic_*_port() helpers, prepare for NDO removal by invoking those helpers directly. The helpers are safe to call on all devices, they check if device has the UDP tunnel state initialized. Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07net: ipv6: fib: flush exceptions when purging routeSean Tranchetti
Route removal is handled by two code paths. The main removal path is via fib6_del_route() which will handle purging any PMTU exceptions from the cache, removing all per-cpu copies of the DST entry used by the route, and releasing the fib6_info struct. The second removal location is during fib6_add_rt2node() during a route replacement operation. This path also calls fib6_purge_rt() to handle cleaning up the per-cpu copies of the DST entries and releasing the fib6_info associated with the older route, but it does not flush any PMTU exceptions that the older route had. Since the older route is removed from the tree during the replacement, we lose any way of accessing it again. As these lingering DSTs and the fib6_info struct are holding references to the underlying netdevice struct as well, unregistering that device from the kernel can never complete. Fixes: 2b760fcf5cfb3 ("ipv6: hook up exception table to store dst cache") Signed-off-by: Sean Tranchetti <stranche@codeaurora.org> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/1609892546-11389-1-git-send-email-stranche@quicinc.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07Merge tag 'linux-can-next-for-5.12-20210106' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next Marc Kleine-Budde says: ==================== pull-request: can-next 2021-01-06 The first 16 patches are by me and target the tcan4x5x SPI glue driver for the m_can CAN driver. First there are a several cleanup commits, then the SPI regmap part is converted to 8 bits per word, to make it possible to use that driver on SPI controllers that only support the 8 bit per word mode (such as the SPI cores on the raspberry pi). Oliver Hartkopp contributes a patch for the CAN_RAW protocol. The getsockopt() for CAN_RAW_FILTER is changed to return -ERANGE if the filterset does not fit into the provided user space buffer. The last two patches are by Joakim Zhang and add wakeup support to the flexcan driver for the i.MX8QM SoC. The dt-bindings docs are extended to describe the added property. * tag 'linux-can-next-for-5.12-20210106' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next: can: flexcan: add CAN wakeup function for i.MX8QM dt-bindings: can: fsl,flexcan: add fsl,scu-index property to indicate a resource can: raw: return -ERANGE when filterset does not fit into user space buffer can: tcan4x5x: add support for half-duplex controllers can: tcan4x5x: rework SPI access can: tcan4x5x: add {wr,rd}_table can: tcan4x5x: add max_raw_{read,write} of 256 can: tcan4x5x: tcan4x5x_regmap: set reg_stride to 4 can: tcan4x5x: fix max register value can: tcan4x5x: tcan4x5x_regmap_init(): use spi as context pointer can: tcan4x5x: tcan4x5x_regmap_write(): remove not needed casts and replace 4 by sizeof can: tcan4x5x: rename regmap_spi_gather_write() -> tcan4x5x_regmap_gather_write() can: tcan4x5x: remove regmap async support can: tcan4x5x: tcan4x5x_bus: remove not needed read_flag_mask can: tcan4x5x: mark struct regmap_bus tcan4x5x_bus as constant can: tcan4x5x: move regmap code into seperate file can: tcan4x5x: rename tcan4x5x.c -> tcan4x5x-core.c can: tcan4x5x: beautify indention of tcan4x5x_of_match and tcan4x5x_id_table can: tcan4x5x: replace DEVICE_NAME by KBUILD_MODNAME ==================== Link: https://lore.kernel.org/r/20210107094900.173046-1-mkl@pengutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-06net: dsa: print error on invalid port indexRafał Miłecki
Looking for an -EINVAL all over the dsa code could take hours for inexperienced DSA users. Signed-off-by: Rafał Miłecki <rafal@milecki.pl> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://lore.kernel.org/r/20210106090915.21439-1-zajec5@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-06can: raw: return -ERANGE when filterset does not fit into user space bufferOliver Hartkopp
Multiple filters (struct can_filter) can be set with the setsockopt() function, which was originally intended as a write-only operation. As getsockopt() also provides a CAN_RAW_FILTER option to read back the given filters, the caller has to provide an appropriate user space buffer. In the case this buffer is too small the getsockopt() silently truncates the filter information and gives no information about the needed space. This is safe but not convenient for the programmer. In net/core/sock.c the SO_PEERGROUPS sockopt had a similar requirement and solved it by returning -ERANGE in the case that the provided data does not fit into the given user space buffer and fills the required size into optlen, so that the caller can retry with a matching buffer length. This patch adopts this approach for CAN_RAW_FILTER getsockopt(). Reported-by: Phillip Schichtel <phillip@schich.tel> Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Tested-By: Phillip Schichtel <phillip@schich.tel> Link: https://lore.kernel.org/r/20201216174928.21663-1-socketcan@hartkopp.net Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2021-01-06Bluetooth: avoid u128_xor() on potentially misaligned inputsArd Biesheuvel
u128_xor() takes pointers to quantities that are assumed to be at least 64-bit aligned, which is not guaranteed to be the case in the smp_c1() routine. So switch to crypto_xor() instead. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2021-01-05net: qrtr: fix null-ptr-deref in qrtr_ns_removeQinglang Miao
A null-ptr-deref bug is reported by Hulk Robot like this: -------------- KASAN: null-ptr-deref in range [0x0000000000000128-0x000000000000012f] Call Trace: qrtr_ns_remove+0x22/0x40 [ns] qrtr_proto_fini+0xa/0x31 [qrtr] __x64_sys_delete_module+0x337/0x4e0 do_syscall_64+0x34/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x468ded -------------- When qrtr_ns_init fails in qrtr_proto_init, qrtr_ns_remove which would be called later on would raise a null-ptr-deref because qrtr_ns.workqueue has been destroyed. Fix it by making qrtr_ns_init have a return value and adding a check in qrtr_proto_init. Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Qinglang Miao <miaoqinglang@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-01-05net: vlan: avoid leaks on register_vlan_dev() failuresJakub Kicinski
VLAN checks for NETREG_UNINITIALIZED to distinguish between registration failure and unregistration in progress. Since commit cb626bf566eb ("net-sysfs: Fix reference count leak") registration failure may, however, result in NETREG_UNREGISTERED as well as NETREG_UNINITIALIZED. This fix is similer to cebb69754f37 ("rtnetlink: Fix memory(net_device) leak when ->newlink fails") Fixes: cb626bf566eb ("net-sysfs: Fix reference count leak") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-01-05net: nfc: nci: Change the NCI close sequenceBongsu Jeon
If there is a NCI command in work queue after closing the NCI device at nci_unregister_device, The NCI command timer starts at flush_workqueue function and then NCI command timeout handler would be called 5 second after flushing the NCI command work queue and destroying the queue. At that time, the timeout handler would try to use NCI command work queue that is destroyed already. it will causes the problem. To avoid this abnormal situation, change the sequence to prevent the NCI command timeout handler from being called after destroying the NCI command work queue. Signed-off-by: Bongsu Jeon <bongsu.jeon@samsung.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-01-05net: kcm: Replace fput with sockfd_putZheng Yongjun
The function sockfd_lookup uses fget on the value that is stored in the file field of the returned structure, so fput should ultimately be applied to this value. This can be done directly, but it seems better to use the specific macro sockfd_put, which does the same thing. Perform a source code refactoring by using the following semantic patch. // <smpl> @@ expression s; @@ s = sockfd_lookup(...) ... + sockfd_put(s); - fput(s->file); // </smpl> Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-01-05cfg80211: select CONFIG_CRC32Arnd Bergmann
Without crc32 support, this fails to link: arm-linux-gnueabi-ld: net/wireless/scan.o: in function `cfg80211_scan_6ghz': scan.c:(.text+0x928): undefined reference to `crc32_le' Fixes: c8cb5b854b40 ("nl80211/cfg80211: support 6 GHz scanning") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-01-05net: tipc: Replace expression with offsetof()Zheng Yongjun
Use the existing offsetof() macro instead of duplicating code. Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>