summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2021-01-08xfrm: interface: enable TSO on xfrm interfacesEyal Birger
Underlying xfrm output supports gso packets. Declare support in hw_features and adapt the xmit MTU check to pass GSO packets. Signed-off-by: Eyal Birger <eyal.birger@gmail.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2021-01-07nexthop: Bounce NHA_GATEWAY in FDB nexthop groupsPetr Machata
The function nh_check_attr_group() is called to validate nexthop groups. The intention of that code seems to have been to bounce all attributes above NHA_GROUP_TYPE except for NHA_FDB. However instead it bounces all these attributes except when NHA_FDB attribute is present--then it accepts them. NHA_FDB validation that takes place before, in rtm_to_nh_config(), already bounces NHA_OIF, NHA_BLACKHOLE, NHA_ENCAP and NHA_ENCAP_TYPE. Yet further back, NHA_GROUPS and NHA_MASTER are bounced unconditionally. But that still leaves NHA_GATEWAY as an attribute that would be accepted in FDB nexthop groups (with no meaning), so long as it keeps the address family as unspecified: # ip nexthop add id 1 fdb via 127.0.0.1 # ip nexthop add id 10 fdb via default group 1 The nexthop code is still relatively new and likely not used very broadly, and the FDB bits are newer still. Even though there is a reproducer out there, it relies on an improbable gateway arguments "via default", "via all" or "via any". Given all this, I believe it is OK to reformulate the condition to do the right thing and bounce NHA_GATEWAY. Fixes: 38428d68719c ("nexthop: support for fdb ecmp nexthops") Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07nexthop: Unlink nexthop group entry in error pathIdo Schimmel
In case of error, remove the nexthop group entry from the list to which it was previously added. Fixes: 430a049190de ("nexthop: Add support for nexthop groups") Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07nexthop: Fix off-by-one error in error pathIdo Schimmel
A reference was not taken for the current nexthop entry, so do not try to put it in the error path. Fixes: 430a049190de ("nexthop: Add support for nexthop groups") Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07skbuff: Rename skb_zcopy_{get|put} to net_zcopy_{get|put}Jonathan Lemon
Unlike the rest of the skb_zcopy_ functions, these routines operate on a 'struct ubuf', not a skb. Remove the 'skb_' prefix from the naming to make things clearer. Suggested-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com> Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07skbuff: add flags to ubuf_info for ubuf setupJonathan Lemon
Currently, when an ubuf is attached to a new skb, the shared flags word is initialized to a fixed value. Instead of doing this, set the default flags in the ubuf, and have new skbs inherit from this default. This is needed when setting up different zerocopy types. Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07net: group skb_shinfo zerocopy related bits together.Jonathan Lemon
In preparation for expanded zerocopy (TX and RX), move the zerocopy related bits out of tx_flags into their own flag word. Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07skbuff: rename sock_zerocopy_* to msg_zerocopy_*Jonathan Lemon
At Willem's suggestion, rename the sock_zerocopy_* functions so that they match the MSG_ZEROCOPY flag, which makes it clear they are specific to this zerocopy implementation. Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07skbuff: Call skb_zcopy_clear() before unref'ing fragmentsJonathan Lemon
RX zerocopy fragment pages which are not allocated from the system page pool require special handling. Give the callback in skb_zcopy_clear() a chance to process them first. Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07skbuff: Call sock_zerocopy_put_abort from skb_zcopy_put_abortJonathan Lemon
The sock_zerocopy_put_abort function contains logic which is specific to the current zerocopy implementation. Add a wrapper which checks the callback and dispatches apppropriately. Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07skbuff: Add skb parameter to the ubuf zerocopy callbackJonathan Lemon
Add an optional skb parameter to the zerocopy callback parameter, which is passed down from skb_zcopy_clear(). This gives access to the original skb, which is needed for upcoming RX zero-copy error handling. Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07skbuff: replace sock_zerocopy_get with skb_zcopy_getJonathan Lemon
Rename the get routines for consistency. Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07skbuff: replace sock_zerocopy_put() with skb_zcopy_put()Jonathan Lemon
Replace sock_zerocopy_put with the generic skb_zcopy_put() function. Pass 'true' as the success argument, as this is identical to no change. Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07skbuff: Push status and refcounts into sock_zerocopy_callbackJonathan Lemon
Before this change, the caller of sock_zerocopy_callback would need to save the zerocopy status, decrement and check the refcount, and then call the callback function - the callback was only invoked when the refcount reached zero. Now, the caller just passes the status into the callback function, which saves the status and handles its own refcounts. This makes the behavior of the sock_zerocopy_callback identical to the tpacket and vhost callbacks. Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07skbuff: simplify sock_zerocopy_putJonathan Lemon
All 'struct ubuf_info' users should have a callback defined as of commit 0a4a060bb204 ("sock: fix zerocopy_success regression with msg_zerocopy"). Remove the dead code path to consume_skb(), which makes assumptions about how the structure was allocated. Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07net: dsa: remove the DSA specific notifiersVladimir Oltean
This effectively reverts commit 60724d4bae14 ("net: dsa: Add support for DSA specific notifiers"). The reason is that since commit 2f1e8ea726e9 ("net: dsa: link interfaces with the DSA master to get rid of lockdep warnings"), it appears that there is a generic way to achieve the same purpose. The only user thus far, the Broadcom SYSTEMPORT driver, was converted to use the generic notifiers. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07net: dsa: export dsa_slave_dev_checkVladimir Oltean
Using the NETDEV_CHANGEUPPER notifications, drivers can be aware when they are enslaved to e.g. a bridge by calling netif_is_bridge_master(). Export this helper from DSA to get the equivalent functionality of determining whether the upper interface of a CHANGEUPPER notifier is a DSA switch interface or not. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07net: dsa: move the Broadcom tag information in a separate header fileVladimir Oltean
It is a bit strange to see something as specific as Broadcom SYSTEMPORT bits in the main DSA include file. Move these away into a separate header, and have the tagger and the SYSTEMPORT driver include them. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07net: dsa: listen for SWITCHDEV_{FDB,DEL}_ADD_TO_DEVICE on foreign bridge ↵Vladimir Oltean
neighbors Some DSA switches (and not only) cannot learn source MAC addresses from packets injected from the CPU. They only perform hardware address learning from inbound traffic. This can be problematic when we have a bridge spanning some DSA switch ports and some non-DSA ports (which we'll call "foreign interfaces" from DSA's perspective). There are 2 classes of problems created by the lack of learning on CPU-injected traffic: - excessive flooding, due to the fact that DSA treats those addresses as unknown - the risk of stale routes, which can lead to temporary packet loss To illustrate the second class, consider the following situation, which is common in production equipment (wireless access points, where there is a WLAN interface and an Ethernet switch, and these form a single bridging domain). AP 1: +------------------------------------------------------------------------+ | br0 | +------------------------------------------------------------------------+ +------------+ +------------+ +------------+ +------------+ +------------+ | swp0 | | swp1 | | swp2 | | swp3 | | wlan0 | +------------+ +------------+ +------------+ +------------+ +------------+ | ^ ^ | | | | | | | Client A Client B | | | +------------+ +------------+ +------------+ +------------+ +------------+ | swp0 | | swp1 | | swp2 | | swp3 | | wlan0 | +------------+ +------------+ +------------+ +------------+ +------------+ +------------------------------------------------------------------------+ | br0 | +------------------------------------------------------------------------+ AP 2 - br0 of AP 1 will know that Clients A and B are reachable via wlan0 - the hardware fdb of a DSA switch driver today is not kept in sync with the software entries on other bridge ports, so it will not know that clients A and B are reachable via the CPU port UNLESS the hardware switch itself performs SA learning from traffic injected from the CPU. Nonetheless, a substantial number of switches don't. - the hardware fdb of the DSA switch on AP 2 may autonomously learn that Client A and B are reachable through swp0. Therefore, the software br0 of AP 2 also may or may not learn this. In the example we're illustrating, some Ethernet traffic has been going on, and br0 from AP 2 has indeed learnt that it can reach Client B through swp0. One of the wireless clients, say Client B, disconnects from AP 1 and roams to AP 2. The topology now looks like this: AP 1: +------------------------------------------------------------------------+ | br0 | +------------------------------------------------------------------------+ +------------+ +------------+ +------------+ +------------+ +------------+ | swp0 | | swp1 | | swp2 | | swp3 | | wlan0 | +------------+ +------------+ +------------+ +------------+ +------------+ | ^ | | | Client A | | | Client B | | | v +------------+ +------------+ +------------+ +------------+ +------------+ | swp0 | | swp1 | | swp2 | | swp3 | | wlan0 | +------------+ +------------+ +------------+ +------------+ +------------+ +------------------------------------------------------------------------+ | br0 | +------------------------------------------------------------------------+ AP 2 - br0 of AP 1 still knows that Client A is reachable via wlan0 (no change) - br0 of AP 1 will (possibly) know that Client B has left wlan0. There are cases where it might never find out though. Either way, DSA today does not process that notification in any way. - the hardware FDB of the DSA switch on AP 1 may learn autonomously that Client B can be reached via swp0, if it receives any packet with Client 1's source MAC address over Ethernet. - the hardware FDB of the DSA switch on AP 2 still thinks that Client B can be reached via swp0. It does not know that it has roamed to wlan0, because it doesn't perform SA learning from the CPU port. Now Client A contacts Client B. AP 1 routes the packet fine towards swp0 and delivers it on the Ethernet segment. AP 2 sees a frame on swp0 and its fdb says that the destination is swp0. Hairpinning is disabled => drop. This problem comes from the fact that these switches have a 'blind spot' for addresses coming from software bridging. The generic solution is not to assume that hardware learning can be enabled somehow, but to listen to more bridge learning events. It turns out that the bridge driver does learn in software from all inbound frames, in __br_handle_local_finish. A proper SWITCHDEV_FDB_ADD_TO_DEVICE notification is emitted for the addresses serviced by the bridge on 'foreign' interfaces. The software bridge also does the right thing on migration, by notifying that the old entry is deleted, so that does not need to be special-cased in DSA. When it is deleted, we just need to delete our static FDB entry towards the CPU too, and wait. The problem is that DSA currently only cares about SWITCHDEV_FDB_ADD_TO_DEVICE events received on its own interfaces, such as static FDB entries. Luckily we can change that, and DSA can listen to all switchdev FDB add/del events in the system and figure out if those events were emitted by a bridge that spans at least one of DSA's own ports. In case that is true, DSA will also offload that address towards its own CPU port, in the eventuality that there might be bridge clients attached to the DSA switch who want to talk to the station connected to the foreign interface. In terms of implementation, we need to keep the fdb_info->added_by_user check for the case where the switchdev event was targeted directly at a DSA switch port. But we don't need to look at that flag for snooped events. So the check is currently too late, we need to move it earlier. This also simplifies the code a bit, since we avoid uselessly allocating and freeing switchdev_work. We could probably do some improvements in the future. For example, multi-bridge support is rudimentary at the moment. If there are two bridges spanning a DSA switch's ports, and both of them need to service the same MAC address, then what will happen is that the migration of one of those stations will trigger the deletion of the FDB entry from the CPU port while it is still used by other bridge. That could be improved with reference counting but is left for another time. This behavior needs to be enabled at driver level by setting ds->assisted_learning_on_cpu_port = true. This is because we don't want to inflict a potential performance penalty (accesses through MDIO/I2C/SPI are expensive) to hardware that really doesn't need it because address learning on the CPU port works there. Reported-by: DENG Qingfang <dqfext@gmail.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07net: dsa: exit early in dsa_slave_switchdev_event if we can't program the FDBVladimir Oltean
Right now, the following would happen for a switch driver that does not implement .port_fdb_add or .port_fdb_del. dsa_slave_switchdev_event returns NOTIFY_OK and schedules: -> dsa_slave_switchdev_event_work -> dsa_port_fdb_add -> dsa_port_notify(DSA_NOTIFIER_FDB_ADD) -> dsa_switch_fdb_add -> if (!ds->ops->port_fdb_add) return -EOPNOTSUPP; -> an error is printed with dev_dbg, and dsa_fdb_offload_notify(switchdev_work) is not called. We can avoid scheduling the worker for nothing and say NOTIFY_DONE. Because we don't call dsa_fdb_offload_notify, the static FDB entry will remain just in the software bridge. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07net: dsa: move switchdev event implementation under the same switch/case ↵Vladimir Oltean
statement We'll need to start listening to SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE events even for interfaces where dsa_slave_dev_check returns false, so we need that check inside the switch-case statement for SWITCHDEV_FDB_*. This movement also avoids a useless allocation / free of switchdev_work on the untreated "default event" case. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07net: dsa: don't use switchdev_notifier_fdb_info in dsa_switchdev_event_workVladimir Oltean
Currently DSA doesn't add FDB entries on the CPU port, because it only does so through switchdev, which is associated with a net_device, and there are none of those for the CPU port. But actually FDB addresses on the CPU port have some use cases of their own, if the switchdev operations are initiated from within the DSA layer. There is just one problem with the existing code: it passes a structure in dsa_switchdev_event_work which was retrieved directly from switchdev, so it contains a net_device. We need to generalize the contents to something that covers the CPU port as well: the "ds, port" tuple is fine for that. Note that the new procedure for notifying the successful FDB offload is inspired from the rocker model. Also, nothing was being done if added_by_user was false. Let's check for that a lot earlier, and don't actually bother to schedule the worker for nothing. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07net: dsa: be louder when a non-legacy FDB operation failsVladimir Oltean
The dev_close() call was added in commit c9eb3e0f8701 ("net: dsa: Add support for learning FDB through notification") "to indicate inconsistent situation" when we could not delete an FDB entry from the port. bridge fdb del d8:58:d7:00:ca:6d dev swp0 self master It is a bit drastic and at the same time not helpful if the above fails to only print with netdev_dbg log level, but on the other hand to bring the interface down. So increase the verbosity of the error message, and drop dev_close(). Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07net: bridge: notify switchdev of disappearance of old FDB entry upon migrationVladimir Oltean
Currently the bridge emits atomic switchdev notifications for dynamically learnt FDB entries. Monitoring these notifications works wonders for switchdev drivers that want to keep their hardware FDB in sync with the bridge's FDB. For example station A wants to talk to station B in the diagram below, and we are concerned with the behavior of the bridge on the DUT device: DUT +-------------------------------------+ | br0 | | +------+ +------+ +------+ +------+ | | | | | | | | | | | | | swp0 | | swp1 | | swp2 | | eth0 | | +-------------------------------------+ | | | Station A | | | | +--+------+--+ +--+------+--+ | | | | | | | | | | swp0 | | | | swp0 | | Another | +------+ | | +------+ | Another switch | br0 | | br0 | switch | +------+ | | +------+ | | | | | | | | | | | swp1 | | | | swp1 | | +--+------+--+ +--+------+--+ | Station B Interfaces swp0, swp1, swp2 are handled by a switchdev driver that has the following property: frames injected from its control interface bypass the internal address analyzer logic, and therefore, this hardware does not learn from the source address of packets transmitted by the network stack through it. So, since bridging between eth0 (where Station B is attached) and swp0 (where Station A is attached) is done in software, the switchdev hardware will never learn the source address of Station B. So the traffic towards that destination will be treated as unknown, i.e. flooded. This is where the bridge notifications come in handy. When br0 on the DUT sees frames with Station B's MAC address on eth0, the switchdev driver gets these notifications and can install a rule to send frames towards Station B's address that are incoming from swp0, swp1, swp2, only towards the control interface. This is all switchdev driver private business, which the notification makes possible. All is fine until someone unplugs Station B's cable and moves it to the other switch: DUT +-------------------------------------+ | br0 | | +------+ +------+ +------+ +------+ | | | | | | | | | | | | | swp0 | | swp1 | | swp2 | | eth0 | | +-------------------------------------+ | | | Station A | | | | +--+------+--+ +--+------+--+ | | | | | | | | | | swp0 | | | | swp0 | | Another | +------+ | | +------+ | Another switch | br0 | | br0 | switch | +------+ | | +------+ | | | | | | | | | | | swp1 | | | | swp1 | | +--+------+--+ +--+------+--+ | Station B Luckily for the use cases we care about, Station B is noisy enough that the DUT hears it (on swp1 this time). swp1 receives the frames and delivers them to the bridge, who enters the unlikely path in br_fdb_update of updating an existing entry. It moves the entry in the software bridge to swp1 and emits an addition notification towards that. As far as the switchdev driver is concerned, all that it needs to ensure is that traffic between Station A and Station B is not forever broken. If it does nothing, then the stale rule to send frames for Station B towards the control interface remains in place. But Station B is no longer reachable via the control interface, but via a port that can offload the bridge port learning attribute. It's just that the port is prevented from learning this address, since the rule overrides FDB updates. So the rule needs to go. The question is via what mechanism. It sure would be possible for this switchdev driver to keep track of all addresses which are sent to the control interface, and then also listen for bridge notifier events on its own ports, searching for the ones that have a MAC address which was previously sent to the control interface. But this is cumbersome and inefficient. Instead, with one small change, the bridge could notify of the address deletion from the old port, in a symmetrical manner with how it did for the insertion. Then the switchdev driver would not be required to monitor learn/forget events for its own ports. It could just delete the rule towards the control interface upon bridge entry migration. This would make hardware address learning be possible again. Then it would take a few more packets until the hardware and software FDB would be in sync again. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07net: ip: always refragment ip defragmented packetsFlorian Westphal
Conntrack reassembly records the largest fragment size seen in IPCB. However, when this gets forwarded/transmitted, fragmentation will only be forced if one of the fragmented packets had the DF bit set. In that case, a flag in IPCB will force fragmentation even if the MTU is large enough. This should work fine, but this breaks with ip tunnels. Consider client that sends a UDP datagram of size X to another host. The client fragments the datagram, so two packets, of size y and z, are sent. DF bit is not set on any of these packets. Middlebox netfilter reassembles those packets back to single size-X packet, before routing decision. packet-size-vs-mtu checks in ip_forward are irrelevant, because DF bit isn't set. At output time, ip refragmentation is skipped as well because x is still smaller than the mtu of the output device. If ttransmit device is an ip tunnel, the packet size increases to x+overhead. Also, tunnel might be configured to force DF bit on outer header. In this case, packet will be dropped (exceeds MTU) and an ICMP error is generated back to sender. But sender already respects the announced MTU, all the packets that it sent did fit the announced mtu. Force refragmentation as per original sizes unconditionally so ip tunnel will encapsulate the fragments instead. The only other solution I see is to place ip refragmentation in the ip_tunnel code to handle this case. Fixes: d6b915e29f4ad ("ip_fragment: don't forward defragmented DF packet") Reported-by: Christian Perle <christian.perle@secunet.com> Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07net: fix pmtu check in nopmtudisc modeFlorian Westphal
For some reason ip_tunnel insist on setting the DF bit anyway when the inner header has the DF bit set, EVEN if the tunnel was configured with 'nopmtudisc'. This means that the script added in the previous commit cannot be made to work by adding the 'nopmtudisc' flag to the ip tunnel configuration. Doing so breaks connectivity even for the without-conntrack/netfilter scenario. When nopmtudisc is set, the tunnel will skip the mtu check, so no icmp error is sent to client. Then, because inner header has DF set, the outer header gets added with DF bit set as well. IP stack then sends an error to itself because the packet exceeds the device MTU. Fixes: 23a3647bc4f93 ("ip_tunnels: Use skb-len to PMTU check.") Cc: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07udp_tunnel: reshuffle NETIF_F_RX_UDP_TUNNEL_PORT checksJakub Kicinski
Move the NETIF_F_RX_UDP_TUNNEL_PORT feature check into udp_tunnel_nic_*_port() helpers, since they're always done right before the call. Add similar checks before calling the notifier. udp_tunnel_nic invokes the notifier without checking features which could result in some wasted cycles. Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07udp_tunnel: hard-wire NDOs to udp_tunnel_nic_*_port() helpersJakub Kicinski
All drivers use udp_tunnel_nic_*_port() helpers, prepare for NDO removal by invoking those helpers directly. The helpers are safe to call on all devices, they check if device has the UDP tunnel state initialized. Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07net: ipv6: fib: flush exceptions when purging routeSean Tranchetti
Route removal is handled by two code paths. The main removal path is via fib6_del_route() which will handle purging any PMTU exceptions from the cache, removing all per-cpu copies of the DST entry used by the route, and releasing the fib6_info struct. The second removal location is during fib6_add_rt2node() during a route replacement operation. This path also calls fib6_purge_rt() to handle cleaning up the per-cpu copies of the DST entries and releasing the fib6_info associated with the older route, but it does not flush any PMTU exceptions that the older route had. Since the older route is removed from the tree during the replacement, we lose any way of accessing it again. As these lingering DSTs and the fib6_info struct are holding references to the underlying netdevice struct as well, unregistering that device from the kernel can never complete. Fixes: 2b760fcf5cfb3 ("ipv6: hook up exception table to store dst cache") Signed-off-by: Sean Tranchetti <stranche@codeaurora.org> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/1609892546-11389-1-git-send-email-stranche@quicinc.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07Merge tag 'linux-can-next-for-5.12-20210106' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next Marc Kleine-Budde says: ==================== pull-request: can-next 2021-01-06 The first 16 patches are by me and target the tcan4x5x SPI glue driver for the m_can CAN driver. First there are a several cleanup commits, then the SPI regmap part is converted to 8 bits per word, to make it possible to use that driver on SPI controllers that only support the 8 bit per word mode (such as the SPI cores on the raspberry pi). Oliver Hartkopp contributes a patch for the CAN_RAW protocol. The getsockopt() for CAN_RAW_FILTER is changed to return -ERANGE if the filterset does not fit into the provided user space buffer. The last two patches are by Joakim Zhang and add wakeup support to the flexcan driver for the i.MX8QM SoC. The dt-bindings docs are extended to describe the added property. * tag 'linux-can-next-for-5.12-20210106' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next: can: flexcan: add CAN wakeup function for i.MX8QM dt-bindings: can: fsl,flexcan: add fsl,scu-index property to indicate a resource can: raw: return -ERANGE when filterset does not fit into user space buffer can: tcan4x5x: add support for half-duplex controllers can: tcan4x5x: rework SPI access can: tcan4x5x: add {wr,rd}_table can: tcan4x5x: add max_raw_{read,write} of 256 can: tcan4x5x: tcan4x5x_regmap: set reg_stride to 4 can: tcan4x5x: fix max register value can: tcan4x5x: tcan4x5x_regmap_init(): use spi as context pointer can: tcan4x5x: tcan4x5x_regmap_write(): remove not needed casts and replace 4 by sizeof can: tcan4x5x: rename regmap_spi_gather_write() -> tcan4x5x_regmap_gather_write() can: tcan4x5x: remove regmap async support can: tcan4x5x: tcan4x5x_bus: remove not needed read_flag_mask can: tcan4x5x: mark struct regmap_bus tcan4x5x_bus as constant can: tcan4x5x: move regmap code into seperate file can: tcan4x5x: rename tcan4x5x.c -> tcan4x5x-core.c can: tcan4x5x: beautify indention of tcan4x5x_of_match and tcan4x5x_id_table can: tcan4x5x: replace DEVICE_NAME by KBUILD_MODNAME ==================== Link: https://lore.kernel.org/r/20210107094900.173046-1-mkl@pengutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-06net: dsa: print error on invalid port indexRafał Miłecki
Looking for an -EINVAL all over the dsa code could take hours for inexperienced DSA users. Signed-off-by: Rafał Miłecki <rafal@milecki.pl> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://lore.kernel.org/r/20210106090915.21439-1-zajec5@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-06can: raw: return -ERANGE when filterset does not fit into user space bufferOliver Hartkopp
Multiple filters (struct can_filter) can be set with the setsockopt() function, which was originally intended as a write-only operation. As getsockopt() also provides a CAN_RAW_FILTER option to read back the given filters, the caller has to provide an appropriate user space buffer. In the case this buffer is too small the getsockopt() silently truncates the filter information and gives no information about the needed space. This is safe but not convenient for the programmer. In net/core/sock.c the SO_PEERGROUPS sockopt had a similar requirement and solved it by returning -ERANGE in the case that the provided data does not fit into the given user space buffer and fills the required size into optlen, so that the caller can retry with a matching buffer length. This patch adopts this approach for CAN_RAW_FILTER getsockopt(). Reported-by: Phillip Schichtel <phillip@schich.tel> Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Tested-By: Phillip Schichtel <phillip@schich.tel> Link: https://lore.kernel.org/r/20201216174928.21663-1-socketcan@hartkopp.net Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2021-01-06Bluetooth: avoid u128_xor() on potentially misaligned inputsArd Biesheuvel
u128_xor() takes pointers to quantities that are assumed to be at least 64-bit aligned, which is not guaranteed to be the case in the smp_c1() routine. So switch to crypto_xor() instead. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2021-01-05net: qrtr: fix null-ptr-deref in qrtr_ns_removeQinglang Miao
A null-ptr-deref bug is reported by Hulk Robot like this: -------------- KASAN: null-ptr-deref in range [0x0000000000000128-0x000000000000012f] Call Trace: qrtr_ns_remove+0x22/0x40 [ns] qrtr_proto_fini+0xa/0x31 [qrtr] __x64_sys_delete_module+0x337/0x4e0 do_syscall_64+0x34/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x468ded -------------- When qrtr_ns_init fails in qrtr_proto_init, qrtr_ns_remove which would be called later on would raise a null-ptr-deref because qrtr_ns.workqueue has been destroyed. Fix it by making qrtr_ns_init have a return value and adding a check in qrtr_proto_init. Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Qinglang Miao <miaoqinglang@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-01-05net: vlan: avoid leaks on register_vlan_dev() failuresJakub Kicinski
VLAN checks for NETREG_UNINITIALIZED to distinguish between registration failure and unregistration in progress. Since commit cb626bf566eb ("net-sysfs: Fix reference count leak") registration failure may, however, result in NETREG_UNREGISTERED as well as NETREG_UNINITIALIZED. This fix is similer to cebb69754f37 ("rtnetlink: Fix memory(net_device) leak when ->newlink fails") Fixes: cb626bf566eb ("net-sysfs: Fix reference count leak") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-01-05net: nfc: nci: Change the NCI close sequenceBongsu Jeon
If there is a NCI command in work queue after closing the NCI device at nci_unregister_device, The NCI command timer starts at flush_workqueue function and then NCI command timeout handler would be called 5 second after flushing the NCI command work queue and destroying the queue. At that time, the timeout handler would try to use NCI command work queue that is destroyed already. it will causes the problem. To avoid this abnormal situation, change the sequence to prevent the NCI command timeout handler from being called after destroying the NCI command work queue. Signed-off-by: Bongsu Jeon <bongsu.jeon@samsung.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-01-05net: kcm: Replace fput with sockfd_putZheng Yongjun
The function sockfd_lookup uses fget on the value that is stored in the file field of the returned structure, so fput should ultimately be applied to this value. This can be done directly, but it seems better to use the specific macro sockfd_put, which does the same thing. Perform a source code refactoring by using the following semantic patch. // <smpl> @@ expression s; @@ s = sockfd_lookup(...) ... + sockfd_put(s); - fput(s->file); // </smpl> Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-01-05cfg80211: select CONFIG_CRC32Arnd Bergmann
Without crc32 support, this fails to link: arm-linux-gnueabi-ld: net/wireless/scan.o: in function `cfg80211_scan_6ghz': scan.c:(.text+0x928): undefined reference to `crc32_le' Fixes: c8cb5b854b40 ("nl80211/cfg80211: support 6 GHz scanning") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-01-05net: tipc: Replace expression with offsetof()Zheng Yongjun
Use the existing offsetof() macro instead of duplicating code. Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-01-05Merge tag 'net-5.11-rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Networking fixes, including fixes from netfilter, wireless and bpf trees. Current release - regressions: - mt76: fix NULL pointer dereference in mt76u_status_worker and mt76s_process_tx_queue - net: ipa: fix interconnect enable bug Current release - always broken: - netfilter: fixes possible oops in mtype_resize in ipset - ath11k: fix number of coding issues found by static analysis tools and spurious error messages Previous releases - regressions: - e1000e: re-enable s0ix power saving flows for systems with the Intel i219-LM Ethernet controllers to fix power use regression - virtio_net: fix recursive call to cpus_read_lock() to avoid a deadlock - ipv4: ignore ECN bits for fib lookups in fib_compute_spec_dst() - sysfs: take the rtnl lock around XPS configuration - xsk: fix memory leak for failed bind and rollback reservation at NETDEV_TX_BUSY - r8169: work around power-saving bug on some chip versions Previous releases - always broken: - dcb: validate netlink message in DCB handler - tun: fix return value when the number of iovs exceeds MAX_SKB_FRAGS to prevent unnecessary retries - vhost_net: fix ubuf refcount when sendmsg fails - bpf: save correct stopping point in file seq iteration - ncsi: use real net-device for response handler - neighbor: fix div by zero caused by a data race (TOCTOU) - bareudp: fix use of incorrect min_headroom size and a false positive lockdep splat from the TX lock - mvpp2: - clear force link UP during port init procedure in case bootloader had set it - add TCAM entry to drop flow control pause frames - fix PPPoE with ipv6 packet parsing - fix GoP Networking Complex Control config of port 3 - fix pkt coalescing IRQ-threshold configuration - xsk: fix race in SKB mode transmit with shared cq - ionic: account for vlan tag len in rx buffer len - stmmac: ignore the second clock input, current clock framework does not handle exclusive clock use well, other drivers may reconfigure the second clock Misc: - ppp: change PPPIOCUNBRIDGECHAN ioctl request number to follow existing scheme" * tag 'net-5.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (99 commits) net: dsa: lantiq_gswip: Fix GSWIP_MII_CFG(p) register access net: dsa: lantiq_gswip: Enable GSWIP_MII_CFG_EN also for internal PHYs net: lapb: Decrease the refcount of "struct lapb_cb" in lapb_device_event r8169: work around power-saving bug on some chip versions net: usb: qmi_wwan: add Quectel EM160R-GL selftests: mlxsw: Set headroom size of correct port net: macb: Correct usage of MACB_CAPS_CLK_HW_CHG flag ibmvnic: fix: NULL pointer dereference. docs: networking: packet_mmap: fix old config reference docs: networking: packet_mmap: fix formatting for C macros vhost_net: fix ubuf refcount incorrectly when sendmsg fails bareudp: Fix use of incorrect min_headroom size bareudp: set NETIF_F_LLTX flag net: hdlc_ppp: Fix issues when mod_timer is called while timer is running atlantic: remove architecture depends erspan: fix version 1 check in gre_parse_header() net: hns: fix return value check in __lb_other_process() net: sched: prevent invalid Scell_log shift count net: neighbor: fix a crash caused by mod zero ipv4: Ignore ECN bits for fib lookups in fib_compute_spec_dst() ...
2021-01-04Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfJakub Kicinski
Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for net: 1) Missing sanitization of rateest userspace string, bug has been triggered by syzbot, patch from Florian Westphal. 2) Report EOPNOTSUPP on missing set features in nft_dynset, otherwise error reporting to userspace via EINVAL is misleading since this is reserved for malformed netlink requests. 3) New binaries with old kernels might silently accept several set element expressions. New binaries set on the NFT_SET_EXPR and NFT_DYNSET_F_EXPR flags to request for several expressions per element, hence old kernels which do not support for this bail out with EOPNOTSUPP. * git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf: netfilter: nftables: add set expression flags netfilter: nft_dynset: report EOPNOTSUPP on missing set feature netfilter: xt_RATEEST: reject non-null terminated string from userspace ==================== Link: https://lore.kernel.org/r/20210103192920.18639-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-04net: lapb: Decrease the refcount of "struct lapb_cb" in lapb_device_eventXie He
In lapb_device_event, lapb_devtostruct is called to get a reference to an object of "struct lapb_cb". lapb_devtostruct increases the refcount of the object and returns a pointer to it. However, we didn't decrease the refcount after we finished using the pointer. This patch fixes this problem. Fixes: a4989fa91110 ("net/lapb: support netdev events") Cc: Martin Schiller <ms@dev.tdt.de> Signed-off-by: Xie He <xie.he.0141@gmail.com> Link: https://lore.kernel.org/r/20201231174331.64539-1-xie.he.0141@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-04libceph, ceph: disambiguate ceph_connection_operations handlersIlya Dryomov
Since a few years, kernel addresses are no longer included in oops dumps, at least on x86. All we get is a symbol name with offset and size. This is a problem for ceph_connection_operations handlers, especially con->ops->dispatch(). All three handlers have the same name and there is little context to disambiguate between e.g. monitor and OSD clients because almost everything is inlined. gdb sneakily stops at the first matching symbol, so one has to resort to nm and addr2line. Some of these are already prefixed with mon_, osd_ or mds_. Let's do the same for all others. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Acked-by: Jeff Layton <jlayton@kernel.org>
2021-01-04libceph: zero out session key and connection secretIlya Dryomov
Try and avoid leaving bits and pieces of session key and connection secret (gets split into GCM key and a pair of GCM IVs) around. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jeff Layton <jlayton@kernel.org>
2021-01-04xfrm: Fix wraparound in xfrm_policy_addr_delta()Visa Hankala
Use three-way comparison for address components to avoid integer wraparound in the result of xfrm_policy_addr_delta(). This ensures that the search trees are built and traversed correctly. Treat IPv4 and IPv6 similarly by returning 0 when prefixlen == 0. Prefix /0 has only one equivalence class. Fixes: 9cf545ebd591d ("xfrm: policy: store inexact policies in a tree ordered by destination address") Signed-off-by: Visa Hankala <visa@hankala.org> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2021-01-04af_key: relax availability checks for skb size calculationCong Wang
xfrm_probe_algs() probes kernel crypto modules and changes the availability of struct xfrm_algo_desc. But there is a small window where ealg->available and aalg->available get changed between count_ah_combs()/count_esp_combs() and dump_ah_combs()/dump_esp_combs(), in this case we may allocate a smaller skb but later put a larger amount of data and trigger the panic in skb_put(). Fix this by relaxing the checks when counting the size, that is, skipping the test of ->available. We may waste some memory for a few of sizeof(struct sadb_comb), but it is still much better than a panic. Reported-by: syzbot+b2bf2652983d23734c5c@syzkaller.appspotmail.com Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2021-01-04xfrm: fix disable_xfrm sysctl when used on xfrm interfacesEyal Birger
The disable_xfrm flag signals that xfrm should not be performed during routing towards a device before reaching device xmit. For xfrm interfaces this is usually desired as they perform the outbound policy lookup as part of their xmit using their if_id. Before this change enabling this flag on xfrm interfaces prevented them from xmitting as xfrm_lookup_with_ifid() would not perform a policy lookup in case the original dst had the DST_NOXFRM flag. This optimization is incorrect when the lookup is done by the xfrm interface xmit logic. Fix by performing policy lookup when invoked by xfrmi as if_id != 0. Similarly it's unlikely for the 'no policy exists on net' check to yield any performance benefits when invoked from xfrmi. Fixes: f203b76d7809 ("xfrm: Add virtual xfrm interfaces") Signed-off-by: Eyal Birger <eyal.birger@gmail.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-12-28Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller
Daniel Borkmann says: ==================== pull-request: bpf 2020-12-28 The following pull-request contains BPF updates for your *net* tree. There is a small merge conflict between bpf tree commit 69ca310f3416 ("bpf: Save correct stopping point in file seq iteration") and net tree commit 66ed594409a1 ("bpf/task_iter: In task_file_seq_get_next use task_lookup_next_fd_rcu"). The get_files_struct() does not exist anymore in net, so take the hunk in HEAD and add the `info->tid = curr_tid` to the error path: [...] curr_task = task_seq_get_next(ns, &curr_tid, true); if (!curr_task) { info->task = NULL; info->tid = curr_tid; return NULL; } /* set info->task and info->tid */ [...] We've added 10 non-merge commits during the last 9 day(s) which contain a total of 11 files changed, 75 insertions(+), 20 deletions(-). The main changes are: 1) Various AF_XDP fixes such as fill/completion ring leak on failed bind and fixing a race in skb mode's backpressure mechanism, from Magnus Karlsson. 2) Fix latency spikes on lockdep enabled kernels by adding a rescheduling point to BPF hashtab initialization, from Eric Dumazet. 3) Fix a splat in task iterator by saving the correct stopping point in the seq file iteration, from Jonathan Lemon. 4) Fix BPF maps selftest by adding retries in case hashtab returns EBUSY errors on update/deletes, from Andrii Nakryiko. 5) Fix BPF selftest error reporting to something more user friendly if the vmlinux BTF cannot be found, from Kamal Mostafa. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-12-28erspan: fix version 1 check in gre_parse_header()Cong Wang
Both version 0 and version 1 use ETH_P_ERSPAN, but version 0 does not have an erspan header. So the check in gre_parse_header() is wrong, we have to distinguish version 1 from version 0. We can just check the gre header length like is_erspan_type1(). Fixes: cb73ee40b1b3 ("net: ip_gre: use erspan key field for tunnel lookup") Reported-by: syzbot+f583ce3d4ddf9836b27a@syzkaller.appspotmail.com Cc: William Tu <u9012063@gmail.com> Cc: Lorenzo Bianconi <lorenzo.bianconi@redhat.com> Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-12-28net: sched: prevent invalid Scell_log shift countRandy Dunlap
Check Scell_log shift size in red_check_params() and modify all callers of red_check_params() to pass Scell_log. This prevents a shift out-of-bounds as detected by UBSAN: UBSAN: shift-out-of-bounds in ./include/net/red.h:252:22 shift exponent 72 is too large for 32-bit type 'int' Fixes: 8afa10cbe281 ("net_sched: red: Avoid illegal values") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reported-by: syzbot+97c5bd9cc81eca63d36e@syzkaller.appspotmail.com Cc: Nogah Frankel <nogahf@mellanox.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Jiri Pirko <jiri@resnulli.us> Cc: netdev@vger.kernel.org Cc: "David S. Miller" <davem@davemloft.net> Cc: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>