summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2020-09-21net/mlx5e: TLS, Do not expose FPGA TLS counter if not supportedTariq Toukan
The set of TLS TX global SW counters in mlx5e_tls_sw_stats_desc is updated from all rings by using atomic ops. This set of stats is used only in the FPGA TLS use case, not in the Connect-X TLS one, where regular per-ring counters are used. Do not expose them in the Connect-X use case, as this would cause counter duplication. For example, tx_tls_drop_no_sync_data would appear twice in the ethtool stats. Fixes: d2ead1f360e8 ("net/mlx5e: Add kTLS TX HW offload support") Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2020-09-21net/mlx5e: Fix using wrong stats_grps in mlx5e_update_ndo_stats()Alaa Hleihel
The cited commit started to reuse function mlx5e_update_ndo_stats() for the representors as well. However, the function is hard-coded to work on mlx5e_nic_stats_grps only. Due to this issue, the representors statistics were not updated in the output of "ip -s". Fix it to work with the correct group by extracting it from the caller's profile. Also, while at it and since this function became generic, move it to en_stats.c and rename it accordingly. Fixes: 8a236b15144b ("net/mlx5e: Convert rep stats to mlx5e_stats_grp-based infra") Signed-off-by: Alaa Hleihel <alaa@nvidia.com> Reviewed-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-09-21net/mlx5e: Fix multicast counter not up-to-date in "ip -s"Ron Diskin
Currently the FW does not generate events for counters other than error counters. Unlike ".get_ethtool_stats", ".ndo_get_stats64" (which ip -s uses) might run in atomic context, while the FW interface is non atomic. Thus, 'ip' is not allowed to issue FW commands, so it will only display cached counters in the driver. Add a SW counter (mcast_packets) in the driver to count rx multicast packets. The counter also counts broadcast packets, as we consider it a special case of multicast. Use the counter value when calling "ip -s"/"ifconfig". Fixes: f62b8bb8f2d3 ("net/mlx5: Extend mlx5_core to support ConnectX-4 Ethernet functionality") Signed-off-by: Ron Diskin <rondi@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Reviewed-by: Moshe Shemesh <moshe@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-09-21net/mlx5e: Fix endianness when calculating pedit mask first bitMaor Dickman
The field mask value is provided in network byte order and has to be converted to host byte order before calculating pedit mask first bit. Fixes: 88f30bbcbaaa ("net/mlx5e: Bit sized fields rewrite support") Signed-off-by: Maor Dickman <maord@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2020-09-21net/mlx5e: Enable adding peer miss rules only if merged eswitch is supportedMaor Dickman
The cited commit creates peer miss group during switchdev mode initialization in order to handle miss packets correctly while in VF LAG mode. This is done regardless of FW support of such groups which could cause rules setups failure later on. Fix by adding FW capability check before creating peer groups/rule. Fixes: ac004b832128 ("net/mlx5e: E-Switch, Add peer miss rules") Signed-off-by: Maor Dickman <maord@mellanox.com> Reviewed-by: Roi Dayan <roid@mellanox.com> Reviewed-by: Raed Salem <raeds@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-09-21net/mlx5e: CT: Fix freeing ct_label mappingRoi Dayan
Add missing mapping remove call when removing ct rule, as the mapping was allocated when ct rule was adding with ct_label. Also there is a missing mapping remove call in error flow. Fixes: 54b154ecfb8c ("net/mlx5e: CT: Map 128 bits labels to 32 bit map ID") Signed-off-by: Roi Dayan <roid@mellanox.com> Reviewed-by: Eli Britstein <elibr@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-09-21net/mlx5e: Fix memory leak of tunnel info when rule under multipath not readyJianbo Liu
When deleting vxlan flow rule under multipath, tun_info in parse_attr is not freed when the rule is not ready. Fixes: ef06c9ee8933 ("net/mlx5e: Allow one failure when offloading tc encap rules under multipath") Signed-off-by: Jianbo Liu <jianbol@mellanox.com> Reviewed-by: Roi Dayan <roid@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-09-21net/mlx5e: Use synchronize_rcu to sync with NAPIMaxim Mikityanskiy
As described in the previous commit, napi_synchronize doesn't quite fit the purpose when we just need to wait until the currently running NAPI quits. Its implementation waits until NAPI is not running by polling and waiting for 1ms in between. In cases where we need to deactivate one queue (e.g., recovery flows) or where we deactivate them one-by-one (deactivate channel flow), we may get stuck in napi_synchronize forever if other queues keep NAPI active, causing a soft lockup. Depending on kernel configuration (CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC), it may result in a kernel panic. To fix the issue, use synchronize_rcu to wait for NAPI to quit, and wrap the whole NAPI in rcu_read_lock. Fixes: acc6c5953af1 ("net/mlx5e: Split open/close channels to stages") Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-09-21net/mlx5e: Use RCU to protect rq->xdp_progMaxim Mikityanskiy
Currently, the RQs are temporarily deactivated while hot-replacing the XDP program, and napi_synchronize is used to make sure rq->xdp_prog is not in use. However, napi_synchronize is not ideal: instead of waiting till the end of a NAPI cycle, it polls and waits until NAPI is not running, sleeping for 1ms between the periodic checks. Under heavy workloads, this loop will never end, which may even lead to a kernel panic if the kernel detects the hangup. Such workloads include XSK TX and possibly also heavy RX (XSK or normal). The fix is inspired by commit 326fe02d1ed6 ("net/mlx4_en: protect ring->xdp_prog with rcu_read_lock"). As mlx5e_xdp_handle is already protected by rcu_read_lock, and bpf_prog_put uses call_rcu to free the program, there is no need for additional synchronization if proper RCU functions are used to access the pointer. This patch converts all accesses to rq->xdp_prog to use RCU functions. Fixes: 86994156c736 ("net/mlx5e: XDP fast RX drop bpf programs support") Fixes: db05815b36cb ("net/mlx5e: Add XSK zero-copy support") Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-09-21net/mlx5: Fix FTE cleanupMaor Gottlieb
Currently, when an FTE is allocated, its refcount is decreased to 0 with the purpose it will not be a stand alone steering object and every rule (destination) of the FTE would increase the refcount. When mlx5_cleanup_fs is called while not all rules were deleted by the steering users, it hit refcount underflow on the FTE once clean_tree calls to tree_remove_node after the deleted rules already decreased the refcount to 0. FTE is no longer destroyed implicitly when the last rule (destination) is deleted. mlx5_del_flow_rules avoids it by increasing the refcount on the FTE and destroy it explicitly after all rules were deleted. So we can avoid the refcount underflow by making FTE as stand alone object. In addition need to set del_hw_func to FTE so the HW object will be destroyed when the FTE is deleted from the cleanup_tree flow. refcount_t: underflow; use-after-free. WARNING: CPU: 2 PID: 15715 at lib/refcount.c:28 refcount_warn_saturate+0xd9/0xe0 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 Call Trace: tree_put_node+0xf2/0x140 [mlx5_core] clean_tree+0x4e/0xf0 [mlx5_core] clean_tree+0x4e/0xf0 [mlx5_core] clean_tree+0x4e/0xf0 [mlx5_core] clean_tree+0x5f/0xf0 [mlx5_core] clean_tree+0x4e/0xf0 [mlx5_core] clean_tree+0x5f/0xf0 [mlx5_core] mlx5_cleanup_fs+0x26/0x270 [mlx5_core] mlx5_unload+0x2e/0xa0 [mlx5_core] mlx5_unload_one+0x51/0x120 [mlx5_core] mlx5_devlink_reload_down+0x51/0x90 [mlx5_core] devlink_reload+0x39/0x120 ? devlink_nl_cmd_reload+0x43/0x220 genl_rcv_msg+0x1e4/0x420 ? genl_family_rcv_msg_attrs_parse+0x100/0x100 netlink_rcv_skb+0x47/0x110 genl_rcv+0x24/0x40 netlink_unicast+0x217/0x2f0 netlink_sendmsg+0x30f/0x430 sock_sendmsg+0x30/0x40 __sys_sendto+0x10e/0x140 ? handle_mm_fault+0xc4/0x1f0 ? do_page_fault+0x33f/0x630 __x64_sys_sendto+0x24/0x30 do_syscall_64+0x48/0x130 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Fixes: 718ce4d601db ("net/mlx5: Consolidate update FTE for all removal changes") Fixes: bd71b08ec2ee ("net/mlx5: Support multiple updates of steering rules in parallel") Signed-off-by: Maor Gottlieb <maorg@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2020-09-17nfp: use correct define to return NONE fecJakub Kicinski
struct ethtool_fecparam carries bitmasks not bit numbers. We want to return 1 (NONE), not 0. Fixes: 0d0870938337 ("nfp: implement ethtool FEC mode settings") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Simon Horman <simon.horman@netronome.com> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-17hinic: fix potential resource leakWei Li
In rx_request_irq(), it will just return what irq_set_affinity_hint() returns. If it is failed, the napi and irq requested are not freed properly. So add exits for failures to handle these. Signed-off-by: Wei Li <liwei391@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-17Merge branch 'net-phy-Unbind-fixes'David S. Miller
Florian Fainelli says: ==================== net: phy: Unbind fixes This patch series fixes a couple of issues with the unbinding of the PHY drivers and then bringing down a network interface. The first is a NULL pointer de-reference and the second was an incorrect warning being triggered. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-17net: phy: Do not warn in phy_stop() on PHY_DOWNFlorian Fainelli
When phy_is_started() was added to catch incorrect PHY states, phy_stop() would not be qualified against PHY_DOWN. It is possible to reach that state when the PHY driver has been unbound and the network device is then brought down. Fixes: 2b3e88ea6528 ("net: phy: improve phy state checking") Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-17net: phy: Avoid NPD upon phy_detach() when driver is unboundFlorian Fainelli
If we have unbound the PHY driver prior to calling phy_detach() (often via phy_disconnect()) then we can cause a NULL pointer de-reference accessing the driver owner member. The steps to reproduce are: echo unimac-mdio-0:01 > /sys/class/net/eth0/phydev/driver/unbind ip link set eth0 down Fixes: cafe8df8b9bc ("net: phy: Fix lack of reference count on PHY driver") Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-17ethtool: add and use message type for tunnel info replyMichal Kubecek
Tunnel offload info code uses ETHTOOL_MSG_TUNNEL_INFO_GET message type (cmd field in genetlink header) for replies to tunnel info netlink request, i.e. the same value as the request have. This is a problem because we are using two separate enums for userspace to kernel and kernel to userspace message types so that this ETHTOOL_MSG_TUNNEL_INFO_GET (28) collides with ETHTOOL_MSG_CABLE_TEST_TDR_NTF which is what message type 28 means for kernel to userspace messages. As the tunnel info request reached mainline in 5.9 merge window, we should still be able to fix the reply message type without breaking backward compatibility. Fixes: c7d759eb7b12 ("ethtool: add tunnel info interface") Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-17drivers/net/wan/hdlc: Set skb->protocol before transmittingXie He
This patch sets skb->protocol before transmitting frames on the HDLC device, so that a user listening on the HDLC device with an AF_PACKET socket will see outgoing frames' sll_protocol field correctly set and consistent with that of incoming frames. 1. Control frames in hdlc_cisco and hdlc_ppp When these drivers send control frames, skb->protocol is not set. This value should be set to htons(ETH_P_HDLC), because when receiving control frames, their skb->protocol is set to htons(ETH_P_HDLC). When receiving, hdlc_type_trans in hdlc.h is called, which then calls cisco_type_trans or ppp_type_trans. The skb->protocol of control frames is set to htons(ETH_P_HDLC) so that the control frames can be received by hdlc_rcv in hdlc.c, which calls cisco_rx or ppp_rx to process the control frames. 2. hdlc_fr When this driver sends control frames, skb->protocol is set to internal values used in this driver. When this driver sends data frames (from upper stacked PVC devices), skb->protocol is the same as that of the user data packet being sent on the upper PVC device (for normal PVC devices), or is htons(ETH_P_802_3) (for Ethernet-emulating PVC devices). However, skb->protocol for both control frames and data frames should be set to htons(ETH_P_HDLC), because when receiving, all frames received on the HDLC device will have their skb->protocol set to htons(ETH_P_HDLC). When receiving, hdlc_type_trans in hdlc.h is called, and because this driver doesn't provide a type_trans function in struct hdlc_proto, all frames will have their skb->protocol set to htons(ETH_P_HDLC). The frames are then received by hdlc_rcv in hdlc.c, which calls fr_rx to process the frames (control frames are consumed and data frames are re-received on upper PVC devices). Cc: Krzysztof Halasa <khc@pm.waw.pl> Signed-off-by: Xie He <xie.he.0141@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-17drivers/net/wan/lapbether: Make skb->protocol consistent with the headerXie He
This driver is a virtual driver stacked on top of Ethernet interfaces. When this driver transmits data on the Ethernet device, the skb->protocol setting is inconsistent with the Ethernet header prepended to the skb. This causes a user listening on the Ethernet interface with an AF_PACKET socket, to see different sll_protocol values for incoming and outgoing frames, because incoming frames would have this value set by parsing the Ethernet header. This patch changes the skb->protocol value for outgoing Ethernet frames, making it consistent with the Ethernet header prepended. This makes a user listening on the Ethernet device with an AF_PACKET socket, to see the same sll_protocol value for incoming and outgoing frames. Cc: Martin Schiller <ms@dev.tdt.de> Signed-off-by: Xie He <xie.he.0141@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-17cxgb4: fix memory leak during module unloadRaju Rangoju
Fix the memory leak in mps during module unload path by freeing mps reference entries if the list adpter->mps_ref is not already empty Fixes: 28b3870578ef ("cxgb4: Re-work the logic for mps refcounting") Signed-off-by: Raju Rangoju <rajur@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-17hv_netvsc: Add validation for untrusted Hyper-V valuesAndres Beltran
For additional robustness in the face of Hyper-V errors or malicious behavior, validate all values that originate from packets that Hyper-V has sent to the guest in the host-to-guest ring buffer. Ensure that invalid values cannot cause indexing off the end of an array, or subvert an existing validation via integer overflow. Ensure that outgoing packets do not have any leftover guest memory that has not been zeroed out. Signed-off-by: Andres Beltran <lkmlabelt@gmail.com> Co-developed-by: Andrea Parri (Microsoft) <parri.andrea@gmail.com> Signed-off-by: Andrea Parri (Microsoft) <parri.andrea@gmail.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Jakub Kicinski <kuba@kernel.org> Cc: netdev@vger.kernel.org Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-16net: dsa: microchip: ksz8795: really set the correct number of portsMatthias Schiffer
The KSZ9477 and KSZ8795 use the port_cnt field differently: For the KSZ9477, it includes the CPU port(s), while for the KSZ8795, it doesn't. It would be a good cleanup to make the handling of both drivers match, but as a first step, fix the recently broken assignment of num_ports in the KSZ8795 driver (which completely broke probing, as the CPU port index was always failing the num_ports check). Fixes: af199a1a9cb0 ("net: dsa: microchip: set the correct number of ports") Signed-off-by: Matthias Schiffer <matthias.schiffer@ew.tq-group.com> Reviewed-by: Codrin Ciubotariu <codrin.ciubotariu@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-16geneve: add transport ports in route lookup for geneveMark Gray
This patch adds transport ports information for route lookup so that IPsec can select Geneve tunnel traffic to do encryption. This is needed for OVS/OVN IPsec with encrypted Geneve tunnels. This can be tested by configuring a host-host VPN using an IKE daemon and specifying port numbers. For example, for an Openswan-type configuration, the following parameters should be configured on both hosts and IPsec set up as-per normal: $ cat /etc/ipsec.conf conn in ... left=$IP1 right=$IP2 ... leftprotoport=udp/6081 rightprotoport=udp ... conn out ... left=$IP1 right=$IP2 ... leftprotoport=udp rightprotoport=udp/6081 ... The tunnel can then be setup using "ip" on both hosts (but changing the relevant IP addresses): $ ip link add tun type geneve id 1000 remote $IP2 $ ip addr add 192.168.0.1/24 dev tun $ ip link set tun up This can then be tested by pinging from $IP1: $ ping 192.168.0.2 Without this patch the traffic is unencrypted on the wire. Fixes: 2d07dc79fe04 ("geneve: add initial netdev driver for GENEVE tunnels") Signed-off-by: Qiuyu Xiao <qiuyu.xiao.qyx@gmail.com> Signed-off-by: Mark Gray <mark.d.gray@redhat.com> Reviewed-by: Greg Rose <gvrose8192@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-16net: hns: kerneldoc fixesLu Wei
Fix some parameter description or spelling mistakes. Signed-off-by: Lu Wei <luwei32@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller
Alexei Starovoitov says: ==================== pull-request: bpf 2020-09-15 The following pull-request contains BPF updates for your *net* tree. We've added 12 non-merge commits during the last 19 day(s) which contain a total of 10 files changed, 47 insertions(+), 38 deletions(-). The main changes are: 1) docs/bpf fixes, from Andrii. 2) ld_abs fix, from Daniel. 3) socket casting helpers fix, from Martin. 4) hash iterator fixes, from Yonghong. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-15bpf: Fix a rcu warning for bpffs map pretty-printYonghong Song
Running selftest ./btf_btf -p the kernel had the following warning: [ 51.528185] WARNING: CPU: 3 PID: 1756 at kernel/bpf/hashtab.c:717 htab_map_get_next_key+0x2eb/0x300 [ 51.529217] Modules linked in: [ 51.529583] CPU: 3 PID: 1756 Comm: test_btf Not tainted 5.9.0-rc1+ #878 [ 51.530346] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.el7.centos 04/01/2014 [ 51.531410] RIP: 0010:htab_map_get_next_key+0x2eb/0x300 ... [ 51.542826] Call Trace: [ 51.543119] map_seq_next+0x53/0x80 [ 51.543528] seq_read+0x263/0x400 [ 51.543932] vfs_read+0xad/0x1c0 [ 51.544311] ksys_read+0x5f/0xe0 [ 51.544689] do_syscall_64+0x33/0x40 [ 51.545116] entry_SYSCALL_64_after_hwframe+0x44/0xa9 The related source code in kernel/bpf/hashtab.c: 709 static int htab_map_get_next_key(struct bpf_map *map, void *key, void *next_key) 710 { 711 struct bpf_htab *htab = container_of(map, struct bpf_htab, map); 712 struct hlist_nulls_head *head; 713 struct htab_elem *l, *next_l; 714 u32 hash, key_size; 715 int i = 0; 716 717 WARN_ON_ONCE(!rcu_read_lock_held()); In kernel/bpf/inode.c, bpffs map pretty print calls map->ops->map_get_next_key() without holding a rcu_read_lock(), hence causing the above warning. To fix the issue, just surrounding map->ops->map_get_next_key() with rcu read lock. Fixes: a26ca7c982cb ("bpf: btf: Add pretty print support to the basic arraymap") Reported-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Cc: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200916004401.146277-1-yhs@fb.com
2020-09-15bpf: Bpf_skc_to_* casting helpers require a NULL check on skMartin KaFai Lau
The bpf_skc_to_* type casting helpers are available to BPF_PROG_TYPE_TRACING. The traced PTR_TO_BTF_ID may be NULL. For example, the skb->sk may be NULL. Thus, these casting helpers need to check "!sk" also and this patch fixes them. Fixes: 0d4fad3e57df ("bpf: Add bpf_skc_to_udp6_sock() helper") Fixes: 478cfbdf5f13 ("bpf: Add bpf_skc_to_{tcp, tcp_timewait, tcp_request}_sock() helpers") Fixes: af7ec1383361 ("bpf: Add bpf_skc_to_tcp6_sock() helper") Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20200915182959.241101-1-kafai@fb.com
2020-09-15ipv4: Update exception handling for multipath routes via same deviceDavid Ahern
Kfir reported that pmtu exceptions are not created properly for deployments where multipath routes use the same device. After some digging I see 2 compounding problems: 1. ip_route_output_key_hash_rcu is updating the flowi4_oif *after* the route lookup. This is the second use case where this has been a problem (the first is related to use of vti devices with VRF). I can not find any reason for the oif to be changed after the lookup; the code goes back to the start of git. It does not seem logical so remove it. 2. fib_lookups for exceptions do not call fib_select_path to handle multipath route selection based on the hash. The end result is that the fib_lookup used to add the exception always creates it based using the first leg of the route. An example topology showing the problem: | host1 +------+ | eth0 | .209 +------+ | +------+ switch | br0 | +------+ | +---------+---------+ | host2 | host3 +------+ +------+ | eth0 | .250 | eth0 | 192.168.252.252 +------+ +------+ +-----+ +-----+ | vti | .2 | vti | 192.168.247.3 +-----+ +-----+ \ / ================================= tunnels 192.168.247.1/24 for h in host1 host2 host3; do ip netns add ${h} ip -netns ${h} link set lo up ip netns exec ${h} sysctl -wq net.ipv4.ip_forward=1 done ip netns add switch ip -netns switch li set lo up ip -netns switch link add br0 type bridge stp 0 ip -netns switch link set br0 up for n in 1 2 3; do ip -netns switch link add eth-sw type veth peer name eth-h${n} ip -netns switch li set eth-h${n} master br0 up ip -netns switch li set eth-sw netns host${n} name eth0 done ip -netns host1 addr add 192.168.252.209/24 dev eth0 ip -netns host1 link set dev eth0 up ip -netns host1 route add 192.168.247.0/24 \ nexthop via 192.168.252.250 dev eth0 nexthop via 192.168.252.252 dev eth0 ip -netns host2 addr add 192.168.252.250/24 dev eth0 ip -netns host2 link set dev eth0 up ip -netns host2 addr add 192.168.252.252/24 dev eth0 ip -netns host3 link set dev eth0 up ip netns add tunnel ip -netns tunnel li set lo up ip -netns tunnel li add br0 type bridge ip -netns tunnel li set br0 up for n in $(seq 11 20); do ip -netns tunnel addr add dev br0 192.168.247.${n}/24 done for n in 2 3 do ip -netns tunnel link add vti${n} type veth peer name eth${n} ip -netns tunnel link set eth${n} mtu 1360 master br0 up ip -netns tunnel link set vti${n} netns host${n} mtu 1360 up ip -netns host${n} addr add dev vti${n} 192.168.247.${n}/24 done ip -netns tunnel ro add default nexthop via 192.168.247.2 nexthop via 192.168.247.3 ip netns exec host1 ping -M do -s 1400 -c3 -I 192.168.252.209 192.168.247.11 ip netns exec host1 ping -M do -s 1400 -c3 -I 192.168.252.209 192.168.247.15 ip -netns host1 ro ls cache Before this patch the cache always shows exceptions against the first leg in the multipath route; 192.168.252.250 per this example. Since the hash has an initial random seed, you may need to vary the final octet more than what is listed. In my tests, using addresses between 11 and 19 usually found 1 that used both legs. With this patch, the cache will have exceptions for both legs. Fixes: 4895c771c7f0 ("ipv4: Add FIB nexthop exceptions") Reported-by: Kfir Itzhak <mastertheknife@gmail.com> Signed-off-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-15net: tipc: kerneldoc fixesLu Wei
Fix parameter description of tipc_link_bc_create() Reported-by: Hulk Robot <hulkci@huawei.com> Fixes: 16ad3f4022bb ("tipc: introduce variable window congestion control") Signed-off-by: Lu Wei <luwei32@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-15ibmvnic: update MAINTAINERSDany Madden
Update supporters for IBM Power SRIOV Virtual NIC Device Driver. Thomas Falcon is moving on to other works. Dany Madden, Lijun Pan and Sukadev Bhattiprolu are the current supporters. Signed-off-by: Dany Madden <drt@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14docs/bpf: Remove source code linksAndrii Nakryiko
Make path to bench_ringbufs.c just a text, not a special link. Fixes: 97abb2b39682 ("docs/bpf: Add BPF ring buffer design notes") Reported-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200915005031.2748397-1-andriin@fb.com
2020-09-14xsk: Fix number of pinned pages/umem size discrepancyBjörn Töpel
For AF_XDP sockets, there was a discrepancy between the number of of pinned pages and the size of the umem region. The size of the umem region is used to validate the AF_XDP descriptor addresses. The logic that pinned the pages covered by the region only took whole pages into consideration, creating a mismatch between the size and pinned pages. A user could then pass AF_XDP addresses outside the range of pinned pages, but still within the size of the region, crashing the kernel. This change correctly calculates the number of pages to be pinned. Further, the size check for the aligned mode is simplified. Now the code simply checks if the size is divisible by the chunk size. Fixes: bbff2f321a86 ("xsk: new descriptor addressing scheme") Reported-by: Ciara Loftus <ciara.loftus@intel.com> Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Tested-by: Ciara Loftus <ciara.loftus@intel.com> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20200910075609.7904-1-bjorn.topel@gmail.com
2020-09-14net: sched: initialize with 0 before setting erspan md->uXin Long
In fl_set_erspan_opt(), all bits of erspan md was set 1, as this function is also used to set opt MASK. However, when setting for md->u.index for opt VALUE, the rest bits of the union md->u will be left 1. It would cause to fail the match of the whole md when version is 1 and only index is set. This patch is to fix by initializing with 0 before setting erspan md->u. Reported-by: Shuang Li <shuali@redhat.com> Fixes: 79b1011cb33d ("net: sched: allow flower to match erspan options") Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14Merge branch 'net-improve-vxlan-option-process-in-net_sched-and-lwtunnel'David S. Miller
Xin Long says: ==================== net: improve vxlan option process in net_sched and lwtunnel This patch is to do some mask when setting vxlan option in net_sched and lwtunnel, so that only available bits can be set on vxlan md gbp. This would help when users don't know exactly vxlan's gbp bits, and avoid some mismatch because of some unavailable bits set by users. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14lwtunnel: only keep the available bits when setting vxlan md->gbpXin Long
As we can see from vxlan_build/parse_gbp_hdr(), when processing metadata on vxlan rx/tx path, only dont_learn/policy_applied/policy_id fields can be set to or parse from the packet for vxlan gbp option. So do the mask when set it in lwtunnel, as it does in act_tunnel_key and cls_flower. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14net: sched: only keep the available bits when setting vxlan md->gbpXin Long
As we can see from vxlan_build/parse_gbp_hdr(), when processing metadata on vxlan rx/tx path, only dont_learn/policy_applied/policy_id fields can be set to or parse from the packet for vxlan gbp option. So we'd better do the mask when set it in act_tunnel_key and cls_flower. Otherwise, when users don't know these bits, they may configure with a value which can never be matched. Reported-by: Shuang Li <shuali@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14tipc: use skb_unshare() instead in tipc_buf_append()Xin Long
In tipc_buf_append() it may change skb's frag_list, and it causes problems when this skb is cloned. skb_unclone() doesn't really make this skb's flag_list available to change. Shuang Li has reported an use-after-free issue because of this when creating quite a few macvlan dev over the same dev, where the broadcast packets will be cloned and go up to the stack: [ ] BUG: KASAN: use-after-free in pskb_expand_head+0x86d/0xea0 [ ] Call Trace: [ ] dump_stack+0x7c/0xb0 [ ] print_address_description.constprop.7+0x1a/0x220 [ ] kasan_report.cold.10+0x37/0x7c [ ] check_memory_region+0x183/0x1e0 [ ] pskb_expand_head+0x86d/0xea0 [ ] process_backlog+0x1df/0x660 [ ] net_rx_action+0x3b4/0xc90 [ ] [ ] Allocated by task 1786: [ ] kmem_cache_alloc+0xbf/0x220 [ ] skb_clone+0x10a/0x300 [ ] macvlan_broadcast+0x2f6/0x590 [macvlan] [ ] macvlan_process_broadcast+0x37c/0x516 [macvlan] [ ] process_one_work+0x66a/0x1060 [ ] worker_thread+0x87/0xb10 [ ] [ ] Freed by task 3253: [ ] kmem_cache_free+0x82/0x2a0 [ ] skb_release_data+0x2c3/0x6e0 [ ] kfree_skb+0x78/0x1d0 [ ] tipc_recvmsg+0x3be/0xa40 [tipc] So fix it by using skb_unshare() instead, which would create a new skb for the cloned frag and it'll be safe to change its frag_list. The similar things were also done in sctp_make_reassembled_event(), which is using skb_copy(). Reported-by: Shuang Li <shuali@redhat.com> Fixes: 37e22164a8a3 ("tipc: rename and move message reassembly function") Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14tipc: Fix memory leak in tipc_group_create_member()Peilin Ye
tipc_group_add_to_tree() returns silently if `key` matches `nkey` of an existing node, causing tipc_group_create_member() to leak memory. Let tipc_group_add_to_tree() return an error in such a case, so that tipc_group_create_member() can handle it properly. Fixes: 75da2163dbb6 ("tipc: introduce communication groups") Reported-and-tested-by: syzbot+f95d90c454864b3b5bc9@syzkaller.appspotmail.com Cc: Hillf Danton <hdanton@sina.com> Link: https://syzkaller.appspot.com/bug?id=048390604fe1b60df34150265479202f10e13aff Signed-off-by: Peilin Ye <yepeilin.cs@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14ipv4: Initialize flowi4_multipath_hash in data pathDavid Ahern
flowi4_multipath_hash was added by the commit referenced below for tunnels. Unfortunately, the patch did not initialize the new field for several fast path lookups that do not initialize the entire flow struct to 0. Fix those locations. Currently, flowi4_multipath_hash is random garbage and affects the hash value computed by fib_multipath_hash for multipath selection. Fixes: 24ba14406c5c ("route: Add multipath_hash in flowi_common to make user-define hash") Signed-off-by: David Ahern <dsahern@gmail.com> Cc: wenxu <wenxu@ucloud.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14Merge branch 'net-lantiq-Fix-bugs-in-NAPI-handling'David S. Miller
Hauke Mehrtens says: ==================== net: lantiq: Fix bugs in NAPI handling This fixes multiple bugs in the NAPI handling. Changes since: v1: - removed stable tag from "net: lantiq: use netif_tx_napi_add() for TX NAPI" - Check the NAPI budged in "net: lantiq: Use napi_complete_done()" - Add extra fix "net: lantiq: Disable IRQs only if NAPI gets scheduled" ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14net: lantiq: Disable IRQs only if NAPI gets scheduledHauke Mehrtens
The napi_schedule() call will only schedule the NAPI if it is not already running. To make sure that we do not deactivate interrupts without scheduling NAPI only deactivate the interrupts in case NAPI also gets scheduled. Signed-off-by: Hauke Mehrtens <hauke@hauke-m.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14net: lantiq: Use napi_complete_done()Hauke Mehrtens
Use napi_complete_done() and activate the interrupts when this function returns true. This way the generic NAPI code can take care of activating the interrupts. Signed-off-by: Hauke Mehrtens <hauke@hauke-m.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14net: lantiq: use netif_tx_napi_add() for TX NAPIHauke Mehrtens
netif_tx_napi_add() should be used for NAPI in the TX direction instead of the netif_napi_add() function. Signed-off-by: Hauke Mehrtens <hauke@hauke-m.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14net: lantiq: Wake TX queue againHauke Mehrtens
The call to netif_wake_queue() when the TX descriptors were freed was missing. When there are no TX buffers available the TX queue will be stopped, but it was not started again when they are available again, this is fixed in this patch. Fixes: fe1a56420cf2 ("net: lantiq: Add Lantiq / Intel VRX200 Ethernet driver") Signed-off-by: Hauke Mehrtens <hauke@hauke-m.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14rndis_host: increase sleep time in the query-response loopOlympia Giannou
Some WinCE devices face connectivity issues via the NDIS interface. They fail to register, resulting in -110 timeout errors and failures during the probe procedure. In this kind of WinCE devices, the Windows-side ndis driver needs quite more time to be loaded and configured, so that the linux rndis host queries to them fail to be responded correctly on time. More specifically, when INIT is called on the WinCE side - no other requests can be served by the Client and this results in a failed QUERY afterwards. The increase of the waiting time on the side of the linux rndis host in the command-response loop leaves the INIT process to complete and respond to a QUERY, which comes afterwards. The WinCE devices with this special "feature" in their ndis driver are satisfied by this fix. Signed-off-by: Olympia Giannou <olympia.giannou@leica-geosystems.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-11net: ethernet: ti: cpsw_new: fix suspend/resumeGrygorii Strashko
Add missed suspend/resume callbacks to properly restore networking after suspend/resume cycle. Fixes: ed3525eda4c4 ("net: ethernet: ti: introduce cpsw switchdev based driver part 1 - dual-emac") Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-11net: ipa: fix u32_replace_bits by u32p_xxx versionVadym Kochan
Looks like u32p_replace_bits() should be used instead of u32_replace_bits() which does not modifies the value but returns the modified version. Fixes: 2b9feef2b6c2 ("soc: qcom: ipa: filter and routing tables") Signed-off-by: Vadym Kochan <vadym.kochan@plvision.eu> Reviewed-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-11hinic: fix rewaking txq after netif_tx_disableLuo bin
When calling hinic_close in hinic_set_channels, all queues are stopped after netif_tx_disable, but some queue may be rewaken in free_tx_poll by mistake while drv is handling tx irq. If one queue is rewaken core may call hinic_xmit_frame to send pkt after netif_tx_disable within a short time which may results in accessing memory that has been already freed in hinic_close. So we call napi_disable before netif_tx_disable in hinic_close to fix this bug. Fixes: 2eed5a8b614b ("hinic: add set_channels ethtool_ops support") Signed-off-by: Luo bin <luobin9@huawei.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-11taprio: Fix allowing too small intervalsVinicius Costa Gomes
It's possible that the user specifies an interval that couldn't allow any packet to be transmitted. This also avoids the issue of the hrtimer handler starving the other threads because it's running too often. The solution is to reject interval sizes that according to the current link speed wouldn't allow any packet to be transmitted. Reported-by: syzbot+8267241609ae8c23b248@syzkaller.appspotmail.com Fixes: 5a781ccbd19e ("tc: Add support for configuring the taprio scheduler") Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-11Merge branch '40GbE' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-queue Tony Nguyen says: ==================== Intel Wired LAN Driver Updates 2020-09-09 This series contains updates to i40e and igc drivers. Stefan Assmann changes num_vlans to u16 to fix may be used uninitialized error and propagates error in i40_set_vsi_promisc() for i40e. Vinicius corrects timestamping latency values for i225 devices and accounts for TX timestamping delay for igc. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-11enetc: Fix mdio bus removal on PF probe bailoutClaudiu Manoil
This is the correct resolution for the conflict from merging the "net" tree fix: commit 26cb7085c898 ("enetc: Remove the mdio bus on PF probe bailout") with the "net-next" new work: commit 07095c025ac2 ("net: enetc: Use DT protocol information to set up the ports") that moved mdio bus allocation to an ealier stage of the PF probing routine. Fixes: a57066b1a019 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net") Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>