summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2022-04-20net: dsa: make cross-chip notifiers more efficient for host eventsVladimir Oltean
To determine whether a given port should react to the port targeted by the notifier, dsa_port_host_vlan_match() and dsa_port_host_address_match() look at the positioning of the switch port currently executing the notifier relative to the switch port for which the notifier was emitted. To maintain stylistic compatibility with the other match functions from switch.c, the host address and host VLAN match functions take the notifier information about targeted port, switch and tree indices as argument. However, these functions only use that information to retrieve the struct dsa_port *targeted_dp, which is an invariant for the outer loop that calls them. So it makes more sense to calculate the targeted dp only once, and pass it to them as argument. But furthermore, the targeted dp is actually known at the time the call to dsa_port_notify() is made. It is just that we decide to only save the indices of the port, switch and tree in the notifier structure, just to retrace our steps and find the dp again using dsa_switch_find() and dsa_to_port(). But both the above functions are relatively expensive, since they need to iterate through lists. It appears more straightforward to make all notifiers just pass the targeted dp inside their info structure, and have the code that needs the indices to look at info->dp->index instead of info->port, or info->dp->ds->index instead of info->sw_index, or info->dp->ds->dst->index instead of info->tree_index. For the sake of consistency, all cross-chip notifiers are converted to pass the "dp" directly. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-20net: dsa: move reset of VLAN filtering to dsa_port_switchdev_unsync_attrsVladimir Oltean
In dsa_port_switchdev_unsync_attrs() there is a comment that resetting the VLAN filtering isn't done where it is expected. And since commit 108dc8741c20 ("net: dsa: Avoid cross-chip syncing of VLAN filtering"), there is no reason to handle this in switch.c either. Therefore, move the logic to port.c, and adapt it slightly to the data structures and naming conventions from there. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-19bpf: Fix usage of trace RCU in local storage.KP Singh
bpf_{sk,task,inode}_storage_free() do not need to use call_rcu_tasks_trace as no BPF program should be accessing the owner as it's being destroyed. The only other reader at this point is bpf_local_storage_map_free() which uses normal RCU. The only path that needs trace RCU are: * bpf_local_storage_{delete,update} helpers * map_{delete,update}_elem() syscalls Fixes: 0fe4b381a59e ("bpf: Allow bpf_local_storage to be used by sleepable programs") Signed-off-by: KP Singh <kpsingh@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20220418155158.2865678-1-kpsingh@kernel.org
2022-04-19netlink: reset network and mac headers in netlink_dump()Eric Dumazet
netlink_dump() is allocating an skb, reserves space in it but forgets to reset network header. This allows a BPF program, invoked later from sk_filter() to access uninitialized kernel memory from the reserved space. Theorically mac header reset could be omitted, because it is set to a special initial value. bpf_internal_load_pointer_neg_helper calls skb_mac_header() without checking skb_mac_header_was_set(). Relying on skb->len not being too big seems fragile. We also could add a sanity check in bpf_internal_load_pointer_neg_helper() to avoid surprises in the future. syzbot report was: BUG: KMSAN: uninit-value in ___bpf_prog_run+0xa22b/0xb420 kernel/bpf/core.c:1637 ___bpf_prog_run+0xa22b/0xb420 kernel/bpf/core.c:1637 __bpf_prog_run32+0x121/0x180 kernel/bpf/core.c:1796 bpf_dispatcher_nop_func include/linux/bpf.h:784 [inline] __bpf_prog_run include/linux/filter.h:626 [inline] bpf_prog_run include/linux/filter.h:633 [inline] __bpf_prog_run_save_cb+0x168/0x580 include/linux/filter.h:756 bpf_prog_run_save_cb include/linux/filter.h:770 [inline] sk_filter_trim_cap+0x3bc/0x8c0 net/core/filter.c:150 sk_filter include/linux/filter.h:905 [inline] netlink_dump+0xe0c/0x16c0 net/netlink/af_netlink.c:2276 netlink_recvmsg+0x1129/0x1c80 net/netlink/af_netlink.c:2002 sock_recvmsg_nosec net/socket.c:948 [inline] sock_recvmsg net/socket.c:966 [inline] sock_read_iter+0x5a9/0x630 net/socket.c:1039 do_iter_readv_writev+0xa7f/0xc70 do_iter_read+0x52c/0x14c0 fs/read_write.c:786 vfs_readv fs/read_write.c:906 [inline] do_readv+0x432/0x800 fs/read_write.c:943 __do_sys_readv fs/read_write.c:1034 [inline] __se_sys_readv fs/read_write.c:1031 [inline] __x64_sys_readv+0xe5/0x120 fs/read_write.c:1031 do_syscall_x64 arch/x86/entry/common.c:51 [inline] do_syscall_64+0x54/0xd0 arch/x86/entry/common.c:81 entry_SYSCALL_64_after_hwframe+0x44/0xae Uninit was stored to memory at: ___bpf_prog_run+0x96c/0xb420 kernel/bpf/core.c:1558 __bpf_prog_run32+0x121/0x180 kernel/bpf/core.c:1796 bpf_dispatcher_nop_func include/linux/bpf.h:784 [inline] __bpf_prog_run include/linux/filter.h:626 [inline] bpf_prog_run include/linux/filter.h:633 [inline] __bpf_prog_run_save_cb+0x168/0x580 include/linux/filter.h:756 bpf_prog_run_save_cb include/linux/filter.h:770 [inline] sk_filter_trim_cap+0x3bc/0x8c0 net/core/filter.c:150 sk_filter include/linux/filter.h:905 [inline] netlink_dump+0xe0c/0x16c0 net/netlink/af_netlink.c:2276 netlink_recvmsg+0x1129/0x1c80 net/netlink/af_netlink.c:2002 sock_recvmsg_nosec net/socket.c:948 [inline] sock_recvmsg net/socket.c:966 [inline] sock_read_iter+0x5a9/0x630 net/socket.c:1039 do_iter_readv_writev+0xa7f/0xc70 do_iter_read+0x52c/0x14c0 fs/read_write.c:786 vfs_readv fs/read_write.c:906 [inline] do_readv+0x432/0x800 fs/read_write.c:943 __do_sys_readv fs/read_write.c:1034 [inline] __se_sys_readv fs/read_write.c:1031 [inline] __x64_sys_readv+0xe5/0x120 fs/read_write.c:1031 do_syscall_x64 arch/x86/entry/common.c:51 [inline] do_syscall_64+0x54/0xd0 arch/x86/entry/common.c:81 entry_SYSCALL_64_after_hwframe+0x44/0xae Uninit was created at: slab_post_alloc_hook mm/slab.h:737 [inline] slab_alloc_node mm/slub.c:3244 [inline] __kmalloc_node_track_caller+0xde3/0x14f0 mm/slub.c:4972 kmalloc_reserve net/core/skbuff.c:354 [inline] __alloc_skb+0x545/0xf90 net/core/skbuff.c:426 alloc_skb include/linux/skbuff.h:1158 [inline] netlink_dump+0x30f/0x16c0 net/netlink/af_netlink.c:2242 netlink_recvmsg+0x1129/0x1c80 net/netlink/af_netlink.c:2002 sock_recvmsg_nosec net/socket.c:948 [inline] sock_recvmsg net/socket.c:966 [inline] sock_read_iter+0x5a9/0x630 net/socket.c:1039 do_iter_readv_writev+0xa7f/0xc70 do_iter_read+0x52c/0x14c0 fs/read_write.c:786 vfs_readv fs/read_write.c:906 [inline] do_readv+0x432/0x800 fs/read_write.c:943 __do_sys_readv fs/read_write.c:1034 [inline] __se_sys_readv fs/read_write.c:1031 [inline] __x64_sys_readv+0xe5/0x120 fs/read_write.c:1031 do_syscall_x64 arch/x86/entry/common.c:51 [inline] do_syscall_64+0x54/0xd0 arch/x86/entry/common.c:81 entry_SYSCALL_64_after_hwframe+0x44/0xae CPU: 0 PID: 3470 Comm: syz-executor751 Not tainted 5.17.0-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Fixes: db65a3aaf29e ("netlink: Trim skb to alloc size to avoid MSG_TRUNC") Fixes: 9063e21fb026 ("netlink: autosize skb lengthes") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Link: https://lore.kernel.org/r/20220415181442.551228-1-eric.dumazet@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-04-19rtnetlink: return EINVAL when request cannot succeedFlorent Fourcot
A request without interface name/interface index/interface group cannot work. We should return EINVAL Signed-off-by: Florent Fourcot <florent.fourcot@wifirst.fr> Signed-off-by: Brian Baboch <brian.baboch@wifirst.fr> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-04-19rtnetlink: return ENODEV when IFLA_ALT_IFNAME is used in dellinkFlorent Fourcot
If IFLA_ALT_IFNAME is set and given interface is not found, we should return ENODEV and be consistent with IFLA_IFNAME behaviour This commit extends feature of commit 76c9ac0ee878, "net: rtnetlink: add possibility to use alternative names as message handle" CC: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Florent Fourcot <florent.fourcot@wifirst.fr> Signed-off-by: Brian Baboch <brian.baboch@wifirst.fr> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-04-19rtnetlink: enable alt_ifname for setlink/newlinkFlorent Fourcot
buffer called "ifname" given in function rtnl_dev_get is always valid when called by setlink/newlink, but contains only empty string when IFLA_IFNAME is not given. So IFLA_ALT_IFNAME is always ignored This patch fixes rtnl_dev_get function with a remove of ifname argument, and move ifname copy in do_setlink when required. It extends feature of commit 76c9ac0ee878, "net: rtnetlink: add possibility to use alternative names as message handle"" CC: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Florent Fourcot <florent.fourcot@wifirst.fr> Signed-off-by: Brian Baboch <brian.baboch@wifirst.fr> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-04-19rtnetlink: return ENODEV when ifname does not exist and group is givenFlorent Fourcot
When the interface does not exist, and a group is given, the given parameters are being set to all interfaces of the given group. The given IFNAME/ALT_IF_NAME are being ignored in that case. That can be dangerous since a typo (or a deleted interface) can produce weird side effects for caller: Case 1: IFLA_IFNAME=valid_interface IFLA_GROUP=1 MTU=1234 Case 1 will update MTU and group of the given interface "valid_interface". Case 2: IFLA_IFNAME=doesnotexist IFLA_GROUP=1 MTU=1234 Case 2 will update MTU of all interfaces in group 1. IFLA_IFNAME is ignored in this case This behaviour is not consistent and dangerous. In order to fix this issue, we now return ENODEV when the given IFNAME does not exist. Signed-off-by: Florent Fourcot <florent.fourcot@wifirst.fr> Signed-off-by: Brian Baboch <brian.baboch@wifirst.fr> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-04-19net: sched: support hash selecting tx queueTonghao Zhang
This patch allows users to pick queue_mapping, range from A to B. Then we can load balance packets from A to B tx queue. The range is an unsigned 16bit value in decimal format. $ tc filter ... action skbedit queue_mapping skbhash A B "skbedit queue_mapping QUEUE_MAPPING" (from "man 8 tc-skbedit") is enhanced with flags: SKBEDIT_F_TXQ_SKBHASH +----+ +----+ +----+ | P1 | | P2 | | Pn | +----+ +----+ +----+ | | | +-----------+-----------+ | | clsact/skbedit | MQ v +-----------+-----------+ | q0 | qn | qm v v v HTB/FQ FIFO ... FIFO For example: If P1 sends out packets to different Pods on other host, and we want distribute flows from qn - qm. Then we can use skb->hash as hash. setup commands: $ NETDEV=eth0 $ ip netns add n1 $ ip link add ipv1 link $NETDEV type ipvlan mode l2 $ ip link set ipv1 netns n1 $ ip netns exec n1 ifconfig ipv1 2.2.2.100/24 up $ tc qdisc add dev $NETDEV clsact $ tc filter add dev $NETDEV egress protocol ip prio 1 \ flower skip_hw src_ip 2.2.2.100 action skbedit queue_mapping skbhash 2 6 $ tc qdisc add dev $NETDEV handle 1: root mq $ tc qdisc add dev $NETDEV parent 1:1 handle 2: htb $ tc class add dev $NETDEV parent 2: classid 2:1 htb rate 100kbit $ tc class add dev $NETDEV parent 2: classid 2:2 htb rate 200kbit $ tc qdisc add dev $NETDEV parent 1:2 tbf rate 100mbit burst 100mb latency 1 $ tc qdisc add dev $NETDEV parent 1:3 pfifo $ tc qdisc add dev $NETDEV parent 1:4 pfifo $ tc qdisc add dev $NETDEV parent 1:5 pfifo $ tc qdisc add dev $NETDEV parent 1:6 pfifo $ tc qdisc add dev $NETDEV parent 1:7 pfifo $ ip netns exec n1 iperf3 -c 2.2.2.1 -i 1 -t 10 -P 10 pick txqueue from 2 - 6: $ ethtool -S $NETDEV | grep -i tx_queue_[0-9]_bytes tx_queue_0_bytes: 42 tx_queue_1_bytes: 0 tx_queue_2_bytes: 11442586444 tx_queue_3_bytes: 7383615334 tx_queue_4_bytes: 3981365579 tx_queue_5_bytes: 3983235051 tx_queue_6_bytes: 6706236461 tx_queue_7_bytes: 42 tx_queue_8_bytes: 0 tx_queue_9_bytes: 0 txqueues 2 - 6 are mapped to classid 1:3 - 1:7 $ tc -s class show dev $NETDEV ... class mq 1:3 root leaf 8002: Sent 11949133672 bytes 7929798 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 class mq 1:4 root leaf 8003: Sent 7710449050 bytes 5117279 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 class mq 1:5 root leaf 8004: Sent 4157648675 bytes 2758990 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 class mq 1:6 root leaf 8005: Sent 4159632195 bytes 2759990 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 class mq 1:7 root leaf 8006: Sent 7003169603 bytes 4646912 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 ... Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Jiri Pirko <jiri@resnulli.us> Cc: "David S. Miller" <davem@davemloft.net> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Jonathan Lemon <jonathan.lemon@gmail.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Alexander Lobakin <alobakin@pm.me> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Talal Ahmad <talalahmad@google.com> Cc: Kevin Hao <haokexin@gmail.com> Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org> Cc: Kees Cook <keescook@chromium.org> Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com> Cc: Antoine Tenart <atenart@kernel.org> Cc: Wei Wang <weiwan@google.com> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-04-19net: sched: use queue_mapping to pick tx queueTonghao Zhang
This patch fixes issue: * If we install tc filters with act_skbedit in clsact hook. It doesn't work, because netdev_core_pick_tx() overwrites queue_mapping. $ tc filter ... action skbedit queue_mapping 1 And this patch is useful: * We can use FQ + EDT to implement efficient policies. Tx queues are picked by xps, ndo_select_queue of netdev driver, or skb hash in netdev_core_pick_tx(). In fact, the netdev driver, and skb hash are _not_ under control. xps uses the CPUs map to select Tx queues, but we can't figure out which task_struct of pod/containter running on this cpu in most case. We can use clsact filters to classify one pod/container traffic to one Tx queue. Why ? In containter networking environment, there are two kinds of pod/ containter/net-namespace. One kind (e.g. P1, P2), the high throughput is key in these applications. But avoid running out of network resource, the outbound traffic of these pods is limited, using or sharing one dedicated Tx queues assigned HTB/TBF/FQ Qdisc. Other kind of pods (e.g. Pn), the low latency of data access is key. And the traffic is not limited. Pods use or share other dedicated Tx queues assigned FIFO Qdisc. This choice provides two benefits. First, contention on the HTB/FQ Qdisc lock is significantly reduced since fewer CPUs contend for the same queue. More importantly, Qdisc contention can be eliminated completely if each CPU has its own FIFO Qdisc for the second kind of pods. There must be a mechanism in place to support classifying traffic based on pods/container to different Tx queues. Note that clsact is outside of Qdisc while Qdisc can run a classifier to select a sub-queue under the lock. In general recording the decision in the skb seems a little heavy handed. This patch introduces a per-CPU variable, suggested by Eric. The xmit.skip_txqueue flag is firstly cleared in __dev_queue_xmit(). - Tx Qdisc may install that skbedit actions, then xmit.skip_txqueue flag is set in qdisc->enqueue() though tx queue has been selected in netdev_tx_queue_mapping() or netdev_core_pick_tx(). That flag is cleared firstly in __dev_queue_xmit(), is useful: - Avoid picking Tx queue with netdev_tx_queue_mapping() in next netdev in such case: eth0 macvlan - eth0.3 vlan - eth0 ixgbe-phy: For example, eth0, macvlan in pod, which root Qdisc install skbedit queue_mapping, send packets to eth0.3, vlan in host. In __dev_queue_xmit() of eth0.3, clear the flag, does not select tx queue according to skb->queue_mapping because there is no filters in clsact or tx Qdisc of this netdev. Same action taked in eth0, ixgbe in Host. - Avoid picking Tx queue for next packet. If we set xmit.skip_txqueue in tx Qdisc (qdisc->enqueue()), the proper way to clear it is clearing it in __dev_queue_xmit when processing next packets. For performance reasons, use the static key. If user does not config the NET_EGRESS, the patch will not be compiled. +----+ +----+ +----+ | P1 | | P2 | | Pn | +----+ +----+ +----+ | | | +-----------+-----------+ | | clsact/skbedit | MQ v +-----------+-----------+ | q0 | q1 | qn v v v HTB/FQ HTB/FQ ... FIFO Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Jiri Pirko <jiri@resnulli.us> Cc: "David S. Miller" <davem@davemloft.net> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Jonathan Lemon <jonathan.lemon@gmail.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Alexander Lobakin <alobakin@pm.me> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Talal Ahmad <talalahmad@google.com> Cc: Kevin Hao <haokexin@gmail.com> Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org> Cc: Kees Cook <keescook@chromium.org> Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com> Cc: Antoine Tenart <atenart@kernel.org> Cc: Wei Wang <weiwan@google.com> Cc: Arnd Bergmann <arnd@arndb.de> Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-04-19ipvs: correctly print the memory size of ip_vs_conn_tabPengcheng Yang
The memory size of ip_vs_conn_tab changed after we use hlist instead of list. Fixes: 731109e78415 ("ipvs: use hlist instead of list") Signed-off-by: Pengcheng Yang <yangpc@wangsu.com> Acked-by: Julian Anastasov <ja@ssi.bg> Acked-by: Simon Horman <horms@verge.net.au> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-04-19net: dsa: hellcreek: Calculate checksums in taggerKurt Kanzenbach
In case the checksum calculation is offloaded to the DSA master network interface, it will include the switch trailing tag. As soon as the switch strips that tag on egress, the calculated checksum is wrong. Therefore, add the checksum calculation to the tagger (if required) before adding the switch tag. This way, the hellcreek code works with all DSA master interfaces regardless of their declared feature set. Fixes: 01ef09caad66 ("net: dsa: Add tag handling for Hirschmann Hellcreek switches") Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Link: https://lore.kernel.org/r/20220415103320.90657-1-kurt@linutronix.de Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-04-18devlink: add port to line card relationship setJiri Pirko
In order to properly inform user about relationship between port and line card, introduce a driver API to set line card for a port. Use this information to extend port devlink netlink message by line card index and also include the line card index into phys_port_name and by that into a netdevice name. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-18devlink: implement line card active stateJiri Pirko
Allow driver to mark a line card as active. Expose this state to the userspace over devlink netlink interface with proper notifications. 'active' state means that line card was plugged in after being provisioned. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-18devlink: implement line card provisioningJiri Pirko
In order to be able to configure all needed stuff on a port/netdevice of a line card without the line card being present, introduce line card provisioning. Basically by setting a type, provisioning process will start and driver is supposed to create a placeholder for instances (ports/netdevices) for a line card type. Allow the user to query the supported line card types over line card get command. Then implement two netlink command SET to allow user to set/unset the card type. On the driver API side, add provision/unprovision ops and supported types array to be advertised. Upon provision op call, the driver should take care of creating the instances for the particular line card type. Introduce provision_set/clear() functions to be called by the driver once the provisioning/unprovisioning is done on its side. These helpers are not to be called directly due to the async nature of provisioning. Example: $ devlink port # No ports are listed $ devlink lc pci/0000:01:00.0: lc 1 state unprovisioned supported_types: 16x100G lc 2 state unprovisioned supported_types: 16x100G lc 3 state unprovisioned supported_types: 16x100G lc 4 state unprovisioned supported_types: 16x100G lc 5 state unprovisioned supported_types: 16x100G lc 6 state unprovisioned supported_types: 16x100G lc 7 state unprovisioned supported_types: 16x100G lc 8 state unprovisioned supported_types: 16x100G $ devlink lc set pci/0000:01:00.0 lc 8 type 16x100G $ devlink lc show pci/0000:01:00.0 lc 8 pci/0000:01:00.0: lc 8 state active type 16x100G supported_types: 16x100G $ devlink port pci/0000:01:00.0/0: type notset flavour cpu port 0 splittable false pci/0000:01:00.0/53: type eth netdev enp1s0nl8p1 flavour physical lc 8 port 1 splittable true lanes 4 pci/0000:01:00.0/54: type eth netdev enp1s0nl8p2 flavour physical lc 8 port 2 splittable true lanes 4 pci/0000:01:00.0/55: type eth netdev enp1s0nl8p3 flavour physical lc 8 port 3 splittable true lanes 4 pci/0000:01:00.0/56: type eth netdev enp1s0nl8p4 flavour physical lc 8 port 4 splittable true lanes 4 pci/0000:01:00.0/57: type eth netdev enp1s0nl8p5 flavour physical lc 8 port 5 splittable true lanes 4 pci/0000:01:00.0/58: type eth netdev enp1s0nl8p6 flavour physical lc 8 port 6 splittable true lanes 4 pci/0000:01:00.0/59: type eth netdev enp1s0nl8p7 flavour physical lc 8 port 7 splittable true lanes 4 pci/0000:01:00.0/60: type eth netdev enp1s0nl8p8 flavour physical lc 8 port 8 splittable true lanes 4 pci/0000:01:00.0/61: type eth netdev enp1s0nl8p9 flavour physical lc 8 port 9 splittable true lanes 4 pci/0000:01:00.0/62: type eth netdev enp1s0nl8p10 flavour physical lc 8 port 10 splittable true lanes 4 pci/0000:01:00.0/63: type eth netdev enp1s0nl8p11 flavour physical lc 8 port 11 splittable true lanes 4 pci/0000:01:00.0/64: type eth netdev enp1s0nl8p12 flavour physical lc 8 port 12 splittable true lanes 4 pci/0000:01:00.0/125: type eth netdev enp1s0nl8p13 flavour physical lc 8 port 13 splittable true lanes 4 pci/0000:01:00.0/126: type eth netdev enp1s0nl8p14 flavour physical lc 8 port 14 splittable true lanes 4 pci/0000:01:00.0/127: type eth netdev enp1s0nl8p15 flavour physical lc 8 port 15 splittable true lanes 4 pci/0000:01:00.0/128: type eth netdev enp1s0nl8p16 flavour physical lc 8 port 16 splittable true lanes 4 $ devlink lc set pci/0000:01:00.0 lc 8 notype Signed-off-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-18devlink: add support to create line card and expose to userJiri Pirko
Extend the devlink API so the driver is going to be able to create and destroy linecard instances. There can be multiple line cards per devlink device. Expose this new type of object over devlink netlink API to the userspace, with notifications. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-18tcp: fix signed/unsigned comparisonEric Dumazet
Kernel test robot reported: smatch warnings: net/ipv4/tcp_input.c:5966 tcp_rcv_established() warn: unsigned 'reason' is never less than zero. I actually had one packetdrill failing because of this bug, and was about to send the fix :) v2: Andreas Schwab also pointed out that @reason needs to be negated before we reach tcp_drop_reason() Fixes: 4b506af9c5b8 ("tcp: add two drop reasons for tcp_ack()") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: kernel test robot <lkp@intel.com> Reported-by: Andreas Schwab <schwab@linux-m68k.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-17batman-adv: Don't skb_split skbuffs with frag_listSven Eckelmann
The receiving interface might have used GRO to receive more fragments than MAX_SKB_FRAGS fragments. In this case, these will not be stored in skb_shinfo(skb)->frags but merged into the frag list. batman-adv relies on the function skb_split to split packets up into multiple smaller packets which are not larger than the MTU on the outgoing interface. But this function cannot handle frag_list entries and is only operating on skb_shinfo(skb)->frags. If it is still trying to split such an skb and xmit'ing it on an interface without support for NETIF_F_FRAGLIST, then validate_xmit_skb() will try to linearize it. But this fails due to inconsistent information. And __pskb_pull_tail will trigger a BUG_ON after skb_copy_bits() returns an error. In case of entries in frag_list, just linearize the skb before operating on it with skb_split(). Reported-by: Felix Kaechele <felix@kaechele.ca> Fixes: c6c8fea29769 ("net: Add batman-adv meshing protocol") Signed-off-by: Sven Eckelmann <sven@narfation.org> Tested-by: Felix Kaechele <felix@kaechele.ca> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2022-04-17can: isotp: stop timeout monitoring when no first frame was sentOliver Hartkopp
The first attempt to fix a the 'impossible' WARN_ON_ONCE(1) in isotp_tx_timer_handler() focussed on the identical CAN IDs created by the syzbot reproducer and lead to upstream fix/commit 3ea566422cbd ("can: isotp: sanitize CAN ID checks in isotp_bind()"). But this did not catch the root cause of the wrong tx.state in the tx_timer handler. In the isotp 'first frame' case a timeout monitoring needs to be started before the 'first frame' is send. But when this sending failed the timeout monitoring for this specific frame has to be disabled too. Otherwise the tx_timer is fired with the 'warn me' tx.state of ISOTP_IDLE. Fixes: e057dd3fc20f ("can: add ISO 15765-2:2016 transport protocol") Link: https://lore.kernel.org/all/20220405175112.2682-1-socketcan@hartkopp.net Reported-by: syzbot+2339c27f5c66c652843e@syzkaller.appspotmail.com Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2022-04-17tcp: add drop reason support to tcp_ofo_queue()Eric Dumazet
packets in OFO queue might be redundant, and dropped. tcp_drop() is no longer needed. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-17tcp: add drop reasons to tcp_rcv_synsent_state_process()Eric Dumazet
Re-use existing reasons. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-17tcp: make tcp_rcv_synsent_state_process() drop monitor friendEric Dumazet
1) A valid RST packet should be consumed, to not confuse drop monitor. 2) Same remark for packet validating cross syn setup, even if we might ignore part of it. 3) When third packet of 3WHS is delayed, do not pretend the SYNACK was dropped. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-17tcp: add drop reason support to tcp_prune_ofo_queue()Eric Dumazet
Add one reason for packets dropped from OFO queue because of memory pressure. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-17tcp: add two drop reasons for tcp_ack()Eric Dumazet
Add TCP_TOO_OLD_ACK and TCP_ACK_UNSENT_DATA drop reasons so that tcp_rcv_established() can report them. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-17tcp: add drop reasons to tcp_rcv_state_process()Eric Dumazet
Add basic support for drop reasons in tcp_rcv_state_process() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-17tcp: make tcp_rcv_state_process() drop monitor friendlyEric Dumazet
tcp_rcv_state_process() incorrectly drops packets instead of consuming it, making drop monitor very noisy, if not unusable. Calling tcp_time_wait() or tcp_done() is part of standard behavior, packets triggering these actions were not dropped. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-17tcp: add drop reason support to tcp_validate_incoming()Eric Dumazet
Creates four new drop reasons for the following cases: 1) packet being rejected by RFC 7323 PAWS check 2) packet being rejected by SEQUENCE check 3) Invalid RST packet 4) Invalid SYN packet Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-17tcp: get rid of rst_seq_matchEric Dumazet
Small cleanup in tcp_validate_incoming(), no need for rst_seq_match setting and testing. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-17tcp: consume incoming skb leading to a resetEric Dumazet
Whenever tcp_validate_incoming() handles a valid RST packet, we should not pretend the packet was dropped. Create a special section at the end of tcp_validate_incoming() to handle this case. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-17net/ipv6: Introduce accept_unsolicited_na knob to implement router-side ↵Arun Ajith S
changes for RFC9131 Add a new neighbour cache entry in STALE state for routers on receiving an unsolicited (gratuitous) neighbour advertisement with target link-layer-address option specified. This is similar to the arp_accept configuration for IPv4. A new sysctl endpoint is created to turn on this behaviour: /proc/sys/net/ipv6/conf/interface/accept_unsolicited_na. Signed-off-by: Arun Ajith S <aajith@arista.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-15ipv6: make ip6_rt_gc_expire an atomic_tEric Dumazet
Reads and Writes to ip6_rt_gc_expire always have been racy, as syzbot reported lately [1] There is a possible risk of under-flow, leading to unexpected high value passed to fib6_run_gc(), although I have not observed this in the field. Hosts hitting ip6_dst_gc() very hard are under pretty bad state anyway. [1] BUG: KCSAN: data-race in ip6_dst_gc / ip6_dst_gc read-write to 0xffff888102110744 of 4 bytes by task 13165 on cpu 1: ip6_dst_gc+0x1f3/0x220 net/ipv6/route.c:3311 dst_alloc+0x9b/0x160 net/core/dst.c:86 ip6_dst_alloc net/ipv6/route.c:344 [inline] icmp6_dst_alloc+0xb2/0x360 net/ipv6/route.c:3261 mld_sendpack+0x2b9/0x580 net/ipv6/mcast.c:1807 mld_send_cr net/ipv6/mcast.c:2119 [inline] mld_ifc_work+0x576/0x800 net/ipv6/mcast.c:2651 process_one_work+0x3d3/0x720 kernel/workqueue.c:2289 worker_thread+0x618/0xa70 kernel/workqueue.c:2436 kthread+0x1a9/0x1e0 kernel/kthread.c:376 ret_from_fork+0x1f/0x30 read-write to 0xffff888102110744 of 4 bytes by task 11607 on cpu 0: ip6_dst_gc+0x1f3/0x220 net/ipv6/route.c:3311 dst_alloc+0x9b/0x160 net/core/dst.c:86 ip6_dst_alloc net/ipv6/route.c:344 [inline] icmp6_dst_alloc+0xb2/0x360 net/ipv6/route.c:3261 mld_sendpack+0x2b9/0x580 net/ipv6/mcast.c:1807 mld_send_cr net/ipv6/mcast.c:2119 [inline] mld_ifc_work+0x576/0x800 net/ipv6/mcast.c:2651 process_one_work+0x3d3/0x720 kernel/workqueue.c:2289 worker_thread+0x618/0xa70 kernel/workqueue.c:2436 kthread+0x1a9/0x1e0 kernel/kthread.c:376 ret_from_fork+0x1f/0x30 value changed: 0x00000bb3 -> 0x00000ba9 Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 11607 Comm: kworker/0:21 Not tainted 5.18.0-rc1-syzkaller-00037-g42e7a03d3bad-dirty #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Workqueue: mld mld_ifc_work Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20220413181333.649424-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-04-15net: Handle l3mdev in ip_tunnel_init_flowDavid Ahern
Ido reported that the commit referenced in the Fixes tag broke a gre use case with dummy devices. Add a check to ip_tunnel_init_flow to see if the oif is an l3mdev port and if so set the oif to 0 to avoid the oif comparison in fib_lookup_good_nhc. Fixes: 40867d74c374 ("net: Add l3mdev index to flow struct and avoid oif reset for port devices") Reported-by: Ido Schimmel <idosch@idosch.org> Signed-off-by: David Ahern <dsahern@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-04-15l3mdev: l3mdev_master_upper_ifindex_by_index_rcu should be using ↵David Ahern
netdev_master_upper_dev_get_rcu Next patch uses l3mdev_master_upper_ifindex_by_index_rcu which throws a splat with debug kernels: [13783.087570] ------------[ cut here ]------------ [13783.093974] RTNL: assertion failed at net/core/dev.c (6702) [13783.100761] WARNING: CPU: 3 PID: 51132 at net/core/dev.c:6702 netdev_master_upper_dev_get+0x16a/0x1a0 [13783.184226] CPU: 3 PID: 51132 Comm: kworker/3:3 Not tainted 5.17.0-custom-100090-g6f963aafb1cc #682 [13783.194788] Hardware name: Mellanox Technologies Ltd. MSN2010/SA002610, BIOS 5.6.5 08/24/2017 [13783.204755] Workqueue: mld mld_ifc_work [ipv6] [13783.210338] RIP: 0010:netdev_master_upper_dev_get+0x16a/0x1a0 [13783.217209] Code: 0f 85 e3 fe ff ff e8 65 ac ec fe ba 2e 1a 00 00 48 c7 c6 60 6f 38 83 48 c7 c7 c0 70 38 83 c6 05 5e b5 d7 01 01 e8 c6 29 52 00 <0f> 0b e9 b8 fe ff ff e8 5a 6c 35 ff e9 1c ff ff ff 48 89 ef e8 7d [13783.238659] RSP: 0018:ffffc9000b37f5a8 EFLAGS: 00010286 [13783.244995] RAX: 0000000000000000 RBX: ffff88812ee5c000 RCX: 0000000000000000 [13783.253379] RDX: ffff88811ce09d40 RSI: ffffffff812d0fcd RDI: fffff5200166fea7 [13783.261769] RBP: 0000000000000000 R08: 0000000000000001 R09: ffff8882375f4287 [13783.270138] R10: ffffed1046ebe850 R11: 0000000000000001 R12: dffffc0000000000 [13783.278510] R13: 0000000000000275 R14: ffffc9000b37f688 R15: ffff8881273b4af8 [13783.286870] FS: 0000000000000000(0000) GS:ffff888237400000(0000) knlGS:0000000000000000 [13783.296352] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [13783.303177] CR2: 00007ff25fc9b2e8 CR3: 0000000174d23000 CR4: 00000000001006e0 [13783.311546] Call Trace: [13783.314660] <TASK> [13783.317553] l3mdev_master_upper_ifindex_by_index_rcu+0x43/0xe0 ... Change l3mdev_master_upper_ifindex_by_index_rcu to use netdev_master_upper_dev_get_rcu. Fixes: 6a6d6681ac1a ("l3mdev: add function to retreive upper master") Signed-off-by: Ido Schimmel <idosch@idosch.org> Signed-off-by: David Ahern <dsahern@kernel.org> Cc: Alexis Bauvin <abauvin@scaleway.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-04-15net/sched: cls_u32: fix possible leak in u32_init_knode()Eric Dumazet
While investigating a related syzbot report, I found that whenever call to tcf_exts_init() from u32_init_knode() is failing, we end up with an elevated refcount on ht->refcnt To avoid that, only increase the refcount after all possible errors have been evaluated. Fixes: b9a24bb76bf6 ("net_sched: properly handle failure case of tcf_exts_init()") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Jiri Pirko <jiri@resnulli.us> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-04-15net/sched: cls_u32: fix netns refcount changes in u32_change()Eric Dumazet
We are now able to detect extra put_net() at the moment they happen, instead of much later in correct code paths. u32_init_knode() / tcf_exts_init() populates the ->exts.net pointer, but as mentioned in tcf_exts_init(), the refcount on netns has not been elevated yet. The refcount is taken only once tcf_exts_get_net() is called. So the two u32_destroy_key() calls from u32_change() are attempting to release an invalid reference on the netns. syzbot report: refcount_t: decrement hit 0; leaking memory. WARNING: CPU: 0 PID: 21708 at lib/refcount.c:31 refcount_warn_saturate+0xbf/0x1e0 lib/refcount.c:31 Modules linked in: CPU: 0 PID: 21708 Comm: syz-executor.5 Not tainted 5.18.0-rc2-next-20220412-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:refcount_warn_saturate+0xbf/0x1e0 lib/refcount.c:31 Code: 1d 14 b6 b2 09 31 ff 89 de e8 6d e9 89 fd 84 db 75 e0 e8 84 e5 89 fd 48 c7 c7 40 aa 26 8a c6 05 f4 b5 b2 09 01 e8 e5 81 2e 05 <0f> 0b eb c4 e8 68 e5 89 fd 0f b6 1d e3 b5 b2 09 31 ff 89 de e8 38 RSP: 0018:ffffc900051af1b0 EFLAGS: 00010286 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 RDX: 0000000000040000 RSI: ffffffff8160a0c8 RDI: fffff52000a35e28 RBP: 0000000000000004 R08: 0000000000000000 R09: 0000000000000000 R10: ffffffff81604a9e R11: 0000000000000000 R12: 1ffff92000a35e3b R13: 00000000ffffffef R14: ffff8880211a0194 R15: ffff8880577d0a00 FS: 00007f25d183e700(0000) GS:ffff8880b9c00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f19c859c028 CR3: 0000000051009000 CR4: 00000000003506f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> __refcount_dec include/linux/refcount.h:344 [inline] refcount_dec include/linux/refcount.h:359 [inline] ref_tracker_free+0x535/0x6b0 lib/ref_tracker.c:118 netns_tracker_free include/net/net_namespace.h:327 [inline] put_net_track include/net/net_namespace.h:341 [inline] tcf_exts_put_net include/net/pkt_cls.h:255 [inline] u32_destroy_key.isra.0+0xa7/0x2b0 net/sched/cls_u32.c:394 u32_change+0xe01/0x3140 net/sched/cls_u32.c:909 tc_new_tfilter+0x98d/0x2200 net/sched/cls_api.c:2148 rtnetlink_rcv_msg+0x80d/0xb80 net/core/rtnetlink.c:6016 netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2495 netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline] netlink_unicast+0x543/0x7f0 net/netlink/af_netlink.c:1345 netlink_sendmsg+0x904/0xe00 net/netlink/af_netlink.c:1921 sock_sendmsg_nosec net/socket.c:705 [inline] sock_sendmsg+0xcf/0x120 net/socket.c:725 ____sys_sendmsg+0x6e2/0x800 net/socket.c:2413 ___sys_sendmsg+0xf3/0x170 net/socket.c:2467 __sys_sendmsg+0xe5/0x1b0 net/socket.c:2496 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7f25d0689049 Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007f25d183e168 EFLAGS: 00000246 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 00007f25d079c030 RCX: 00007f25d0689049 RDX: 0000000000000000 RSI: 0000000020000340 RDI: 0000000000000005 RBP: 00007f25d06e308d R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 00007ffd0b752e3f R14: 00007f25d183e300 R15: 0000000000022000 </TASK> Fixes: 35c55fc156d8 ("cls_u32: use tcf_exts_get_net() before call_rcu()") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Jiri Pirko <jiri@resnulli.us> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-04-15ipv6: fix NULL deref in ip6_rcv_core()Eric Dumazet
idev can be NULL, as the surrounding code suggests. Fixes: 4daf841a2ef3 ("net: ipv6: add skb drop reasons to ip6_rcv_core()") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Menglong Dong <imagedong@tencent.com> Cc: Jiang Biao <benbjiang@tencent.com> Cc: Hao Peng <flyingpeng@tencent.com> Link: https://lore.kernel.org/r/20220413205653.1178458-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-04-15net_sched: make qdisc_reset() smallerEric Dumazet
For some unknown reason qdisc_reset() is using a convoluted way of freeing two lists of skbs. Use __skb_queue_purge() instead. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://lore.kernel.org/r/20220414011004.2378350-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-04-15xsk: Drop ternary operator from xskq_cons_has_entriesMaciej Fijalkowski
Simplify the mentioned helper function by removing ternary operator. The expression that is there outputs the boolean value by itself. This helper might be used in the hot path so this simplification can also be considered as micro optimization. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220413153015.453864-15-maciej.fijalkowski@intel.com
2022-04-15xsk: Diversify return codes in xsk_rcv_check()Maciej Fijalkowski
Inspired by patch that made xdp_do_redirect() return values for XSKMAP more meaningful, return -ENXIO instead of -EINVAL for socket being unbound in xsk_rcv_check() as this is the usual value that is returned for such event. In turn, it is now possible to easily distinguish what went wrong, which is a bit harder when for both cases checked, -EINVAL was returned. Return codes can be counted in a nice way via bpftrace oneliner that Jesper has shown: bpftrace -e 'tracepoint:xdp:xdp_redirect* {@err[-args->err] = count();}' Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Link: https://lore.kernel.org/bpf/20220413153015.453864-3-maciej.fijalkowski@intel.com
2022-04-15xsk: Improve xdp_do_redirect() error codesBjörn Töpel
The error codes returned by xdp_do_redirect() when redirecting a frame to an AF_XDP socket has not been very useful. A driver could not distinguish between different errors. Prior this change the following codes where used: Socket not bound or incorrect queue/netdev: EINVAL XDP frame/AF_XDP buffer size mismatch: ENOSPC Could not allocate buffer (copy mode): ENOSPC AF_XDP Rx buffer full: ENOSPC After this change: Socket not bound or incorrect queue/netdev: EINVAL XDP frame/AF_XDP buffer size mismatch: ENOSPC Could not allocate buffer (copy mode): ENOMEM AF_XDP Rx buffer full: ENOBUFS An AF_XDP zero-copy driver can now potentially determine if the failure was due to a full Rx buffer, and if so stop processing more frames, yielding to the userland AF_XDP application. Signed-off-by: Björn Töpel <bjorn@kernel.org> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Link: https://lore.kernel.org/bpf/20220413153015.453864-2-maciej.fijalkowski@intel.com
2022-04-15net: ethtool: move checks before rtnl_lock() in ethnl_set_ringsJie Wang
Currently these two checks in ethnl_set_rings are added after rtnl_lock() which will do useless works if the request is invalid. So this patch moves these checks before the rtnl_lock() to avoid these costs. Signed-off-by: Jie Wang <wangjie125@huawei.com> Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-04-15net: ethtool: extend ringparam set/get APIs for tx_pushJie Wang
Currently tx push is a standard driver feature which controls use of a fast path descriptor push. So this patch extends the ringparam APIs and data structures to support set/get tx push by ethtool -G/g. Signed-off-by: Jie Wang <wangjie125@huawei.com> Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-04-15openvswitch: fix OOB access in reserve_sfa_size()Paolo Valerio
Given a sufficiently large number of actions, while copying and reserving memory for a new action of a new flow, if next_offset is greater than MAX_ACTIONS_BUFSIZE, the function reserve_sfa_size() does not return -EMSGSIZE as expected, but it allocates MAX_ACTIONS_BUFSIZE bytes increasing actions_len by req_size. This can then lead to an OOB write access, especially when further actions need to be copied. Fix it by rearranging the flow action size check. KASAN splat below: ================================================================== BUG: KASAN: slab-out-of-bounds in reserve_sfa_size+0x1ba/0x380 [openvswitch] Write of size 65360 at addr ffff888147e4001c by task handler15/836 CPU: 1 PID: 836 Comm: handler15 Not tainted 5.18.0-rc1+ #27 ... Call Trace: <TASK> dump_stack_lvl+0x45/0x5a print_report.cold+0x5e/0x5db ? __lock_text_start+0x8/0x8 ? reserve_sfa_size+0x1ba/0x380 [openvswitch] kasan_report+0xb5/0x130 ? reserve_sfa_size+0x1ba/0x380 [openvswitch] kasan_check_range+0xf5/0x1d0 memcpy+0x39/0x60 reserve_sfa_size+0x1ba/0x380 [openvswitch] __add_action+0x24/0x120 [openvswitch] ovs_nla_add_action+0xe/0x20 [openvswitch] ovs_ct_copy_action+0x29d/0x1130 [openvswitch] ? __kernel_text_address+0xe/0x30 ? unwind_get_return_address+0x56/0xa0 ? create_prof_cpu_mask+0x20/0x20 ? ovs_ct_verify+0xf0/0xf0 [openvswitch] ? prep_compound_page+0x198/0x2a0 ? __kasan_check_byte+0x10/0x40 ? kasan_unpoison+0x40/0x70 ? ksize+0x44/0x60 ? reserve_sfa_size+0x75/0x380 [openvswitch] __ovs_nla_copy_actions+0xc26/0x2070 [openvswitch] ? __zone_watermark_ok+0x420/0x420 ? validate_set.constprop.0+0xc90/0xc90 [openvswitch] ? __alloc_pages+0x1a9/0x3e0 ? __alloc_pages_slowpath.constprop.0+0x1da0/0x1da0 ? unwind_next_frame+0x991/0x1e40 ? __mod_node_page_state+0x99/0x120 ? __mod_lruvec_page_state+0x2e3/0x470 ? __kasan_kmalloc_large+0x90/0xe0 ovs_nla_copy_actions+0x1b4/0x2c0 [openvswitch] ovs_flow_cmd_new+0x3cd/0xb10 [openvswitch] ... Cc: stable@vger.kernel.org Fixes: f28cd2af22a0 ("openvswitch: fix flow actions reallocation") Signed-off-by: Paolo Valerio <pvalerio@redhat.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-15ip6_gre: Fix skb_under_panic in __gre6_xmit()Peilin Ye
Feng reported an skb_under_panic BUG triggered by running test_ip6gretap() in tools/testing/selftests/bpf/test_tunnel.sh: [ 82.492551] skbuff: skb_under_panic: text:ffffffffb268bb8e len:403 put:12 head:ffff9997c5480000 data:ffff9997c547fff8 tail:0x18b end:0x2c0 dev:ip6gretap11 <...> [ 82.607380] Call Trace: [ 82.609389] <TASK> [ 82.611136] skb_push.cold.109+0x10/0x10 [ 82.614289] __gre6_xmit+0x41e/0x590 [ 82.617169] ip6gre_tunnel_xmit+0x344/0x3f0 [ 82.620526] dev_hard_start_xmit+0xf1/0x330 [ 82.623882] sch_direct_xmit+0xe4/0x250 [ 82.626961] __dev_queue_xmit+0x720/0xfe0 <...> [ 82.633431] packet_sendmsg+0x96a/0x1cb0 [ 82.636568] sock_sendmsg+0x30/0x40 <...> The following sequence of events caused the BUG: 1. During ip6gretap device initialization, tunnel->tun_hlen (e.g. 4) is calculated based on old flags (see ip6gre_calc_hlen()); 2. packet_snd() reserves header room for skb A, assuming tunnel->tun_hlen is 4; 3. Later (in clsact Qdisc), the eBPF program sets a new tunnel key for skb A using bpf_skb_set_tunnel_key() (see _ip6gretap_set_tunnel()); 4. __gre6_xmit() detects the new tunnel key, and recalculates "tun_hlen" (e.g. 12) based on new flags (e.g. TUNNEL_KEY and TUNNEL_SEQ); 5. gre_build_header() calls skb_push() with insufficient reserved header room, triggering the BUG. As sugguested by Cong, fix it by moving the call to skb_cow_head() after the recalculation of tun_hlen. Reproducer: OBJ=$LINUX/tools/testing/selftests/bpf/test_tunnel_kern.o ip netns add at_ns0 ip link add veth0 type veth peer name veth1 ip link set veth0 netns at_ns0 ip netns exec at_ns0 ip addr add 172.16.1.100/24 dev veth0 ip netns exec at_ns0 ip link set dev veth0 up ip link set dev veth1 up mtu 1500 ip addr add dev veth1 172.16.1.200/24 ip netns exec at_ns0 ip addr add ::11/96 dev veth0 ip netns exec at_ns0 ip link set dev veth0 up ip addr add dev veth1 ::22/96 ip link set dev veth1 up ip netns exec at_ns0 \ ip link add dev ip6gretap00 type ip6gretap seq flowlabel 0xbcdef key 2 \ local ::11 remote ::22 ip netns exec at_ns0 ip addr add dev ip6gretap00 10.1.1.100/24 ip netns exec at_ns0 ip addr add dev ip6gretap00 fc80::100/96 ip netns exec at_ns0 ip link set dev ip6gretap00 up ip link add dev ip6gretap11 type ip6gretap external ip addr add dev ip6gretap11 10.1.1.200/24 ip addr add dev ip6gretap11 fc80::200/24 ip link set dev ip6gretap11 up tc qdisc add dev ip6gretap11 clsact tc filter add dev ip6gretap11 egress bpf da obj $OBJ sec ip6gretap_set_tunnel tc filter add dev ip6gretap11 ingress bpf da obj $OBJ sec ip6gretap_get_tunnel ping6 -c 3 -w 10 -q ::11 Fixes: 6712abc168eb ("ip6_gre: add ip6 gre and gretap collect_md mode") Reported-by: Feng Zhou <zhoufeng.zf@bytedance.com> Co-developed-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Peilin Ye <peilin.ye@bytedance.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-15ip6_gre: Avoid updating tunnel->tun_hlen in __gre6_xmit()Peilin Ye
Do not update tunnel->tun_hlen in data plane code. Use a local variable instead, just like "tunnel_hlen" in net/ipv4/ip_gre.c:gre_fb_xmit(). Co-developed-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Peilin Ye <peilin.ye@bytedance.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-15Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2022-04-14 1) Fix the output interface for VRF cases in xfrm_dst_lookup. From David Ahern. 2) Fix write out of bounds by doing COW on esp output when the packet size is larger than a page. From Sabrina Dubroca. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-15net/packet: fix packet_sock xmit return value checkingHangbin Liu
packet_sock xmit could be dev_queue_xmit, which also returns negative errors. So only checking positive errors is not enough, or userspace sendmsg may return success while packet is not send out. Move the net_xmit_errno() assignment in the braces as checkpatch.pl said do not use assignment in if condition. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: Flavio Leitner <fbl@redhat.com> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-15net/smc: Fix sock leak when release after smc_shutdown()Tony Lu
Since commit e5d5aadcf3cd ("net/smc: fix sk_refcnt underflow on linkdown and fallback"), for a fallback connection, __smc_release() does not call sock_put() if its state is already SMC_CLOSED. When calling smc_shutdown() after falling back, its state is set to SMC_CLOSED but does not call sock_put(), so this patch calls it. Reported-and-tested-by: syzbot+6e29a053eb165bd50de5@syzkaller.appspotmail.com Fixes: e5d5aadcf3cd ("net/smc: fix sk_refcnt underflow on linkdown and fallback") Signed-off-by: Tony Lu <tonylu@linux.alibaba.com> Acked-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-15rxrpc: Restore removed timer deletionDavid Howells
A recent patch[1] from Eric Dumazet flipped the order in which the keepalive timer and the keepalive worker were cancelled in order to fix a syzbot reported issue[2]. Unfortunately, this enables the mirror image bug whereby the timer races with rxrpc_exit_net(), restarting the worker after it has been cancelled: CPU 1 CPU 2 =============== ===================== if (rxnet->live) <INTERRUPT> rxnet->live = false; cancel_work_sync(&rxnet->peer_keepalive_work); rxrpc_queue_work(&rxnet->peer_keepalive_work); del_timer_sync(&rxnet->peer_keepalive_timer); Fix this by restoring the removed del_timer_sync() so that we try to remove the timer twice. If the timer runs again, it should see ->live == false and not restart the worker. Fixes: 1946014ca3b1 ("rxrpc: fix a race in rxrpc_exit_net()") Signed-off-by: David Howells <dhowells@redhat.com> cc: Eric Dumazet <edumazet@google.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org Link: https://lore.kernel.org/r/20220404183439.3537837-1-eric.dumazet@gmail.com/ [1] Link: https://syzkaller.appspot.com/bug?extid=724378c4bb58f703b09a [2] Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-15net: page_pool: introduce ethtool statsLorenzo Bianconi
Introduce page_pool APIs to report stats through ethtool and reduce duplicated code in each driver. Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>