From 672e97ef689a38cb20c2cc6a1814298fea34461e Mon Sep 17 00:00:00 2001 From: Paul Blakey Date: Tue, 18 Oct 2022 10:34:38 +0300 Subject: net: Fix return value of qdisc ingress handling on success Currently qdisc ingress handling (sch_handle_ingress()) doesn't set a return value and it is left to the old return value of the caller (__netif_receive_skb_core()) which is RX drop, so if the packet is consumed, caller will stop and return this value as if the packet was dropped. This causes a problem in the kernel tcp stack when having a egress tc rule forwarding to a ingress tc rule. The tcp stack sending packets on the device having the egress rule will see the packets as not successfully transmitted (although they actually were), will not advance it's internal state of sent data, and packets returning on such tcp stream will be dropped by the tcp stack with reason ack-of-unsent-data. See reproduction in [0] below. Fix that by setting the return value to RX success if the packet was handled successfully. [0] Reproduction steps: $ ip link add veth1 type veth peer name peer1 $ ip link add veth2 type veth peer name peer2 $ ifconfig peer1 5.5.5.6/24 up $ ip netns add ns0 $ ip link set dev peer2 netns ns0 $ ip netns exec ns0 ifconfig peer2 5.5.5.5/24 up $ ifconfig veth2 0 up $ ifconfig veth1 0 up #ingress forwarding veth1 <-> veth2 $ tc qdisc add dev veth2 ingress $ tc qdisc add dev veth1 ingress $ tc filter add dev veth2 ingress prio 1 proto all flower \ action mirred egress redirect dev veth1 $ tc filter add dev veth1 ingress prio 1 proto all flower \ action mirred egress redirect dev veth2 #steal packet from peer1 egress to veth2 ingress, bypassing the veth pipe $ tc qdisc add dev peer1 clsact $ tc filter add dev peer1 egress prio 20 proto ip flower \ action mirred ingress redirect dev veth1 #run iperf and see connection not running $ iperf3 -s& $ ip netns exec ns0 iperf3 -c 5.5.5.6 -i 1 #delete egress rule, and run again, now should work $ tc filter del dev peer1 egress $ ip netns exec ns0 iperf3 -c 5.5.5.6 -i 1 Fixes: f697c3e8b35c ("[NET]: Avoid unnecessary cloning for ingress filtering") Signed-off-by: Paul Blakey Signed-off-by: David S. Miller --- net/core/dev.c | 4 ++++ 1 file changed, 4 insertions(+) (limited to 'net/core/dev.c') diff --git a/net/core/dev.c b/net/core/dev.c index fa53830d0683..3be256051e99 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -5136,11 +5136,13 @@ sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret, case TC_ACT_SHOT: mini_qdisc_qstats_cpu_drop(miniq); kfree_skb_reason(skb, SKB_DROP_REASON_TC_INGRESS); + *ret = NET_RX_DROP; return NULL; case TC_ACT_STOLEN: case TC_ACT_QUEUED: case TC_ACT_TRAP: consume_skb(skb); + *ret = NET_RX_SUCCESS; return NULL; case TC_ACT_REDIRECT: /* skb_mac_header check was done by cls/act_bpf, so @@ -5153,8 +5155,10 @@ sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret, *another = true; break; } + *ret = NET_RX_SUCCESS; return NULL; case TC_ACT_CONSUMED: + *ret = NET_RX_SUCCESS; return NULL; default: break; -- cgit From b5f0de6df6dce8d641ef58ef7012f3304dffb9a1 Mon Sep 17 00:00:00 2001 From: Kees Cook Date: Tue, 18 Oct 2022 02:56:03 -0700 Subject: net: dev: Convert sa_data to flexible array in struct sockaddr One of the worst offenders of "fake flexible arrays" is struct sockaddr, as it is the classic example of why GCC and Clang have been traditionally forced to treat all trailing arrays as fake flexible arrays: in the distant misty past, sa_data became too small, and code started just treating it as a flexible array, even though it was fixed-size. The special case by the compiler is specifically that sizeof(sa->sa_data) and FORTIFY_SOURCE (which uses __builtin_object_size(sa->sa_data, 1)) do not agree (14 and -1 respectively), which makes FORTIFY_SOURCE treat it as a flexible array. However, the coming -fstrict-flex-arrays compiler flag will remove these special cases so that FORTIFY_SOURCE can gain coverage over all the trailing arrays in the kernel that are _not_ supposed to be treated as a flexible array. To deal with this change, convert sa_data to a true flexible array. To keep the structure size the same, move sa_data into a union with a newly introduced sa_data_min with the original size. The result is that FORTIFY_SOURCE can continue to have no idea how large sa_data may actually be, but anything using sizeof(sa->sa_data) must switch to sizeof(sa->sa_data_min). Cc: Jens Axboe Cc: Pavel Begunkov Cc: David Ahern Cc: Dylan Yudaken Cc: Yajun Deng Cc: Petr Machata Cc: Hangbin Liu Cc: Leon Romanovsky Cc: syzbot Cc: Willem de Bruijn Cc: Pablo Neira Ayuso Signed-off-by: Kees Cook Link: https://lore.kernel.org/r/20221018095503.never.671-kees@kernel.org Signed-off-by: Jakub Kicinski --- net/core/dev.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'net/core/dev.c') diff --git a/net/core/dev.c b/net/core/dev.c index 3be256051e99..fff62068a53d 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -8822,7 +8822,7 @@ EXPORT_SYMBOL(dev_set_mac_address_user); int dev_get_mac_address(struct sockaddr *sa, struct net *net, char *dev_name) { - size_t size = sizeof(sa->sa_data); + size_t size = sizeof(sa->sa_data_min); struct net_device *dev; int ret = 0; -- cgit From d120d1a63b2c484d6175873d8ee736a633f74b70 Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Wed, 26 Oct 2022 15:22:15 +0200 Subject: net: Remove the obsolte u64_stats_fetch_*_irq() users (net). Now that the 32bit UP oddity is gone and 32bit uses always a sequence count, there is no need for the fetch_irq() variants anymore. Convert to the regular interface. Signed-off-by: Thomas Gleixner Signed-off-by: Sebastian Andrzej Siewior Acked-by: Peter Zijlstra (Intel) Signed-off-by: Jakub Kicinski --- net/core/dev.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'net/core/dev.c') diff --git a/net/core/dev.c b/net/core/dev.c index fff62068a53d..cfb68db040a4 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -10477,12 +10477,12 @@ void dev_fetch_sw_netstats(struct rtnl_link_stats64 *s, stats = per_cpu_ptr(netstats, cpu); do { - start = u64_stats_fetch_begin_irq(&stats->syncp); + start = u64_stats_fetch_begin(&stats->syncp); rx_packets = u64_stats_read(&stats->rx_packets); rx_bytes = u64_stats_read(&stats->rx_bytes); tx_packets = u64_stats_read(&stats->tx_packets); tx_bytes = u64_stats_read(&stats->tx_bytes); - } while (u64_stats_fetch_retry_irq(&stats->syncp, start)); + } while (u64_stats_fetch_retry(&stats->syncp, start)); s->rx_packets += rx_packets; s->rx_bytes += rx_bytes; -- cgit From 1d997f1013079c05b642c739901e3584a3ae558d Mon Sep 17 00:00:00 2001 From: Hangbin Liu Date: Fri, 28 Oct 2022 04:42:21 -0400 Subject: rtnetlink: pass netlink message header and portid to rtnl_configure_link() This patch pass netlink message header and portid to rtnl_configure_link() All the functions in this call chain need to add the parameters so we can use them in the last call rtnl_notify(), and notify the userspace about the new link info if NLM_F_ECHO flag is set. - rtnl_configure_link() - __dev_notify_flags() - rtmsg_ifinfo() - rtmsg_ifinfo_event() - rtmsg_ifinfo_build_skb() - rtmsg_ifinfo_send() - rtnl_notify() Also move __dev_notify_flags() declaration to net/core/dev.h, as Jakub suggested. Signed-off-by: Hangbin Liu Reviewed-by: Guillaume Nault Signed-off-by: Jakub Kicinski --- net/core/dev.c | 25 +++++++++++++------------ 1 file changed, 13 insertions(+), 12 deletions(-) (limited to 'net/core/dev.c') diff --git a/net/core/dev.c b/net/core/dev.c index cfb68db040a4..19e0db536022 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -1333,7 +1333,7 @@ void netdev_state_change(struct net_device *dev) call_netdevice_notifiers_info(NETDEV_CHANGE, &change_info.info); - rtmsg_ifinfo(RTM_NEWLINK, dev, 0, GFP_KERNEL); + rtmsg_ifinfo(RTM_NEWLINK, dev, 0, GFP_KERNEL, 0, NULL); } } EXPORT_SYMBOL(netdev_state_change); @@ -1469,7 +1469,7 @@ int dev_open(struct net_device *dev, struct netlink_ext_ack *extack) if (ret < 0) return ret; - rtmsg_ifinfo(RTM_NEWLINK, dev, IFF_UP|IFF_RUNNING, GFP_KERNEL); + rtmsg_ifinfo(RTM_NEWLINK, dev, IFF_UP | IFF_RUNNING, GFP_KERNEL, 0, NULL); call_netdevice_notifiers(NETDEV_UP, dev); return ret; @@ -1541,7 +1541,7 @@ void dev_close_many(struct list_head *head, bool unlink) __dev_close_many(head); list_for_each_entry_safe(dev, tmp, head, close_list) { - rtmsg_ifinfo(RTM_NEWLINK, dev, IFF_UP|IFF_RUNNING, GFP_KERNEL); + rtmsg_ifinfo(RTM_NEWLINK, dev, IFF_UP | IFF_RUNNING, GFP_KERNEL, 0, NULL); call_netdevice_notifiers(NETDEV_DOWN, dev); if (unlink) list_del_init(&dev->close_list); @@ -8351,7 +8351,7 @@ static int __dev_set_promiscuity(struct net_device *dev, int inc, bool notify) dev_change_rx_flags(dev, IFF_PROMISC); } if (notify) - __dev_notify_flags(dev, old_flags, IFF_PROMISC); + __dev_notify_flags(dev, old_flags, IFF_PROMISC, 0, NULL); return 0; } @@ -8406,7 +8406,7 @@ static int __dev_set_allmulti(struct net_device *dev, int inc, bool notify) dev_set_rx_mode(dev); if (notify) __dev_notify_flags(dev, old_flags, - dev->gflags ^ old_gflags); + dev->gflags ^ old_gflags, 0, NULL); } return 0; } @@ -8569,12 +8569,13 @@ int __dev_change_flags(struct net_device *dev, unsigned int flags, } void __dev_notify_flags(struct net_device *dev, unsigned int old_flags, - unsigned int gchanges) + unsigned int gchanges, u32 portid, + const struct nlmsghdr *nlh) { unsigned int changes = dev->flags ^ old_flags; if (gchanges) - rtmsg_ifinfo(RTM_NEWLINK, dev, gchanges, GFP_ATOMIC); + rtmsg_ifinfo(RTM_NEWLINK, dev, gchanges, GFP_ATOMIC, portid, nlh); if (changes & IFF_UP) { if (dev->flags & IFF_UP) @@ -8616,7 +8617,7 @@ int dev_change_flags(struct net_device *dev, unsigned int flags, return ret; changes = (old_flags ^ dev->flags) | (old_gflags ^ dev->gflags); - __dev_notify_flags(dev, old_flags, changes); + __dev_notify_flags(dev, old_flags, changes, 0, NULL); return ret; } EXPORT_SYMBOL(dev_change_flags); @@ -10101,7 +10102,7 @@ int register_netdevice(struct net_device *dev) */ if (!dev->rtnl_link_ops || dev->rtnl_link_state == RTNL_LINK_INITIALIZED) - rtmsg_ifinfo(RTM_NEWLINK, dev, ~0U, GFP_KERNEL); + rtmsg_ifinfo(RTM_NEWLINK, dev, ~0U, GFP_KERNEL, 0, NULL); out: return ret; @@ -10849,7 +10850,7 @@ void unregister_netdevice_many(struct list_head *head) if (!dev->rtnl_link_ops || dev->rtnl_link_state == RTNL_LINK_INITIALIZED) skb = rtmsg_ifinfo_build_skb(RTM_DELLINK, dev, ~0U, 0, - GFP_KERNEL, NULL, 0); + GFP_KERNEL, NULL, 0, 0, 0); /* * Flush the unicast and multicast chains @@ -10864,7 +10865,7 @@ void unregister_netdevice_many(struct list_head *head) dev->netdev_ops->ndo_uninit(dev); if (skb) - rtmsg_ifinfo_send(skb, dev, GFP_KERNEL); + rtmsg_ifinfo_send(skb, dev, GFP_KERNEL, 0, NULL); /* Notifier chain MUST detach us all upper devices. */ WARN_ON(netdev_has_any_upper_dev(dev)); @@ -11042,7 +11043,7 @@ int __dev_change_net_namespace(struct net_device *dev, struct net *net, * Prevent userspace races by waiting until the network * device is fully setup before sending notifications. */ - rtmsg_ifinfo(RTM_NEWLINK, dev, ~0U, GFP_KERNEL); + rtmsg_ifinfo(RTM_NEWLINK, dev, ~0U, GFP_KERNEL, 0, NULL); synchronize_net(); err = 0; -- cgit From 77f4aa9a2a1766a0b9343fd812b71f18d05178da Mon Sep 17 00:00:00 2001 From: Hangbin Liu Date: Fri, 28 Oct 2022 04:42:22 -0400 Subject: net: add new helper unregister_netdevice_many_notify Add new helper unregister_netdevice_many_notify(), pass netlink message header and portid, which could be used to notify userspace when flag NLM_F_ECHO is set. Make the unregister_netdevice_many() as a wrapper of new function unregister_netdevice_many_notify(). Suggested-by: Guillaume Nault Signed-off-by: Hangbin Liu Reviewed-by: Guillaume Nault Signed-off-by: Jakub Kicinski --- net/core/dev.c | 27 +++++++++++++++++---------- 1 file changed, 17 insertions(+), 10 deletions(-) (limited to 'net/core/dev.c') diff --git a/net/core/dev.c b/net/core/dev.c index 19e0db536022..2e4f1c97b59e 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -10781,14 +10781,8 @@ void unregister_netdevice_queue(struct net_device *dev, struct list_head *head) } EXPORT_SYMBOL(unregister_netdevice_queue); -/** - * unregister_netdevice_many - unregister many devices - * @head: list of devices - * - * Note: As most callers use a stack allocated list_head, - * we force a list_del() to make sure stack wont be corrupted later. - */ -void unregister_netdevice_many(struct list_head *head) +void unregister_netdevice_many_notify(struct list_head *head, + u32 portid, const struct nlmsghdr *nlh) { struct net_device *dev, *tmp; LIST_HEAD(close_head); @@ -10850,7 +10844,8 @@ void unregister_netdevice_many(struct list_head *head) if (!dev->rtnl_link_ops || dev->rtnl_link_state == RTNL_LINK_INITIALIZED) skb = rtmsg_ifinfo_build_skb(RTM_DELLINK, dev, ~0U, 0, - GFP_KERNEL, NULL, 0, 0, 0); + GFP_KERNEL, NULL, 0, + portid, nlmsg_seq(nlh)); /* * Flush the unicast and multicast chains @@ -10865,7 +10860,7 @@ void unregister_netdevice_many(struct list_head *head) dev->netdev_ops->ndo_uninit(dev); if (skb) - rtmsg_ifinfo_send(skb, dev, GFP_KERNEL, 0, NULL); + rtmsg_ifinfo_send(skb, dev, GFP_KERNEL, portid, nlh); /* Notifier chain MUST detach us all upper devices. */ WARN_ON(netdev_has_any_upper_dev(dev)); @@ -10888,6 +10883,18 @@ void unregister_netdevice_many(struct list_head *head) list_del(head); } + +/** + * unregister_netdevice_many - unregister many devices + * @head: list of devices + * + * Note: As most callers use a stack allocated list_head, + * we force a list_del() to make sure stack wont be corrupted later. + */ +void unregister_netdevice_many(struct list_head *head) +{ + unregister_netdevice_many_notify(head, 0, NULL); +} EXPORT_SYMBOL(unregister_netdevice_many); /** -- cgit From 02a68a47eadedf95748facfca6ced31fb0181d52 Mon Sep 17 00:00:00 2001 From: Jiri Pirko Date: Wed, 2 Nov 2022 17:02:03 +0100 Subject: net: devlink: track netdev with devlink_port assigned Currently, ethernet drivers are using devlink_port_type_eth_set() and devlink_port_type_clear() to set devlink port type and link to related netdev. Instead of calling them directly, let the driver use SET_NETDEV_DEVLINK_PORT macro to assign devlink_port pointer and let devlink to track it. Note the devlink port pointer is static during the time netdevice is registered. In devlink code, use per-namespace netdev notifier to track the netdevices with devlink_port assigned and change the internal devlink_port type and related type pointer accordingly. Signed-off-by: Jiri Pirko Signed-off-by: Jakub Kicinski --- net/core/dev.c | 14 +++++++++----- 1 file changed, 9 insertions(+), 5 deletions(-) (limited to 'net/core/dev.c') diff --git a/net/core/dev.c b/net/core/dev.c index 2e4f1c97b59e..3bacee3bee78 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -1621,10 +1621,10 @@ const char *netdev_cmd_to_name(enum netdev_cmd cmd) N(UP) N(DOWN) N(REBOOT) N(CHANGE) N(REGISTER) N(UNREGISTER) N(CHANGEMTU) N(CHANGEADDR) N(GOING_DOWN) N(CHANGENAME) N(FEAT_CHANGE) N(BONDING_FAILOVER) N(PRE_UP) N(PRE_TYPE_CHANGE) N(POST_TYPE_CHANGE) - N(POST_INIT) N(RELEASE) N(NOTIFY_PEERS) N(JOIN) N(CHANGEUPPER) - N(RESEND_IGMP) N(PRECHANGEMTU) N(CHANGEINFODATA) N(BONDING_INFO) - N(PRECHANGEUPPER) N(CHANGELOWERSTATE) N(UDP_TUNNEL_PUSH_INFO) - N(UDP_TUNNEL_DROP_INFO) N(CHANGE_TX_QUEUE_LEN) + N(POST_INIT) N(PRE_UNINIT) N(RELEASE) N(NOTIFY_PEERS) N(JOIN) + N(CHANGEUPPER) N(RESEND_IGMP) N(PRECHANGEMTU) N(CHANGEINFODATA) + N(BONDING_INFO) N(PRECHANGEUPPER) N(CHANGELOWERSTATE) + N(UDP_TUNNEL_PUSH_INFO) N(UDP_TUNNEL_DROP_INFO) N(CHANGE_TX_QUEUE_LEN) N(CVLAN_FILTER_PUSH_INFO) N(CVLAN_FILTER_DROP_INFO) N(SVLAN_FILTER_PUSH_INFO) N(SVLAN_FILTER_DROP_INFO) N(PRE_CHANGEADDR) N(OFFLOAD_XSTATS_ENABLE) N(OFFLOAD_XSTATS_DISABLE) @@ -10060,7 +10060,7 @@ int register_netdevice(struct net_device *dev) dev->reg_state = ret ? NETREG_UNREGISTERED : NETREG_REGISTERED; write_unlock(&dev_base_lock); if (ret) - goto err_uninit; + goto err_uninit_notify; __netdev_update_features(dev); @@ -10107,6 +10107,8 @@ int register_netdevice(struct net_device *dev) out: return ret; +err_uninit_notify: + call_netdevice_notifiers(NETDEV_PRE_UNINIT, dev); err_uninit: if (dev->netdev_ops->ndo_uninit) dev->netdev_ops->ndo_uninit(dev); @@ -10856,6 +10858,8 @@ void unregister_netdevice_many_notify(struct list_head *head, netdev_name_node_alt_flush(dev); netdev_name_node_free(dev->name_node); + call_netdevice_notifiers(NETDEV_PRE_UNINIT, dev); + if (dev->netdev_ops->ndo_uninit) dev->netdev_ops->ndo_uninit(dev); -- cgit From bd039b5ea2a91ea707ee8539df26456bd5be80af Mon Sep 17 00:00:00 2001 From: Andy Ren Date: Mon, 7 Nov 2022 09:42:42 -0800 Subject: net/core: Allow live renaming when an interface is up Allow a network interface to be renamed when the interface is up. As described in the netconsole documentation [1], when netconsole is used as a built-in, it will bring up the specified interface as soon as possible. As a result, user space will not be able to rename the interface since the kernel disallows renaming of interfaces that are administratively up unless the 'IFF_LIVE_RENAME_OK' private flag was set by the kernel. The original solution [2] to this problem was to add a new parameter to the netconsole configuration parameters that allows renaming of the interface used by netconsole while it is administratively up. However, during the discussion that followed, it became apparent that we have no reason to keep the current restriction and instead we should allow user space to rename interfaces regardless of their administrative state: 1. The restriction was put in place over 20 years ago when renaming was only possible via IOCTL and before rtnetlink started notifying user space about such changes like it does today. 2. The 'IFF_LIVE_RENAME_OK' flag was added over 3 years ago in version 5.2 and no regressions were reported. 3. In-kernel listeners to 'NETDEV_CHANGENAME' do not seem to care about the administrative state of interface. Therefore, allow user space to rename running interfaces by removing the restriction and the associated 'IFF_LIVE_RENAME_OK' flag. Help in possible triage by emitting a message to the kernel log that an interface was renamed while UP. [1] https://www.kernel.org/doc/Documentation/networking/netconsole.rst [2] https://lore.kernel.org/netdev/20221102002420.2613004-1-andy.ren@getcruise.com/ Signed-off-by: Andy Ren Reviewed-by: Ido Schimmel Reviewed-by: David Ahern Signed-off-by: David S. Miller --- net/core/dev.c | 19 ++----------------- 1 file changed, 2 insertions(+), 17 deletions(-) (limited to 'net/core/dev.c') diff --git a/net/core/dev.c b/net/core/dev.c index 3bacee3bee78..707de6b841d0 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -1163,22 +1163,6 @@ int dev_change_name(struct net_device *dev, const char *newname) net = dev_net(dev); - /* Some auto-enslaved devices e.g. failover slaves are - * special, as userspace might rename the device after - * the interface had been brought up and running since - * the point kernel initiated auto-enslavement. Allow - * live name change even when these slave devices are - * up and running. - * - * Typically, users of these auto-enslaving devices - * don't actually care about slave name change, as - * they are supposed to operate on master interface - * directly. - */ - if (dev->flags & IFF_UP && - likely(!(dev->priv_flags & IFF_LIVE_RENAME_OK))) - return -EBUSY; - down_write(&devnet_rename_sem); if (strncmp(newname, dev->name, IFNAMSIZ) == 0) { @@ -1195,7 +1179,8 @@ int dev_change_name(struct net_device *dev, const char *newname) } if (oldname[0] && !strchr(oldname, '%')) - netdev_info(dev, "renamed from %s\n", oldname); + netdev_info(dev, "renamed from %s%s\n", oldname, + dev->flags & IFF_UP ? " (while UP)" : ""); old_assign_type = dev->name_assign_type; dev->name_assign_type = NET_NAME_RENAMED; -- cgit From 3e52fba03a20234abc65a656cef063a1045d9723 Mon Sep 17 00:00:00 2001 From: Jiri Pirko Date: Tue, 8 Nov 2022 14:22:06 +0100 Subject: net: introduce a helper to move notifier block to different namespace Currently, net_dev() netdev notifier variant follows the netdev with per-net notifier from namespace to namespace. This is implemented by move_netdevice_notifiers_dev_net() helper. For devlink it is needed to re-register per-net notifier during devlink reload. Introduce a new helper called move_netdevice_notifier_net() and share the unregister/register code with existing move_netdevice_notifiers_dev_net() helper. Signed-off-by: Jiri Pirko Reviewed-by: Ido Schimmel Signed-off-by: Jakub Kicinski --- net/core/dev.c | 22 ++++++++++++++++++---- 1 file changed, 18 insertions(+), 4 deletions(-) (limited to 'net/core/dev.c') diff --git a/net/core/dev.c b/net/core/dev.c index 707de6b841d0..117e830cabb0 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -1861,6 +1861,22 @@ int unregister_netdevice_notifier_net(struct net *net, } EXPORT_SYMBOL(unregister_netdevice_notifier_net); +static void __move_netdevice_notifier_net(struct net *src_net, + struct net *dst_net, + struct notifier_block *nb) +{ + __unregister_netdevice_notifier_net(src_net, nb); + __register_netdevice_notifier_net(dst_net, nb, true); +} + +void move_netdevice_notifier_net(struct net *src_net, struct net *dst_net, + struct notifier_block *nb) +{ + rtnl_lock(); + __move_netdevice_notifier_net(src_net, dst_net, nb); + rtnl_unlock(); +} + int register_netdevice_notifier_dev_net(struct net_device *dev, struct notifier_block *nb, struct netdev_net_notifier *nn) @@ -1897,10 +1913,8 @@ static void move_netdevice_notifiers_dev_net(struct net_device *dev, { struct netdev_net_notifier *nn; - list_for_each_entry(nn, &dev->net_notifier_list, list) { - __unregister_netdevice_notifier_net(dev_net(dev), nn->nb); - __register_netdevice_notifier_net(net, nn->nb, true); - } + list_for_each_entry(nn, &dev->net_notifier_list, list) + __move_netdevice_notifier_net(dev_net(dev), net, nn->nb); } /** -- cgit From 6af645a5b2da7898b2c37e8f39f1bc572c4ab85a Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Tue, 15 Nov 2022 09:10:58 +0000 Subject: net: net_{enable|disable}_timestamp() optimizations Adopting atomic_try_cmpxchg() makes the code cleaner. Signed-off-by: Eric Dumazet Signed-off-by: David S. Miller --- net/core/dev.c | 18 ++++++------------ 1 file changed, 6 insertions(+), 12 deletions(-) (limited to 'net/core/dev.c') diff --git a/net/core/dev.c b/net/core/dev.c index 117e830cabb0..10b56648a9d4 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -2073,13 +2073,10 @@ static DECLARE_WORK(netstamp_work, netstamp_clear); void net_enable_timestamp(void) { #ifdef CONFIG_JUMP_LABEL - int wanted; + int wanted = atomic_read(&netstamp_wanted); - while (1) { - wanted = atomic_read(&netstamp_wanted); - if (wanted <= 0) - break; - if (atomic_cmpxchg(&netstamp_wanted, wanted, wanted + 1) == wanted) + while (wanted > 0) { + if (atomic_try_cmpxchg(&netstamp_wanted, &wanted, wanted + 1)) return; } atomic_inc(&netstamp_needed_deferred); @@ -2093,13 +2090,10 @@ EXPORT_SYMBOL(net_enable_timestamp); void net_disable_timestamp(void) { #ifdef CONFIG_JUMP_LABEL - int wanted; + int wanted = atomic_read(&netstamp_wanted); - while (1) { - wanted = atomic_read(&netstamp_wanted); - if (wanted <= 1) - break; - if (atomic_cmpxchg(&netstamp_wanted, wanted, wanted - 1) == wanted) + while (wanted > 1) { + if (atomic_try_cmpxchg(&netstamp_wanted, &wanted, wanted - 1)) return; } atomic_dec(&netstamp_needed_deferred); -- cgit From 1462160c7455cfc495693e47d0eabb371ec4f7ff Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Tue, 15 Nov 2022 09:10:59 +0000 Subject: net: adopt try_cmpxchg() in napi_schedule_prep() and napi_complete_done() This makes the code slightly more efficient. Signed-off-by: Eric Dumazet Signed-off-by: David S. Miller --- net/core/dev.c | 10 ++++------ 1 file changed, 4 insertions(+), 6 deletions(-) (limited to 'net/core/dev.c') diff --git a/net/core/dev.c b/net/core/dev.c index 10b56648a9d4..0c12c7ad04f2 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -5979,10 +5979,9 @@ EXPORT_SYMBOL(__napi_schedule); */ bool napi_schedule_prep(struct napi_struct *n) { - unsigned long val, new; + unsigned long new, val = READ_ONCE(n->state); do { - val = READ_ONCE(n->state); if (unlikely(val & NAPIF_STATE_DISABLE)) return false; new = val | NAPIF_STATE_SCHED; @@ -5995,7 +5994,7 @@ bool napi_schedule_prep(struct napi_struct *n) */ new |= (val & NAPIF_STATE_SCHED) / NAPIF_STATE_SCHED * NAPIF_STATE_MISSED; - } while (cmpxchg(&n->state, val, new) != val); + } while (!try_cmpxchg(&n->state, &val, new)); return !(val & NAPIF_STATE_SCHED); } @@ -6063,9 +6062,8 @@ bool napi_complete_done(struct napi_struct *n, int work_done) local_irq_restore(flags); } + val = READ_ONCE(n->state); do { - val = READ_ONCE(n->state); - WARN_ON_ONCE(!(val & NAPIF_STATE_SCHED)); new = val & ~(NAPIF_STATE_MISSED | NAPIF_STATE_SCHED | @@ -6078,7 +6076,7 @@ bool napi_complete_done(struct napi_struct *n, int work_done) */ new |= (val & NAPIF_STATE_MISSED) / NAPIF_STATE_MISSED * NAPIF_STATE_SCHED; - } while (cmpxchg(&n->state, val, new) != val); + } while (!try_cmpxchg(&n->state, &val, new)); if (unlikely(val & NAPIF_STATE_MISSED)) { __napi_schedule(n); -- cgit From 4ffa1d1c6842a97e84cfbe56bfcf70edb23608e2 Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Tue, 15 Nov 2022 09:11:00 +0000 Subject: net: adopt try_cmpxchg() in napi_{enable|disable}() This makes code a bit cleaner. Signed-off-by: Eric Dumazet Signed-off-by: David S. Miller --- net/core/dev.c | 14 +++++--------- 1 file changed, 5 insertions(+), 9 deletions(-) (limited to 'net/core/dev.c') diff --git a/net/core/dev.c b/net/core/dev.c index 0c12c7ad04f2..fb943dad9651 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -6397,8 +6397,8 @@ void napi_disable(struct napi_struct *n) might_sleep(); set_bit(NAPI_STATE_DISABLE, &n->state); - for ( ; ; ) { - val = READ_ONCE(n->state); + val = READ_ONCE(n->state); + do { if (val & (NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC)) { usleep_range(20, 200); continue; @@ -6406,10 +6406,7 @@ void napi_disable(struct napi_struct *n) new = val | NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC; new &= ~(NAPIF_STATE_THREADED | NAPIF_STATE_PREFER_BUSY_POLL); - - if (cmpxchg(&n->state, val, new) == val) - break; - } + } while (!try_cmpxchg(&n->state, &val, new)); hrtimer_cancel(&n->timer); @@ -6426,16 +6423,15 @@ EXPORT_SYMBOL(napi_disable); */ void napi_enable(struct napi_struct *n) { - unsigned long val, new; + unsigned long new, val = READ_ONCE(n->state); do { - val = READ_ONCE(n->state); BUG_ON(!test_bit(NAPI_STATE_SCHED, &val)); new = val & ~(NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC); if (n->dev->threaded && n->thread) new |= NAPIF_STATE_THREADED; - } while (cmpxchg(&n->state, val, new) != val); + } while (!try_cmpxchg(&n->state, &val, new)); } EXPORT_SYMBOL(napi_enable); -- cgit From 6c1c5097781f563b70a81683ea6fdac21637573b Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Tue, 15 Nov 2022 08:53:55 +0000 Subject: net: add atomic_long_t to net_device_stats fields Long standing KCSAN issues are caused by data-race around some dev->stats changes. Most performance critical paths already use per-cpu variables, or per-queue ones. It is reasonable (and more correct) to use atomic operations for the slow paths. This patch adds an union for each field of net_device_stats, so that we can convert paths that are not yet protected by a spinlock or a mutex. netdev_stats_to_stats64() no longer has an #if BITS_PER_LONG==64 Note that the memcpy() we were using on 64bit arches had no provision to avoid load-tearing, while atomic_long_read() is providing the needed protection at no cost. Signed-off-by: Eric Dumazet Signed-off-by: David S. Miller --- net/core/dev.c | 14 +++----------- 1 file changed, 3 insertions(+), 11 deletions(-) (limited to 'net/core/dev.c') diff --git a/net/core/dev.c b/net/core/dev.c index fb943dad9651..d0fb4af9a126 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -10369,24 +10369,16 @@ void netdev_run_todo(void) void netdev_stats_to_stats64(struct rtnl_link_stats64 *stats64, const struct net_device_stats *netdev_stats) { -#if BITS_PER_LONG == 64 - BUILD_BUG_ON(sizeof(*stats64) < sizeof(*netdev_stats)); - memcpy(stats64, netdev_stats, sizeof(*netdev_stats)); - /* zero out counters that only exist in rtnl_link_stats64 */ - memset((char *)stats64 + sizeof(*netdev_stats), 0, - sizeof(*stats64) - sizeof(*netdev_stats)); -#else - size_t i, n = sizeof(*netdev_stats) / sizeof(unsigned long); - const unsigned long *src = (const unsigned long *)netdev_stats; + size_t i, n = sizeof(*netdev_stats) / sizeof(atomic_long_t); + const atomic_long_t *src = (atomic_long_t *)netdev_stats; u64 *dst = (u64 *)stats64; BUILD_BUG_ON(n > sizeof(*stats64) / sizeof(u64)); for (i = 0; i < n; i++) - dst[i] = src[i]; + dst[i] = atomic_long_read(&src[i]); /* zero out counters that only exist in rtnl_link_stats64 */ memset((char *)stats64 + n * sizeof(u64), 0, sizeof(*stats64) - n * sizeof(u64)); -#endif } EXPORT_SYMBOL(netdev_stats_to_stats64); -- cgit From fd896e38e5df2c5b68c78eee2fc425c4dcd3b4dd Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Thu, 17 Nov 2022 09:26:41 +0000 Subject: net: fix napi_disable() logic error Dan reported a new warning after my recent patch: New smatch warnings: net/core/dev.c:6409 napi_disable() error: uninitialized symbol 'new'. Indeed, we must first wait for STATE_SCHED and STATE_NPSVC to be cleared, to make sure @new variable has been initialized properly. Fixes: 4ffa1d1c6842 ("net: adopt try_cmpxchg() in napi_{enable|disable}()") Reported-by: kernel test robot Reported-by: Dan Carpenter Signed-off-by: Eric Dumazet Signed-off-by: David S. Miller --- net/core/dev.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'net/core/dev.c') diff --git a/net/core/dev.c b/net/core/dev.c index d0fb4af9a126..7627c475d991 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -6399,9 +6399,9 @@ void napi_disable(struct napi_struct *n) val = READ_ONCE(n->state); do { - if (val & (NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC)) { + while (val & (NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC)) { usleep_range(20, 200); - continue; + val = READ_ONCE(n->state); } new = val | NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC; -- cgit From d93607082e982223cf92750f2d9039ff365b9d24 Mon Sep 17 00:00:00 2001 From: Heiner Kallweit Date: Wed, 30 Nov 2022 23:28:26 +0100 Subject: net: add netdev_sw_irq_coalesce_default_on() Add a helper for drivers wanting to set SW IRQ coalescing by default. The related sysfs attributes can be used to override the default values. Follow Jakub's suggestion and put this functionality into net core so that drivers wanting to use software interrupt coalescing per default don't have to open-code it. Note that this function needs to be called before the netdevice is registered. Suggested-by: Jakub Kicinski Signed-off-by: Heiner Kallweit Signed-off-by: David S. Miller --- net/core/dev.c | 16 ++++++++++++++++ 1 file changed, 16 insertions(+) (limited to 'net/core/dev.c') diff --git a/net/core/dev.c b/net/core/dev.c index 7627c475d991..b76fb37b381e 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -10517,6 +10517,22 @@ void netdev_set_default_ethtool_ops(struct net_device *dev, } EXPORT_SYMBOL_GPL(netdev_set_default_ethtool_ops); +/** + * netdev_sw_irq_coalesce_default_on() - enable SW IRQ coalescing by default + * @dev: netdev to enable the IRQ coalescing on + * + * Sets a conservative default for SW IRQ coalescing. Users can use + * sysfs attributes to override the default values. + */ +void netdev_sw_irq_coalesce_default_on(struct net_device *dev) +{ + WARN_ON(dev->reg_state == NETREG_REGISTERED); + + dev->gro_flush_timeout = 20000; + dev->napi_defer_hard_irqs = 1; +} +EXPORT_SYMBOL_GPL(netdev_sw_irq_coalesce_default_on); + void netdev_freemem(struct net_device *dev) { char *addr = (char *)dev - dev->padded; -- cgit