summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2021-01-25svcrdma: Convert rdma_stat_sq_starve to a per-CPU counterChuck Lever
Avoid the overhead of a memory bus lock cycle for counting a value that is hardly every used. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-01-25svcrdma: Convert rdma_stat_recv to a per-CPU counterChuck Lever
Receives are frequent events. Avoid the overhead of a memory bus lock cycle for counting a value that is hardly every used. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-01-25svcrdma: Refactor svc_rdma_init() and svc_rdma_clean_up()Chuck Lever
Setting up the proc variables is about to get more complicated. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-01-25Merge 5.11-rc5 into tty-nextGreg Kroah-Hartman
We need the fixes in here and this resolves a merge issue in drivers/tty/tty_io.c Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-24fs: make helpers idmap mount awareChristian Brauner
Extend some inode methods with an additional user namespace argument. A filesystem that is aware of idmapped mounts will receive the user namespace the mount has been marked with. This can be used for additional permission checking and also to enable filesystems to translate between uids and gids if they need to. We have implemented all relevant helpers in earlier patches. As requested we simply extend the exisiting inode method instead of introducing new ones. This is a little more code churn but it's mostly mechanical and doesnt't leave us with additional inode methods. Link: https://lore.kernel.org/r/20210121131959.646623-25-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24af_unix: handle idmapped mountsChristian Brauner
When binding a non-abstract AF_UNIX socket it will gain a representation in the filesystem. Enable the socket infrastructure to handle idmapped mounts by passing down the user namespace of the mount the socket will be created from. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-18-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: James Morris <jamorris@linux.microsoft.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24namei: prepare for idmapped mountsChristian Brauner
The various vfs_*() helpers are called by filesystems or by the vfs itself to perform core operations such as create, link, mkdir, mknod, rename, rmdir, tmpfile and unlink. Enable them to handle idmapped mounts. If the inode is accessed through an idmapped mount map it into the mount's user namespace and pass it down. Afterwards the checks and operations are identical to non-idmapped mounts. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-15-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24acl: handle idmapped mountsChristian Brauner
The posix acl permission checking helpers determine whether a caller is privileged over an inode according to the acls associated with the inode. Add helpers that make it possible to handle acls on idmapped mounts. The vfs and the filesystems targeted by this first iteration make use of posix_acl_fix_xattr_from_user() and posix_acl_fix_xattr_to_user() to translate basic posix access and default permissions such as the ACL_USER and ACL_GROUP type according to the initial user namespace (or the superblock's user namespace) to and from the caller's current user namespace. Adapt these two helpers to handle idmapped mounts whereby we either map from or into the mount's user namespace depending on in which direction we're translating. Similarly, cap_convert_nscap() is used by the vfs to translate user namespace and non-user namespace aware filesystem capabilities from the superblock's user namespace to the caller's user namespace. Enable it to handle idmapped mounts by accounting for the mount's user namespace. In addition the fileystems targeted in the first iteration of this patch series make use of the posix_acl_chmod() and, posix_acl_update_mode() helpers. Both helpers perform permission checks on the target inode. Let them handle idmapped mounts. These two helpers are called when posix acls are set by the respective filesystems to handle this case we extend the ->set() method to take an additional user namespace argument to pass the mount's user namespace down. Link: https://lore.kernel.org/r/20210121131959.646623-9-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24fs: add file and path permissions helpersChristian Brauner
Add two simple helpers to check permissions on a file and path respectively and convert over some callers. It simplifies quite a few codepaths and also reduces the churn in later patches quite a bit. Christoph also correctly points out that this makes codepaths (e.g. ioctls) way easier to follow that would otherwise have to do more complex argument passing than necessary. Link: https://lore.kernel.org/r/20210121131959.646623-4-christian.brauner@ubuntu.com Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Suggested-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: James Morris <jamorris@linux.microsoft.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-23tcp: fix TLP timer not set when CA_STATE changes from DISORDER to OPENPengcheng Yang
Upon receiving a cumulative ACK that changes the congestion state from Disorder to Open, the TLP timer is not set. If the sender is app-limited, it can only wait for the RTO timer to expire and retransmit. The reason for this is that the TLP timer is set before the congestion state changes in tcp_ack(), so we delay the time point of calling tcp_set_xmit_timer() until after tcp_fastretrans_alert() returns and remove the FLAG_SET_XMIT_TIMER from ack_flag when the RACK reorder timer is set. This commit has two additional benefits: 1) Make sure to reset RTO according to RFC6298 when receiving ACK, to avoid spurious RTO caused by RTO timer early expires. 2) Reduce the xmit timer reschedule once per ACK when the RACK reorder timer is set. Fixes: df92c8394e6e ("tcp: fix xmit timer to only be reset if data ACKed/SACKed") Link: https://lore.kernel.org/netdev/1611311242-6675-1-git-send-email-yangpc@wangsu.com Signed-off-by: Pengcheng Yang <yangpc@wangsu.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Cc: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/1611464834-23030-1-git-send-email-yangpc@wangsu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-23udp: allow forwarding of plain (non-fraglisted) UDP GRO packetsAlexander Lobakin
Commit 9fd1ff5d2ac7 ("udp: Support UDP fraglist GRO/GSO.") actually not only added a support for fraglisted UDP GRO, but also tweaked some logics the way that non-fraglisted UDP GRO started to work for forwarding too. Commit 2e4ef10f5850 ("net: add GSO UDP L4 and GSO fraglists to the list of software-backed types") added GSO UDP L4 to the list of software GSO to allow virtual netdevs to forward them as is up to the real drivers. Tests showed that currently forwarding and NATing of plain UDP GRO packets are performed fully correctly, regardless if the target netdevice has a support for hardware/driver GSO UDP L4 or not. Add the last element and allow to form plain UDP GRO packets if we are on forwarding path, and the new NETIF_F_GRO_UDP_FWD is enabled on a receiving netdevice. If both NETIF_F_GRO_FRAGLIST and NETIF_F_GRO_UDP_FWD are set, fraglisted GRO takes precedence. This keeps the current behaviour and is generally more optimal for now, as the number of NICs with hardware USO offload is relatively small. Signed-off-by: Alexander Lobakin <alobakin@pm.me> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-23net: introduce a netdev feature for UDP GRO forwardingAlexander Lobakin
Introduce a new netdev feature, NETIF_F_GRO_UDP_FWD, to allow user to turn UDP GRO on and off for forwarding. Defaults to off to not change current datapath. Suggested-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Alexander Lobakin <alobakin@pm.me> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-23tcp: make TCP_USER_TIMEOUT accurate for zero window probesEnke Chen
The TCP_USER_TIMEOUT is checked by the 0-window probe timer. As the timer has backoff with a max interval of about two minutes, the actual timeout for TCP_USER_TIMEOUT can be off by up to two minutes. In this patch the TCP_USER_TIMEOUT is made more accurate by taking it into account when computing the timer value for the 0-window probes. This patch is similar to and builds on top of the one that made TCP_USER_TIMEOUT accurate for RTOs in commit b701a99e431d ("tcp: Add tcp_clamp_rto_to_user_timeout() helper to improve accuracy"). Fixes: 9721e709fa68 ("tcp: simplify window probe aborting on USER_TIMEOUT") Signed-off-by: Enke Chen <enchen@paloaltonetworks.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20210122191306.GA99540@localhost.localdomain Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-23NFC: fix resource leak when target index is invalidPan Bian
Goto to the label put_dev instead of the label error to fix potential resource leak on path that the target index is invalid. Fixes: c4fbb6515a4d ("NFC: The core part should generate the target index") Signed-off-by: Pan Bian <bianpan2016@163.com> Link: https://lore.kernel.org/r/20210121152748.98409-1-bianpan2016@163.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-23NFC: fix possible resource leakPan Bian
Put the device to avoid resource leak on path that the polling flag is invalid. Fixes: a831b9132065 ("NFC: Do not return EBUSY when stopping a poll that's already stopped") Signed-off-by: Pan Bian <bianpan2016@163.com> Link: https://lore.kernel.org/r/20210121153745.122184-1-bianpan2016@163.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-23net: mrp: move struct definitions out of uapiRasmus Villemoes
None of these are actually used in the kernel/userspace interface - there's a userspace component of implementing MRP, and userspace will need to construct certain frames to put on the wire, but there's no reason the kernel should provide the relevant definitions in a UAPI header. In fact, some of those definitions were broken until previous commit, so only keep the few that are actually referenced in the kernel code, and move them to the br_private_mrp.h header. Signed-off-by: Rasmus Villemoes <rasmus.villemoes@prevas.dk> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-22mlxsw: Register physical ports as a devlink resourceDanielle Ratson
The switch ASIC has a limited capacity of physical ('flavour physical' in devlink terminology) ports that it can support. While each system is brought up with a different number of ports, this number can be increased via splitting up to the ASIC's limit. Expose physical ports as a devlink resource so that user space will have visibility to the maximum number of ports that can be supported and the current occupancy. In addition, add a "Generic Resources" section in devlink-resource documentation so the different drivers will be aligned by the same resource name when exposing to user space. Signed-off-by: Danielle Ratson <danieller@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-22sch_htb: Stats for offloaded HTBMaxim Mikityanskiy
This commit adds support for statistics of offloaded HTB. Bytes and packets counters for leaf and inner nodes are supported, the values are taken from per-queue qdiscs, and the numbers that the user sees should have the same behavior as the software (non-offloaded) HTB. Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-22sch_htb: Hierarchical QoS hardware offloadMaxim Mikityanskiy
HTB doesn't scale well because of contention on a single lock, and it also consumes CPU. This patch adds support for offloading HTB to hardware that supports hierarchical rate limiting. In the offload mode, HTB passes control commands to the driver using ndo_setup_tc. The driver has to replicate the whole hierarchy of classes and their settings (rate, ceil) in the NIC. Every modification of the HTB tree caused by the admin results in ndo_setup_tc being called. After this setup, the HTB algorithm is done completely in the NIC. An SQ (send queue) is created for every leaf class and attached to the hierarchy, so that the NIC can calculate and obey aggregated rate limits, too. In the future, it can be changed, so that multiple SQs will back a single leaf class. ndo_select_queue is responsible for selecting the right queue that serves the traffic class of each packet. The data path works as follows: a packet is classified by clsact, the driver selects a hardware queue according to its class, and the packet is enqueued into this queue's qdisc. This solution addresses two main problems of scaling HTB: 1. Contention by flow classification. Currently the filters are attached to the HTB instance as follows: # tc filter add dev eth0 parent 1:0 protocol ip flower dst_port 80 classid 1:10 It's possible to move classification to clsact egress hook, which is thread-safe and lock-free: # tc filter add dev eth0 egress protocol ip flower dst_port 80 action skbedit priority 1:10 This way classification still happens in software, but the lock contention is eliminated, and it happens before selecting the TX queue, allowing the driver to translate the class to the corresponding hardware queue in ndo_select_queue. Note that this is already compatible with non-offloaded HTB and doesn't require changes to the kernel nor iproute2. 2. Contention by handling packets. HTB is not multi-queue, it attaches to a whole net device, and handling of all packets takes the same lock. When HTB is offloaded, it registers itself as a multi-queue qdisc, similarly to mq: HTB is attached to the netdev, and each queue has its own qdisc. Some features of HTB may be not supported by some particular hardware, for example, the maximum number of classes may be limited, the granularity of rate and ceil parameters may be different, etc. - so, the offload is not enabled by default, a new parameter is used to enable it: # tc qdisc replace dev eth0 root handle 1: htb offload Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-22net: sched: Add extack to Qdisc_class_ops.deleteMaxim Mikityanskiy
In a following commit, sch_htb will start using extack in the delete class operation to pass hardware errors in offload mode. This commit prepares for that by adding the extack parameter to this callback and converting usage of the existing qdiscs. Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-22tcp: Add receive timestamp support for receive zerocopy.Arjun Roy
tcp_recvmsg() uses the CMSG mechanism to receive control information like packet receive timestamps. This patch adds CMSG fields to struct tcp_zerocopy_receive, and provides receive timestamps if available to the user. Signed-off-by: Arjun Roy <arjunroy@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-22tcp: Remove CMSG magic numbers for tcp_recvmsg().Arjun Roy
At present, tcp_recvmsg() uses flags to track if any CMSGs are pending and what those CMSGs are. These flags are currently magic numbers, used only within tcp_recvmsg(). To prepare for receive timestamp support in tcp receive zerocopy, gently refactor these magic numbers into enums. Signed-off-by: Arjun Roy <arjunroy@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-22net: bridge: multicast: mark IGMPv3/MLDv2 fast-leave deletesNikolay Aleksandrov
Mark groups which were deleted due to fast leave/EHT. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-22net: bridge: multicast: handle block pg delete for all casesNikolay Aleksandrov
A block report can result in empty source and host sets for both include and exclude groups so if there are no hosts left we can safely remove the group. Pull the block group handling so it can cover both cases and add a check if EHT requires the delete. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-22net: bridge: multicast: add EHT host filter_mode handlingNikolay Aleksandrov
We should be able to handle host filter mode changing. For exclude mode we must create a zero-src entry so the group will be kept even without any S,G entries (non-zero source sets). That entry doesn't count to the entry limit and can always be created, its timer is refreshed on new exclude reports and if we change the host filter mode to include then it gets removed and we rely only on the non-zero source sets. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-22net: bridge: multicast: optimize TO_INCLUDE EHT timeoutsNikolay Aleksandrov
This is an optimization specifically for TO_INCLUDE which sends queries for the older entries and thus lowers the S,G timers to LMQT. If we have the following situation for a group in either include or exclude mode: - host A was interested in srcs X and Y, but is timing out - host B sends TO_INCLUDE src Z, the bridge lowers X and Y's timeouts to LMQT - host B sends BLOCK src Z after LMQT time has passed => since host B is the last host we can delete the group, but if we still have host A's EHT entries for X and Y (i.e. if they weren't lowered to LMQT previously) then we'll have to wait another LMQT time before deleting the group, with this optimization we can directly remove it regardless of the group mode as there are no more interested hosts Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-22net: bridge: multicast: add EHT include and exclude handlingNikolay Aleksandrov
Add support for IGMPv3/MLDv2 include and exclude EHT handling. Similar to how the reports are processed we have 2 cases when the group is in include or exclude mode, these are processed as follows: - group include - is_include: create missing entries - to_include: flush existing entries and create a new set from the report, obviously if the src set is empty then we delete the group - group exclude - is_exclude: create missing entries - to_exclude: flush existing entries and create a new set from the report, any empty source set entries are removed If the group is in a different mode then we just flush all entries reported by the host and we create a new set with the new mode entries created from the report. If the report is include type, the source list is empty and the group has empty sources' set then we remove it. Any source set entries which are empty are removed as well. If the group is in exclude mode it can exist without any S,G entries (allowing for all traffic to pass). Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-22net: bridge: multicast: add EHT allow/block handlingNikolay Aleksandrov
Add support for IGMPv3/MLDv2 allow/block EHT handling. Similar to how the reports are processed we have 2 cases when the group is in include or exclude mode, these are processed as follows: - group include - allow: create missing entries - block: remove existing matching entries and remove the corresponding S,G entries if there are no more set host entries, then possibly delete the whole group if there are no more S,G entries - group exclude - allow - host include: create missing entries - host exclude: remove existing matching entries and remove the corresponding S,G entries if there are no more set host entries - block - host include: remove existing matching entries and remove the corresponding S,G entries if there are no more set host entries, then possibly delete the whole group if there are no more S,G entries - host exclude: create missing entries Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-22net: bridge: multicast: add EHT host delete functionNikolay Aleksandrov
Now that we can delete set entries, we can use that to remove EHT hosts. Since the group's host set entries exist only when there are related source set entries we just have to flush all source set entries joined by the host set entry and it will be automatically removed. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-22net: bridge: multicast: add EHT source set handling functionsNikolay Aleksandrov
Add EHT source set and set-entry create, delete and lookup functions. These allow to manipulate source sets which contain their own host sets with entries which joined that S,G. We're limiting the maximum number of tracked S,G entries per host to PG_SRC_ENT_LIMIT (currently 32) which is the current maximum of S,G entries for a group. There's a per-set timer which will be used to destroy the whole set later. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-22net: bridge: multicast: add EHT host handling functionsNikolay Aleksandrov
Add functions to create, destroy and lookup an EHT host. These are per-host entries contained in the eht_host_tree in net_bridge_port_group which are used to store a list of all sources (S,G) entries joined for that group by each host, the host's current filter mode and total number of joined entries. No functional changes yet, these would be used in later patches. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-22net: bridge: multicast: add EHT structures and definitionsNikolay Aleksandrov
Add EHT structures for tracking hosts and sources per group. We keep one set for each host which has all of the host's S,G entries, and one set for each multicast source which has all hosts that have joined that S,G. For each host, source entry we record the filter_mode and we keep an expiry timer. There is also one global expiry timer per source set, it is updated with each set entry update, it will be later used to lower the set's timer instead of lowering each entry's timer separately. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-22net: bridge: multicast: calculate idx position without changing ptrNikolay Aleksandrov
We need to preserve the srcs pointer since we'll be passing it for EHT handling later. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-22net: bridge: multicast: __grp_src_block_incl can modify pgNikolay Aleksandrov
Prepare __grp_src_block_incl() for being able to cause a notification due to changes. Currently it cannot happen, but EHT would change that since we'll be deleting sources immediately. Make sure that if the pg is deleted we don't return true as that would cause the caller to access freed pg. This patch shouldn't cause any functional change. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-22net: bridge: multicast: pass host src address to IGMPv3/MLDv2 functionsNikolay Aleksandrov
We need to pass the host address so later it can be used for explicit host tracking. No functional change. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-22net: bridge: multicast: rename src_size to addr_sizeNikolay Aleksandrov
Rename src_size argument to addr_size in preparation for passing host address as an argument to IGMPv3/MLDv2 functions. No functional change. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-22mptcp: implement delegated actionsPaolo Abeni
On MPTCP-level ack reception, the packet scheduler may select a subflow other then the current one. Prior to this commit we rely on the workqueue to trigger action on such subflow. This changeset introduces an infrastructure that allows any MPTCP subflow to schedule actions (MPTCP xmit) on others subflows without resorting to (multiple) process reschedule. A dummy NAPI instance is used instead. When MPTCP needs to trigger action an a different subflow, it enqueues the target subflow on the NAPI backlog and schedule such instance as needed. The dummy NAPI poll method walks the sockets backlog and tries to acquire the (BH) socket lock on each of them. If the socket is owned by the user space, the action will be completed by the sock release cb, otherwise push is started. This change leverages the delegated action infrastructure to avoid invoking the MPTCP worker to spool the pending data, when the packet scheduler picks a subflow other then the one currently processing the incoming MPTCP-level ack. Additionally we further refine the subflow selection invoking the packet scheduler for each chunk of data even inside __mptcp_subflow_push_pending(). v1 -> v2: - fix possible UaF at shutdown time, resetting sock ops after removing the ulp context Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-22mptcp: schedule work for better snd subflow selectionPaolo Abeni
Otherwise the packet scheduler policy will not be enforced when pushing pending data at MPTCP-level ack reception time. Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-22mptcp: do not queue excessive data on subflowsPaolo Abeni
The current packet scheduler can enqueue up to sndbuf data on each subflow. If the send buffer is large and the subflows are not symmetric, this could lead to suboptimal aggregate bandwidth utilization. Limit the amount of queued data to the maximum send window. Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-22mptcp: re-enable sndbuf autotunePaolo Abeni
After commit 6e628cd3a8f7 ("mptcp: use mptcp release_cb for delayed tasks"), MPTCP never sets the flag bit SOCK_NOSPACE on its subflow. As a side effect, autotune never takes place, as it happens inside tcp_new_space(), which in turn is called only when the mentioned bit is set. Let's sendmsg() set the subflows NOSPACE bit when looking for more memory and use the subflow write_space callback to propagate the snd buf update and wake-up the user-space. Additionally, this allows dropping a bunch of duplicate code and makes the SNDBUF_LIMITED chrono relevant again for MPTCP subflows. Fixes: 6e628cd3a8f7 ("mptcp: use mptcp release_cb for delayed tasks") Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-22mptcp: always graft subflow socket to parentPaolo Abeni
Currently, incoming subflows link to the parent socket, while outgoing ones link to a per subflow socket. The latter is not really needed, except at the initial connect() time and for the first subflow. Always graft the outgoing subflow to the parent socket and free the unneeded ones early. This allows some code cleanup, reduces the amount of memory used and will simplify the next patch Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-22tcp: add TTL to SCM_TIMESTAMPING_OPT_STATSYousuk Seung
This patch adds TCP_NLA_TTL to SCM_TIMESTAMPING_OPT_STATS that exports the time-to-live or hop limit of the latest incoming packet with SCM_TSTAMP_ACK. The value exported may not be from the packet that acks the sequence when incoming packets are aggregated. Exporting the time-to-live or hop limit value of incoming packets helps to estimate the hop count of the path of the flow that may change over time. Signed-off-by: Yousuk Seung <ysseung@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Link: https://lore.kernel.org/r/20210120204155.552275-1-ysseung@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-22Merge tag 'ceph-for-5.11-rc5' of git://github.com/ceph/ceph-clientLinus Torvalds
Pull ceph fixes from Ilya Dryomov: "A patch to zero out sensitive cryptographic data and two minor cleanups prompted by the fact that a bunch of code was moved in this cycle" * tag 'ceph-for-5.11-rc5' of git://github.com/ceph/ceph-client: libceph: fix "Boolean result is used in bitwise operation" warning libceph, ceph: disambiguate ceph_connection_operations handlers libceph: zero out session key and connection secret
2021-01-22devlink: Support get and set state of port functionParav Pandit
devlink port function can be in active or inactive state. Allow users to get and set port function's state. When the port function it activated, its operational state may change after a while when the device is created and driver binds to it. Similarly on deactivation flow. To clearly describe the state of the port function and its device's operational state in the host system, define state and opstate attributes. Example of a PCI SF port which supports a port function: $ devlink dev eswitch set pci/0000:06:00.0 mode switchdev $ devlink port show pci/0000:06:00.0/65535: type eth netdev ens2f0np0 flavour physical port 0 splittable false $ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88 pci/0000:08:00.0/32768: type eth netdev eth6 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false function: hw_addr 00:00:00:00:00:00 state inactive opstate detached $ devlink port show pci/0000:06:00.0/32768 pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false function: hw_addr 00:00:00:00:88:88 state inactive opstate detached $ devlink port function set pci/0000:06:00.0/32768 hw_addr 00:00:00:00:88:88 state active $ devlink port show pci/0000:06:00.0/32768 -jp { "port": { "pci/0000:06:00.0/32768": { "type": "eth", "netdev": "ens2f0npf0sf88", "flavour": "pcisf", "controller": 0, "pfnum": 0, "sfnum": 88, "external": false, "splittable": false, "function": { "hw_addr": "00:00:00:00:88:88", "state": "active", "opstate": "attached" } } } } Signed-off-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Vu Pham <vuhuong@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-01-22devlink: Support add and delete devlink portParav Pandit
Extended devlink interface for the user to add and delete a port. Extend devlink to connect user requests to driver to add/delete a port in the device. Driver routines are invoked without holding devlink instance lock. This enables driver to perform several devlink objects registration, unregistration such as (port, health reporter, resource etc) by using existing devlink APIs. This also helps to uniformly use the code for port unregistration during driver unload and during port deletion initiated by user. Examples of add, show and delete commands: $ devlink dev eswitch set pci/0000:06:00.0 mode switchdev $ devlink port show pci/0000:06:00.0/65535: type eth netdev ens2f0np0 flavour physical port 0 splittable false $ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88 pci/0000:06:00.0/32768: type eth netdev eth6 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false function: hw_addr 00:00:00:00:00:00 state inactive opstate detached $ devlink port show pci/0000:06:00.0/32768 pci/0000:06:00.0/32768: type eth netdev eth6 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false function: hw_addr 00:00:00:00:00:00 state inactive opstate detached $ udevadm test-builtin net_id /sys/class/net/eth6 Load module index Parsed configuration file /usr/lib/systemd/network/99-default.link Created link configuration context. Using default interface naming scheme 'v245'. ID_NET_NAMING_SCHEME=v245 ID_NET_NAME_PATH=enp6s0f0npf0sf88 ID_NET_NAME_SLOT=ens2f0npf0sf88 Unload module index Unloaded link configuration context. Signed-off-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Vu Pham <vuhuong@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-01-22devlink: Introduce PCI SF port flavour and port attributeParav Pandit
A PCI sub-function (SF) represents a portion of the device similar to PCI VF. In an eswitch, PCI SF may have port which is normally represented using a representor netdevice. To have better visibility of eswitch port, its association with SF, and its representor netdevice, introduce a PCI SF port flavour. When devlink port flavour is PCI SF, fill up PCI SF attributes of the port. Extend port name creation using PCI PF and SF number scheme on best effort basis, so that vendor drivers can skip defining their own scheme. This is done as cApfNSfM, where A, N and M are controller, PCI PF and PCI SF number respectively. This is similar to existing naming for PCI PF and PCI VF ports. An example view of a PCI SF port: $ devlink port show pci/0000:06:00.0/32768 pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false function: hw_addr 00:00:00:00:88:88 state active opstate attached $ devlink port show pci/0000:06:00.0/32768 -jp { "port": { "pci/0000:06:00.0/32768": { "type": "eth", "netdev": "ens2f0npf0sf88", "flavour": "pcisf", "controller": 0, "pfnum": 0, "sfnum": 88, "splittable": false, "function": { "hw_addr": "00:00:00:00:88:88", "state": "active", "opstate": "attached" } } } } Signed-off-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Vu Pham <vuhuong@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-01-22devlink: Prepare code to fill multiple port function attributesParav Pandit
Prepare code to fill zero or more port function optional attributes. Subsequent patch makes use of this to fill more port function attributes. Signed-off-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Vu Pham <vuhuong@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-01-22cfg80211: change netdev registration/unregistration semanticsJohannes Berg
We used to not require anything in terms of registering netdevs with cfg80211, using a netdev notifier instead. However, in the next patch reducing RTNL locking, this causes big problems, and the simplest way is to just require drivers to do things better. Change the registration/unregistration semantics to require the drivers to call cfg80211_(un)register_netdevice() when this is happening due to a cfg80211 request, i.e. add_virtual_intf() or del_virtual_intf() (or if it somehow has to happen in any other cfg80211 callback). Otherwise, in other contexts, drivers may continue to use the normal netdev (un)registration functions as usual. Internally, we still use the netdev notifier and track (by the new wdev->registered bool) if the wdev had already been added to cfg80211 or not. Link: https://lore.kernel.org/r/20210122161942.cf2f4b65e4e9.Ida8234e50da13eb675b557bac52a713ad4eddf71@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-01-22mac80211: minstrel_ht: fix rounding error in throughput calculationFelix Fietkau
On lower data rates, the throughput calculation has a significant rounding error, causing rates like 48M and 54M OFDM to share the same throughput value with >= 90% success probablity. This is because the result of the division (prob_avg * 1000) / nsecs is really small (8 in this example). Improve accuracy by moving over some zeroes, making better use of the full range of u32 before the division. Signed-off-by: Felix Fietkau <nbd@nbd.name> Link: https://lore.kernel.org/r/20210115120242.89616-10-nbd@nbd.name Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-01-22mac80211: minstrel_ht: increase stats update intervalFelix Fietkau
The shorter interval was leading to too many frames being used for probing Signed-off-by: Felix Fietkau <nbd@nbd.name> Link: https://lore.kernel.org/r/20210115120242.89616-9-nbd@nbd.name Signed-off-by: Johannes Berg <johannes.berg@intel.com>