summaryrefslogtreecommitdiff
path: root/net/core
AgeCommit message (Collapse)Author
2020-04-02neigh: support smaller retrans_time setttingHangbin Liu
Currently, we limited the retrans_time to be greater than HZ/2. i.e. setting retrans_time less than 500ms will not work. This makes the user unable to achieve a more accurate control for bonding arp fast failover. Update the sanity check to HZ/100, which is 10ms, to let users have more ability on the retrans_time control. v3: sync the behavior with IPv6 and update all the timer handler v2: use HZ instead of hard code number Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-02net: core: enable SO_BINDTODEVICE for non-root usersVincent Bernat
Currently, SO_BINDTODEVICE requires CAP_NET_RAW. This change allows a non-root user to bind a socket to an interface if it is not already bound. This is useful to allow an application to bind itself to a specific VRF for outgoing or incoming connections. Currently, an application wanting to manage connections through several VRF need to be privileged. Previously, IP_UNICAST_IF and IPV6_UNICAST_IF were added for Wine (76e21053b5bf3 and c4062dfc425e9) specifically for use by non-root processes. However, they are restricted to sendmsg() and not usable with TCP. Allowing SO_BINDTODEVICE would allow TCP clients to get the same privilege. As for TCP servers, outside the VRF use case, SO_BINDTODEVICE would only further restrict connections a server could accept. When an application is restricted to a VRF (with `ip vrf exec`), the socket is bound to an interface at creation and therefore, a non-privileged call to SO_BINDTODEVICE to escape the VRF fails. When an application bound a socket to SO_BINDTODEVICE and transmit it to a non-privileged process through a Unix socket, a tentative to change the bound device also fails. Before: >>> import socket >>> s=socket.socket(socket.AF_INET, socket.SOCK_STREAM) >>> s.setsockopt(socket.SOL_SOCKET, socket.SO_BINDTODEVICE, b"dummy0") Traceback (most recent call last): File "<stdin>", line 1, in <module> PermissionError: [Errno 1] Operation not permitted After: >>> import socket >>> s=socket.socket(socket.AF_INET, socket.SOCK_STREAM) >>> s.setsockopt(socket.SOL_SOCKET, socket.SO_BINDTODEVICE, b"dummy0") >>> s.setsockopt(socket.SOL_SOCKET, socket.SO_BINDTODEVICE, b"dummy0") Traceback (most recent call last): File "<stdin>", line 1, in <module> PermissionError: [Errno 1] Operation not permitted Signed-off-by: Vincent Bernat <vincent@bernat.ch> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-30Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller
2020-03-30Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-30Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter/IPVS updates for net-next The following patchset contains Netfilter/IPVS updates for net-next: 1) Add support to specify a stateful expression in set definitions, this allows users to specify e.g. counters per set elements. 2) Flowtable software counter support. 3) Flowtable hardware offload counter support, from wenxu. 3) Parallelize flowtable hardware offload requests, from Paul Blakey. This includes a patch to add one work entry per offload command. 4) Several patches to rework nf_queue refcount handling, from Florian Westphal. 4) A few fixes for the flowtable tunnel offload: Fix crash if tunneling information is missing and set up indirect flow block as TC_SETUP_FT, patch from wenxu. 5) Stricter netlink attribute sanity check on filters, from Romain Bellan and Florent Fourcot. 5) Annotations to make sparse happy, from Jules Irenge. 6) Improve icmp errors in debugging information, from Haishuang Yan. 7) Fix warning in IPVS icmp error debugging, from Haishuang Yan. 8) Fix endianess issue in tcp extension header, from Sergey Marinkevich. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-30devlink: Allow setting of packet trap group parametersIdo Schimmel
The previous patch allowed device drivers to publish their default binding between packet trap policers and packet trap groups. However, some users might not be content with this binding and would like to change it. In case user space passed a packet trap policer identifier when setting a packet trap group, invoke the appropriate device driver callback and pass the new policer identifier. v2: * Check for presence of 'DEVLINK_ATTR_TRAP_POLICER_ID' in devlink_trap_group_set() and bail if not present * Add extack error message in case trap group was partially modified Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-30devlink: Add packet trap group parameters supportIdo Schimmel
Packet trap groups are used to aggregate logically related packet traps. Currently, these groups allow user space to batch operations such as setting the trap action of all member traps. In order to prevent the CPU from being overwhelmed by too many trapped packets, it is desirable to bind a packet trap policer to these groups. For example, to limit all the packets that encountered an exception during routing to 10Kpps. Allow device drivers to bind default packet trap policers to packet trap groups when the latter are registered with devlink. The next patch will enable user space to change this default binding. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-30devlink: Add packet trap policers supportIdo Schimmel
Devices capable of offloading the kernel's datapath and perform functions such as bridging and routing must also be able to send (trap) specific packets to the kernel (i.e., the CPU) for processing. For example, a device acting as a multicast-aware bridge must be able to trap IGMP membership reports to the kernel for processing by the bridge module. In most cases, the underlying device is capable of handling packet rates that are several orders of magnitude higher compared to those that can be handled by the CPU. Therefore, in order to prevent the underlying device from overwhelming the CPU, devices usually include packet trap policers that are able to police the trapped packets to rates that can be handled by the CPU. This patch allows capable device drivers to register their supported packet trap policers with devlink. User space can then tune the parameters of these policer (currently, rate and burst size) and read from the device the number of packets that were dropped by the policer, if supported. Subsequent patches in the series will allow device drivers to create default binding between these policers and packet trap groups and allow user space to change the binding. v2: * Add 'strict_start_type' in devlink policy * Have device drivers provide max/min rate/burst size for each policer. Use them to check validity of user provided parameters Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-30bpf: Don't refcount LISTEN sockets in sk_assign()Joe Stringer
Avoid taking a reference on listen sockets by checking the socket type in the sk_assign and in the corresponding skb_steal_sock() code in the the transport layer, and by ensuring that the prefetch free (sock_pfree) function uses the same logic to check whether the socket is refcounted. Suggested-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Joe Stringer <joe@wand.net.nz> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200329225342.16317-4-joe@wand.net.nz
2020-03-30bpf: Add socket assign supportJoe Stringer
Add support for TPROXY via a new bpf helper, bpf_sk_assign(). This helper requires the BPF program to discover the socket via a call to bpf_sk*_lookup_*(), then pass this socket to the new helper. The helper takes its own reference to the socket in addition to any existing reference that may or may not currently be obtained for the duration of BPF processing. For the destination socket to receive the traffic, the traffic must be routed towards that socket via local route. The simplest example route is below, but in practice you may want to route traffic more narrowly (eg by CIDR): $ ip route add local default dev lo This patch avoids trying to introduce an extra bit into the skb->sk, as that would require more invasive changes to all code interacting with the socket to ensure that the bit is handled correctly, such as all error-handling cases along the path from the helper in BPF through to the orphan path in the input. Instead, we opt to use the destructor variable to switch on the prefetch of the socket. Signed-off-by: Joe Stringer <joe@wand.net.nz> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200329225342.16317-2-joe@wand.net.nz
2020-03-30devlink: Add auto dump flag to health reporterEran Ben Elisha
On low memory system, run time dumps can consume too much memory. Add administrator ability to disable auto dumps per reporter as part of the error flow handle routine. This attribute is not relevant while executing DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET. By default, auto dump is activated for any reporter that has a dump method, as part of the reporter registration to devlink. Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-30devlink: Implicitly set auto recover flag when registering health reporterEran Ben Elisha
When health reporter is registered to devlink, devlink will implicitly set auto recover if and only if the reporter has a recover method. No reason to explicitly get the auto recover flag from the driver. Remove this flag from all drivers that called devlink_health_reporter_create. All existing health reporters set auto recovery to true if they have a recover method. Yet, administrator can unset auto recover via netlink command as prior to this patch. Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-30net: devlink: use NL_SET_ERR_MSG_MOD instead of NL_SET_ERR_MSGJiri Pirko
The rest of the devlink code sets the extack message using NL_SET_ERR_MSG_MOD. Change the existing appearances of NL_SET_ERR_MSG to NL_SET_ERR_MSG_MOD. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-30net: fix fraglist segmentation reference count leakFlorian Westphal
Xin Long says: On udp rx path udp_rcv_segment() may do segment where the frag skbs will get the header copied from the head skb in skb_segment_list() by calling __copy_skb_header(), which could overwrite the frag skbs' extensions by __skb_ext_copy() and cause a leak. This issue was found after loading esp_offload where a sec path ext is set in the skb. Fix this by discarding head state of the fraglist skb before replacing its contents. Fixes: 3a1296a38d0cf62 ("net: Support GRO/GSO fraglist chaining.") Cc: Steffen Klassert <steffen.klassert@secunet.com> Reported-by: Xiumei Mu <xmu@redhat.com> Tested-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29ethtool: add timestamping related string setsMichal Kubecek
Add three string sets related to timestamping information: ETH_SS_SOF_TIMESTAMPING: SOF_TIMESTAMPING_* flags ETH_SS_TS_TX_TYPES: timestamping Tx types ETH_SS_TS_RX_FILTERS: timestamping Rx filters These will be used for TIMESTAMP_GET request. v2: avoid compiler warning ("enumeration value not handled in switch") in net_hwtstamp_validate() v3: omit dash in Tx type names ("one-step-*" -> "onestep-*"), suggested by Richard Cochran Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Acked-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29net: ipv6: add rpl sr tunnelAlexander Aring
This patch adds functionality to configure routes for RPL source routing functionality. There is no IPIP functionality yet implemented which can be added later when the cases when to use IPv6 encapuslation comes more clear. Signed-off-by: Alexander Aring <alex.aring@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29net: add net available in build_stateAlexander Aring
The build_state callback of lwtunnel doesn't contain the net namespace structure yet. This patch will add it so we can check on specific address configuration at creation time of rpl source routes. Signed-off-by: Alexander Aring <alex.aring@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29net: Fix typo of SKB_SGO_CB_OFFSETCambda Zhu
The SKB_SGO_CB_OFFSET should be SKB_GSO_CB_OFFSET which means the offset of the GSO in skb cb. This patch fixes the typo. Fixes: 9207f9d45b0a ("net: preserve IP control block during GSO segmentation") Signed-off-by: Cambda Zhu <cambda@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29net: page pool: allow to pass zero flags to page_pool_init()Denis Kirjanov
page pool API can be useful for non-DMA cases like xen-netfront driver so let's allow to pass zero flags to page pool flags. v2: check DMA direction only if PP_FLAG_DMA_MAP is set Signed-off-by: Denis Kirjanov <kda@linux-powerpc.org> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-28xdp: Support specifying expected existing program when attaching XDPToke Høiland-Jørgensen
While it is currently possible for userspace to specify that an existing XDP program should not be replaced when attaching to an interface, there is no mechanism to safely replace a specific XDP program with another. This patch adds a new netlink attribute, IFLA_XDP_EXPECTED_FD, which can be set along with IFLA_XDP_FD. If set, the kernel will check that the program currently loaded on the interface matches the expected one, and fail the operation if it does not. This corresponds to a 'cmpxchg' memory operation. Setting the new attribute with a negative value means that no program is expected to be attached, which corresponds to setting the UPDATE_IF_NOEXIST flag. A new companion flag, XDP_FLAGS_REPLACE, is also added to explicitly request checking of the EXPECTED_FD attribute. This is needed for userspace to discover whether the kernel supports the new attribute. Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/bpf/158515700640.92963.3551295145441017022.stgit@toke.dk
2020-03-27bpf: Enable retrival of pid/tgid/comm from bpf cgroup hooksDaniel Borkmann
We already have the bpf_get_current_uid_gid() helper enabled, and given we now have perf event RB output available for connect(), sendmsg(), recvmsg() and bind-related hooks, add a trivial change to enable bpf_get_current_pid_tgid() and bpf_get_current_comm() as well. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/18744744ed93c06343be8b41edcfd858706f39d7.1585323121.git.daniel@iogearbox.net
2020-03-27bpf: Enable bpf cgroup hooks to retrieve cgroup v2 and ancestor idDaniel Borkmann
Enable the bpf_get_current_cgroup_id() helper for connect(), sendmsg(), recvmsg() and bind-related hooks in order to retrieve the cgroup v2 context which can then be used as part of the key for BPF map lookups, for example. Given these hooks operate in process context 'current' is always valid and pointing to the app that is performing mentioned syscalls if it's subject to a v2 cgroup. Also with same motivation of commit 7723628101aa ("bpf: Introduce bpf_skb_ancestor_cgroup_id helper") enable retrieval of ancestor from current so the cgroup id can be used for policy lookups which can then forbid connect() / bind(), for example. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/d2a7ef42530ad299e3cbb245e6c12374b72145ef.1585323121.git.daniel@iogearbox.net
2020-03-27bpf: Allow to retrieve cgroup v1 classid from v2 hooksDaniel Borkmann
Today, Kubernetes is still operating on cgroups v1, however, it is possible to retrieve the task's classid based on 'current' out of connect(), sendmsg(), recvmsg() and bind-related hooks for orchestrators which attach to the root cgroup v2 hook in a mixed env like in case of Cilium, for example, in order to then correlate certain pod traffic and use it as part of the key for BPF map lookups. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/555e1c69db7376c0947007b4951c260e1074efc3.1585323121.git.daniel@iogearbox.net
2020-03-27bpf: Add netns cookie and enable it for bpf cgroup hooksDaniel Borkmann
In Cilium we're mainly using BPF cgroup hooks today in order to implement kube-proxy free Kubernetes service translation for ClusterIP, NodePort (*), ExternalIP, and LoadBalancer as well as HostPort mapping [0] for all traffic between Cilium managed nodes. While this works in its current shape and avoids packet-level NAT for inter Cilium managed node traffic, there is one major limitation we're facing today, that is, lack of netns awareness. In Kubernetes, the concept of Pods (which hold one or multiple containers) has been built around network namespaces, so while we can use the global scope of attaching to root BPF cgroup hooks also to our advantage (e.g. for exposing NodePort ports on loopback addresses), we also have the need to differentiate between initial network namespaces and non-initial one. For example, ExternalIP services mandate that non-local service IPs are not to be translated from the host (initial) network namespace as one example. Right now, we have an ugly work-around in place where non-local service IPs for ExternalIP services are not xlated from connect() and friends BPF hooks but instead via less efficient packet-level NAT on the veth tc ingress hook for Pod traffic. On top of determining whether we're in initial or non-initial network namespace we also have a need for a socket-cookie like mechanism for network namespaces scope. Socket cookies have the nice property that they can be combined as part of the key structure e.g. for BPF LRU maps without having to worry that the cookie could be recycled. We are planning to use this for our sessionAffinity implementation for services. Therefore, add a new bpf_get_netns_cookie() helper which would resolve both use cases at once: bpf_get_netns_cookie(NULL) would provide the cookie for the initial network namespace while passing the context instead of NULL would provide the cookie from the application's network namespace. We're using a hole, so no size increase; the assignment happens only once. Therefore this allows for a comparison on initial namespace as well as regular cookie usage as we have today with socket cookies. We could later on enable this helper for other program types as well as we would see need. (*) Both externalTrafficPolicy={Local|Cluster} types [0] https://github.com/cilium/cilium/blob/master/bpf/bpf_sock.c Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/c47d2346982693a9cf9da0e12690453aded4c788.1585323121.git.daniel@iogearbox.net
2020-03-27bpf: Enable perf event rb output for bpf cgroup progsDaniel Borkmann
Currently, connect(), sendmsg(), recvmsg() and bind-related hooks are all lacking perf event rb output in order to push notifications or monitoring events up to user space. Back in commit a5a3a828cd00 ("bpf: add perf event notificaton support for sock_ops"), I've worked with Sowmini to enable them for sock_ops where the context part is not used (as opposed to skbs for example where the packet data can be appended). Make the bpf_sockopt_event_output() helper generic and enable it for mentioned hooks. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/69c39daf87e076b31e52473c902e9bfd37559124.1585323121.git.daniel@iogearbox.net
2020-03-27bpf: Enable retrieval of socket cookie for bind/post-bind hookDaniel Borkmann
We currently make heavy use of the socket cookie in BPF's connect(), sendmsg() and recvmsg() hooks for load-balancing decisions. However, it is currently not enabled/implemented in BPF {post-}bind hooks where it can later be used in combination for correlation in the tc egress path, for example. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/e9d71f310715332f12d238cc650c1edc5be55119.1585323121.git.daniel@iogearbox.net
2020-03-27netfilter: flowtable: Fix incorrect tc_setup_type typewenxu
The indirect block setup should use TC_SETUP_FT as the type instead of TC_SETUP_BLOCK. Adjust existing users of the indirect flow block infrastructure. Fixes: b5140a36da78 ("netfilter: flowtable: add indr block setup support") Signed-off-by: wenxu <wenxu@ucloud.cn> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-26devlink: implement DEVLINK_CMD_REGION_NEWJacob Keller
Implement support for the DEVLINK_CMD_REGION_NEW command for creating snapshots. This new command parallels the existing DEVLINK_CMD_REGION_DEL. In order for DEVLINK_CMD_REGION_NEW to work for a region, the new ".snapshot" operation must be implemented in the region's ops structure. The desired snapshot id must be provided. This helps avoid confusion on the purpose of DEVLINK_CMD_REGION_NEW, and keeps the API simpler. The requested id will be inserted into the xarray tracking the number of snapshots using each id. If this id is already used by another snapshot on any region, an error will be returned. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-26devlink: track snapshot id usage count using an xarrayJacob Keller
Each snapshot created for a devlink region must have an id. These ids are supposed to be unique per "event" that caused the snapshot to be created. Drivers call devlink_region_snapshot_id_get to obtain a new id to use for a new event trigger. The id values are tracked per devlink, so that the same id number can be used if a triggering event creates multiple snapshots on different regions. There is no mechanism for snapshot ids to ever be reused. Introduce an xarray to store the count of how many snapshots are using a given id, replacing the snapshot_id field previously used for picking the next id. The devlink_region_snapshot_id_get() function will use xa_alloc to insert an initial value of 1 value at an available slot between 0 and U32_MAX. The new __devlink_snapshot_id_increment() and __devlink_snapshot_id_decrement() functions will be used to track how many snapshots currently use an id. Drivers must now call devlink_snapshot_id_put() in order to release their reference of the snapshot id after adding region snapshots. By tracking the total number of snapshots using a given id, it is possible for the decrement() function to erase the id from the xarray when it is not in use. With this method, a snapshot id can become reused again once all snapshots that referred to it have been deleted via DEVLINK_CMD_REGION_DEL, and the driver has finished adding snapshots. This work also paves the way to introduce a mechanism for userspace to request a snapshot. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-26devlink: report error once U32_MAX snapshot ids have been usedJacob Keller
The devlink_snapshot_id_get() function returns a snapshot id. The snapshot id is a u32, so there is no way to indicate an error code. A future change is going to possibly add additional cases where this function could fail. Refactor the function to return the snapshot id in an argument, so that it can return zero or an error value. This ensures that snapshot ids cannot be confused with error values, and aids in the future refactor of snapshot id allocation management. Because there is no current way to release previously used snapshot ids, add a simple check ensuring that an error is reported in case the snapshot_id would over flow. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-26devlink: extract snapshot id allocation to helper functionJacob Keller
A future change is going to implement a new devlink command to request a snapshot on demand. As part of this, the logic for handling the snapshot ids will be refactored. To simplify the snapshot id allocation function, move it to a separate function prefixed by `__`. This helper function will assume the lock is held. While no other callers will exist, it simplifies refactoring the logic because there is no need to complicate the function with gotos to handle unlocking on failure. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-26devlink: use -ENOSPC to indicate no more room for snapshotsJacob Keller
The devlink_region_snapshot_create function returns -ENOMEM when the maximum number of snapshots has been reached. This is confusing because it is not an issue of being out of memory. Change this to use -ENOSPC instead. Reported-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-26devlink: add function to take snapshot while lockedJacob Keller
A future change is going to add a new devlink command to request a snapshot on demand. This function will want to call the devlink_region_snapshot_create function while already holding the devlink instance lock. Extract the logic of this function into a static function prefixed by `__` to indicate that it is an internal helper function. Modify the original function to be implemented in terms of the new locked function. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-26devlink: trivial: fix tab in function documentationJacob Keller
The function documentation comment for devlink_region_snapshot_create included a literal tab character between 'future analyses' that was difficult to spot as it happened to only display as one space wide. Fix the comment to use a space here instead of a stray tab appearing in the middle of a sentence. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-26devlink: convert snapshot destructor callback to region opJacob Keller
It does not makes sense that two snapshots for a given region would use different destructors. Simplify snapshot creation by adding a .destructor op for regions. This operation will replace the data_destructor for the snapshot creation, and makes snapshot creation easier. Noticed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-26devlink: prepare to support region operationsJacob Keller
Modify the devlink region code in preparation for adding new operations on regions. Create a devlink_region_ops structure, and move the name pointer from within the devlink_region structure into the ops structure (similar to the devlink_health_reporter_ops). This prepares the regions to enable support of additional operations in the future such as requesting snapshots, or accessing the region directly without a snapshot. In order to re-use the constant strings in the mlx4 driver their declaration must be changed to 'const char * const' to ensure the compiler realizes that both the data and the pointer cannot change. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-26Merge tag 'mlx5-updates-2020-03-25' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux Saeed Mahameed says: ==================== mlx5-updates-2020-03-25 1) Cleanups from Dan Carpenter and wenxu. 2) Paul and Roi, Some minor updates and fixes to E-Switch to address issues introduced in the previous reg_c0 updates series. 3) Eli Cohen simplifies and improves flow steering matching group searches and flow table entries version management. 4) Parav Pandit, improves devlink eswitch mode changes thread safety. By making devlink rely on driver for thread safety and introducing mlx5 eswitch mode change protection. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-25devlink: Rely on driver eswitch thread safety instead of devlinkParav Pandit
devlink_nl_cmd_eswitch_set_doit() doesn't hold devlink->lock mutex while invoking driver callback. This is likely due to eswitch mode setting involves adding/remove devlink ports, health reporters or other devlink objects for a devlink device. So it is driver responsiblity to ensure thread safe eswitch state transition happening via either sriov legacy enablement or via devlink eswitch set callback. Therefore, get() callback should also be invoked without holding devlink->lock mutex. Vendor driver can use same internal lock which it uses during eswitch mode set() callback. This makes get() and set() implimentation symmetric in devlink core and in vendor drivers. Hence, remove holding devlink->lock mutex during eswitch get() callback. Failing to do so results into below deadlock scenario when mlx5_core driver is improved to handle eswitch mode set critical section invoked by devlink and sriov sysfs interface in subsequent patch. devlink_nl_cmd_eswitch_set_doit() mlx5_eswitch_mode_set() mutex_lock(esw->mode_lock) <- Lock A [...] register_devlink_port() mutex_lock(&devlink->lock); <- lock B mutex_lock(&devlink->lock); <- lock B devlink_nl_cmd_eswitch_get_doit() mlx5_eswitch_mode_get() mutex_lock(esw->mode_lock) <- Lock A In subsequent patch, mlx5_core driver uses its internal lock during get() and set() eswitch callbacks. Other drivers have been inspected which returns either constant during get operations or reads the value from already allocated structure. Hence it is safe to remove the lock in get( ) callback and let vendor driver handle it. Reviewed-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Mark Bloch <markb@mellanox.com> Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-03-25Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller
Overlapping header include additions in macsec.c A bug fix in 'net' overlapping with the removal of 'version' string in ena_netdev.c Overlapping test additions in selftests Makefile Overlapping PCI ID table adjustments in iwlwifi driver. Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-25net: Fix CONFIG_NET_CLS_ACT=n and CONFIG_NFT_FWD_NETDEV={y, m} buildPablo Neira Ayuso
net/netfilter/nft_fwd_netdev.c: In function ‘nft_fwd_netdev_eval’: net/netfilter/nft_fwd_netdev.c:32:10: error: ‘struct sk_buff’ has no member named ‘tc_redirected’ pkt->skb->tc_redirected = 1; ^~ net/netfilter/nft_fwd_netdev.c:33:10: error: ‘struct sk_buff’ has no member named ‘tc_from_ingress’ pkt->skb->tc_from_ingress = 1; ^~ To avoid a direct dependency with tc actions from netfilter, wrap the redirect bits around CONFIG_NET_REDIRECT and move helpers to include/linux/skbuff.h. Turn on this toggle from the ifb driver, the only existing client of these bits in the tree. This patch adds skb_set_redirected() that sets on the redirected bit on the skbuff, it specifies if the packet was redirect from ingress and resets the timestamp (timestamp reset was originally missing in the netfilter bugfix). Fixes: bcfabee1afd99484 ("netfilter: nft_fwd_netdev: allow to redirect to ifb via ingress") Reported-by: noreply@ellerman.id.au Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-25net: use indirect call wrappers for skb_copy_datagram_iter()Eric Dumazet
TCP recvmsg() calls skb_copy_datagram_iter(), which calls an indirect function (cb pointing to simple_copy_to_iter()) for every MSS (fragment) present in the skb. CONFIG_RETPOLINE=y forces a very expensive operation that we can avoid thanks to indirect call wrappers. This patch gives a 13% increase of performance on a single flow, if the bottleneck is the thread reading the TCP socket. Fixes: 950fcaecd5cc ("datagram: consolidate datagram copy to iter helpers") Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-23devlink: Only pass packet trap group identifier in trap structureIdo Schimmel
Packet trap groups are now explicitly registered by drivers and not implicitly registered when the packet traps are registered. Therefore, there is no need to encode entire group structure the trap is associated with inside the trap structure. Instead, only pass the group identifier. Refer to it as initial group identifier, as future patches will allow user space to move traps between groups. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-23devlink: Stop reference counting packet trap groupsIdo Schimmel
Now that drivers explicitly register their supported packet trap groups there is no for devlink to create them on-demand and destroy them when their reference count reaches zero. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-23devlink: Add API to register packet trap groupsIdo Schimmel
Currently, packet trap groups are implicitly registered by drivers upon packet trap registration. When the traps are registered, each is associated with a group and the group is created by devlink, if it does not exist already. This makes it difficult for drivers to pass additional attributes for the groups. Therefore, as a preparation for future patches that require passing additional group attributes, add an API to explicitly register / unregister these groups. Next patches will convert existing drivers to use this API. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-23net: Make skb_segment not to compute checksum if network controller supports ↵Yadu Kishore
checksumming Problem: TCP checksum in the output path is not being offloaded during GSO in the following case: The network driver does not support scatter-gather but supports checksum offload with NETIF_F_HW_CSUM. Cause: skb_segment calls skb_copy_and_csum_bits if the network driver does not announce NETIF_F_SG. It does not check if the driver supports NETIF_F_HW_CSUM. So for devices which might want to offload checksum but do not support SG there is currently no way to do so if GSO is enabled. Solution: In skb_segment check if the network controller does checksum and if so call skb_copy_bits instead of skb_copy_and_csum_bits. Testing: Without the patch, ran iperf TCP traffic with NETIF_F_HW_CSUM enabled in the network driver. Observed the TCP checksum offload is not happening since the skbs received by the driver in the output path have skb->ip_summed set to CHECKSUM_NONE. With the patch ran iperf TCP traffic and observed that TCP checksum is being offloaded with skb->ip_summed set to CHECKSUM_PARTIAL. Also tested with the patch by disabling NETIF_F_HW_CSUM in the driver to cover the newly introduced if-else code path in skb_segment. Link: https://lore.kernel.org/netdev/CA+FuTSeYGYr3Umij+Mezk9CUcaxYwqEe5sPSuXF8jPE2yMFJAw@mail.gmail.com Signed-off-by: Yadu Kishore <kyk.segfault@gmail.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-18netfilter: revert introduction of egress hookDaniel Borkmann
This reverts the following commits: 8537f78647c0 ("netfilter: Introduce egress hook") 5418d3881e1f ("netfilter: Generalize ingress hook") b030f194aed2 ("netfilter: Rename ingress hook include file") >From the discussion in [0], the author's main motivation to add a hook in fast path is for an out of tree kernel module, which is a red flag to begin with. Other mentioned potential use cases like NAT{64,46} is on future extensions w/o concrete code in the tree yet. Revert as suggested [1] given the weak justification to add more hooks to critical fast-path. [0] https://lore.kernel.org/netdev/cover.1583927267.git.lukas@wunner.de/ [1] https://lore.kernel.org/netdev/20200318.011152.72770718915606186.davem@davemloft.net/ Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: David Miller <davem@davemloft.net> Cc: Pablo Neira Ayuso <pablo@netfilter.org> Cc: Alexei Starovoitov <ast@kernel.org> Nacked-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for net-next: 1) Use nf_flow_offload_tuple() to fetch flow stats, from Paul Blakey. 2) Add new xt_IDLETIMER hard mode, from Manoj Basapathi. Follow up patch to clean up this new mode, from Dan Carpenter. 3) Add support for geneve tunnel options, from Xin Long. 4) Make sets built-in and remove modular infrastructure for sets, from Florian Westphal. 5) Remove unused TEMPLATE_NULLS_VAL, from Li RongQing. 6) Statify nft_pipapo_get, from Chen Wandun. 7) Use C99 flexible-array member, from Gustavo A. R. Silva. 8) More descriptive variable names for bitwise, from Jeremy Sowden. 9) Four patches to add tunnel device hardware offload to the flowtable infrastructure, from wenxu. 10) pipapo set supports for 8-bit grouping, from Stefano Brivio. 11) pipapo can switch between nibble and byte grouping, also from Stefano. 12) Add AVX2 vectorized version of pipapo, from Stefano Brivio. 13) Update pipapo to be use it for single ranges, from Stefano. 14) Add stateful expression support to elements via control plane, eg. counter per element. 15) Re-visit sysctls in unprivileged namespaces, from Florian Westphal. 15) Add new egress hook, from Lukas Wunner. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-17net: core: dev.c: fix a documentation warningMauro Carvalho Chehab
There's a markup for link with is "foo_". On this kernel-doc comment, we don't want this, but instead, place a literal reference. So, escape the literal with ``foo``, in order to avoid this warning: ./net/core/dev.c:5195: WARNING: Unknown target name: "page_is". Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-17net: ethtool: require drivers to set supported_coalesce_paramsJakub Kicinski
Now that all in-tree drivers have been updated we can make the supported_coalesce_params mandatory. To save debugging time in case some driver was missed (or is out of tree) add a warning when netdev is registered with set_coalesce but without supported_coalesce_params. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-18netfilter: Introduce egress hookLukas Wunner
Commit e687ad60af09 ("netfilter: add netfilter ingress hook after handle_ing() under unique static key") introduced the ability to classify packets on ingress. Allow the same on egress. Position the hook immediately before a packet is handed to tc and then sent out on an interface, thereby mirroring the ingress order. This order allows marking packets in the netfilter egress hook and subsequently using the mark in tc. Another benefit of this order is consistency with a lot of existing documentation which says that egress tc is performed after netfilter hooks. Egress hooks already exist for the most common protocols, such as NF_INET_LOCAL_OUT or NF_ARP_OUT, and those are to be preferred because they are executed earlier during packet processing. However for more exotic protocols, there is currently no provision to apply netfilter on egress. A common workaround is to enslave the interface to a bridge and use ebtables, or to resort to tc. But when the ingress hook was introduced, consensus was that users should be given the choice to use netfilter or tc, whichever tool suits their needs best: https://lore.kernel.org/netdev/20150430153317.GA3230@salvia/ This hook is also useful for NAT46/NAT64, tunneling and filtering of locally generated af_packet traffic such as dhclient. There have also been occasional user requests for a netfilter egress hook in the past, e.g.: https://www.spinics.net/lists/netfilter/msg50038.html Performance measurements with pktgen surprisingly show a speedup rather than a slowdown with this commit: * Without this commit: Result: OK: 34240933(c34238375+d2558) usec, 100000000 (60byte,0frags) 2920481pps 1401Mb/sec (1401830880bps) errors: 0 * With this commit: Result: OK: 33997299(c33994193+d3106) usec, 100000000 (60byte,0frags) 2941410pps 1411Mb/sec (1411876800bps) errors: 0 * Without this commit + tc egress: Result: OK: 39022386(c39019547+d2839) usec, 100000000 (60byte,0frags) 2562631pps 1230Mb/sec (1230062880bps) errors: 0 * With this commit + tc egress: Result: OK: 37604447(c37601877+d2570) usec, 100000000 (60byte,0frags) 2659259pps 1276Mb/sec (1276444320bps) errors: 0 * With this commit + nft egress: Result: OK: 41436689(c41434088+d2600) usec, 100000000 (60byte,0frags) 2413320pps 1158Mb/sec (1158393600bps) errors: 0 Tested on a bare-metal Core i7-3615QM, each measurement was performed three times to verify that the numbers are stable. Commands to perform a measurement: modprobe pktgen echo "add_device lo@3" > /proc/net/pktgen/kpktgend_3 samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh -i 'lo@3' -n 100000000 Commands for testing tc egress: tc qdisc add dev lo clsact tc filter add dev lo egress protocol ip prio 1 u32 match ip dst 4.3.2.1/32 Commands for testing nft egress: nft add table netdev t nft add chain netdev t co \{ type filter hook egress device lo priority 0 \; \} nft add rule netdev t co ip daddr 4.3.2.1/32 drop All testing was performed on the loopback interface to avoid distorting measurements by the packet handling in the low-level Ethernet driver. Signed-off-by: Lukas Wunner <lukas@wunner.de> Cc: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>