summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2020-11-30mptcp: open code mptcp variant for lock_sockPaolo Abeni
This allows invoking an additional callback under the socket spin lock. Will be used by the next patches to avoid additional spin lock contention. Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-01xsk: Propagate napi_id to XDP socket Rx pathBjörn Töpel
Add napi_id to the xdp_rxq_info structure, and make sure the XDP socket pick up the napi_id in the Rx path. The napi_id is used to find the corresponding NAPI structure for socket busy polling. Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Acked-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/bpf/20201130185205.196029-7-bjorn.topel@gmail.com
2020-12-01xsk: Add busy-poll support for {recv,send}msg()Björn Töpel
Wire-up XDP socket busy-poll support for recvmsg() and sendmsg(). If the XDP socket prefers busy-polling, make sure that no wakeup/IPI is performed. Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20201130185205.196029-6-bjorn.topel@gmail.com
2020-12-01xsk: Check need wakeup flag in sendmsg()Björn Töpel
Add a check for need wake up in sendmsg(), so that if a user calls sendmsg() when no wakeup is needed, do not trigger a wakeup. To simplify the need wakeup check in the syscall, unconditionally enable the need wakeup flag for Tx. This has a side-effect for poll(); If poll() is called for a socket without enabled need wakeup, a Tx wakeup is unconditionally performed. The wakeup matrix for AF_XDP now looks like: need wakeup | poll() | sendmsg() | recvmsg() ------------+--------------+-------------+------------ disabled | wake Tx | wake Tx | nop enabled | check flag; | check flag; | check flag; | wake Tx/Rx | wake Tx | wake Rx Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20201130185205.196029-5-bjorn.topel@gmail.com
2020-12-01xsk: Add support for recvmsg()Björn Töpel
Add support for non-blocking recvmsg() to XDP sockets. Previously, only sendmsg() was supported by XDP socket. Now, for symmetry and the upcoming busy-polling support, recvmsg() is added. Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20201130185205.196029-4-bjorn.topel@gmail.com
2020-12-01net: Add SO_BUSY_POLL_BUDGET socket optionBjörn Töpel
This option lets a user set a per socket NAPI budget for busy-polling. If the options is not set, it will use the default of 8. Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/bpf/20201130185205.196029-3-bjorn.topel@gmail.com
2020-12-01net: Introduce preferred busy-pollingBjörn Töpel
The existing busy-polling mode, enabled by the SO_BUSY_POLL socket option or system-wide using the /proc/sys/net/core/busy_read knob, is an opportunistic. That means that if the NAPI context is not scheduled, it will poll it. If, after busy-polling, the budget is exceeded the busy-polling logic will schedule the NAPI onto the regular softirq handling. One implication of the behavior above is that a busy/heavy loaded NAPI context will never enter/allow for busy-polling. Some applications prefer that most NAPI processing would be done by busy-polling. This series adds a new socket option, SO_PREFER_BUSY_POLL, that works in concert with the napi_defer_hard_irqs and gro_flush_timeout knobs. The napi_defer_hard_irqs and gro_flush_timeout knobs were introduced in commit 6f8b12d661d0 ("net: napi: add hard irqs deferral feature"), and allows for a user to defer interrupts to be enabled and instead schedule the NAPI context from a watchdog timer. When a user enables the SO_PREFER_BUSY_POLL, again with the other knobs enabled, and the NAPI context is being processed by a softirq, the softirq NAPI processing will exit early to allow the busy-polling to be performed. If the application stops performing busy-polling via a system call, the watchdog timer defined by gro_flush_timeout will timeout, and regular softirq handling will resume. In summary; Heavy traffic applications that prefer busy-polling over softirq processing should use this option. Example usage: $ echo 2 | sudo tee /sys/class/net/ens785f1/napi_defer_hard_irqs $ echo 200000 | sudo tee /sys/class/net/ens785f1/gro_flush_timeout Note that the timeout should be larger than the userspace processing window, otherwise the watchdog will timeout and fall back to regular softirq processing. Enable the SO_BUSY_POLL/SO_PREFER_BUSY_POLL options on your socket. Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/bpf/20201130185205.196029-2-bjorn.topel@gmail.com
2020-11-30xdp: Handle MEM_TYPE_XSK_BUFF_POOL correctly in xdp_return_buff()Björn Töpel
It turns out that it does exist a path where xdp_return_buff() is being passed an XDP buffer of type MEM_TYPE_XSK_BUFF_POOL. This path is when AF_XDP zero-copy mode is enabled, and a buffer is redirected to a DEVMAP with an attached XDP program that drops the buffer. This change simply puts the handling of MEM_TYPE_XSK_BUFF_POOL back into xdp_return_buff(). Fixes: 82c41671ca4f ("xdp: Simplify xdp_return_{frame, frame_rx_napi, buff}") Reported-by: Maxim Mikityanskiy <maximmi@nvidia.com> Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Maxim Mikityanskiy <maximmi@nvidia.com> Link: https://lore.kernel.org/bpf/20201127171726.123627-1-bjorn.topel@gmail.com
2020-11-30NFSD: Replace the internals of the READ_BUF() macroChuck Lever
Convert the READ_BUF macro in nfs4xdr.c from open code to instead use the new xdr_stream-style decoders already in use by the encode side (and by the in-kernel NFS client implementation). Once this conversion is done, each individual NFSv4 argument decoder can be independently cleaned up to replace these macros with C code. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30SUNRPC: Prepare for xdr_stream-style decoding on the server-sideChuck Lever
A "permanent" struct xdr_stream is allocated in struct svc_rqst so that it is usable by all server-side decoders. A per-rqst scratch buffer is also allocated to handle decoding XDR data items that cross page boundaries. To demonstrate how it will be used, add the first call site for the new svcxdr_init_decode() API. As an additional part of the overall conversion, add symbolic constants for successful and failed XDR operations. Returning "0" is overloaded. Sometimes it means something failed, but sometimes it means success. To make it more clear when XDR decoding functions succeed or fail, introduce symbolic constants. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30SUNRPC: Add xdr_set_scratch_page() and xdr_reset_scratch_buffer()Chuck Lever
Clean up: De-duplicate some frequently-used code. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30SUNRPC: Move the svc_xdr_recvfrom() tracepointChuck Lever
Commit c509f15a5801 ("SUNRPC: Split the xdr_buf event class") added display of the rqst's XID to the svc_xdr_buf_class. However, when the recvfrom tracepoint fires, rq_xid has yet to be filled in with the current XID. So it ends up recording the previous XID that was handled by that svc_rqst. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30svcrdma: support multiple Read chunks per RPCChuck Lever
An efficient way to handle multiple Read chunks is to post them all together and then take a single completion. This is also how the code is already structured: when the Read completion fires, all portions of the incoming RPC message are available to be assembled. The difficult problem is setting up the Read sink buffers so that the server pulls the client's data into place, making subsequent pull-up unnecessary. There are several cases: * No Read chunks. No-op. * One data item Read chunk. This is the fast case, where the inline part of the RPC-over-RDMA message becomes the head and tail, and the data item chunk is placed in buf->pages. * A Position-zero Read chunk. Treated like TCP: the Read chunk is pulled into contiguous pages. + A Position-zero Read chunk with data item chunks. Treated like TCP: all of the Read chunks are pulled into contiguous pages. + Multiple data item chunks. Treated like TCP: the inline part is copied and the data item chunks are pulled into contiguous pages. The "*" cases are already supported. This patch adds support for the "+" cases. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30svcrdma: Use the new parsed chunk list when pulling Read chunksChuck Lever
As a pre-requisite for handling multiple Read chunks in each Read list, convert svc_rdma_recv_read_chunk() to use the new parsed Read chunk list. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30svcrdma: Rename info::ri_chunklenChuck Lever
I'm about to change the purpose of ri_chunklen: Instead of tracking the number of bytes in one Read chunk, it will track the total number of bytes in the Read list. Rename it for clarity. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30svcrdma: Clean up chunk tracepointsChuck Lever
We already have trace_svcrdma_decode_rseg(), which records each ingress Read segment. Instead of reporting those again when they are about to be posted as RDMA Reads, let's fire one tracepoint before posting each type of chunk. So we'll get: nfsd-1998 [002] 321.666615: svcrdma_decode_rseg: cq.id=4 cid=42 segno=0 position=0 192@0x013ca9ebfae14000:0xb0010b05 nfsd-1998 [002] 321.666615: svcrdma_decode_rseg: cq.id=4 cid=42 segno=1 position=0 7688@0x013ca9ebf914e000:0xb0010a05 nfsd-1998 [002] 321.666615: svcrdma_decode_rseg: cq.id=4 cid=42 segno=2 position=0 28@0x013ca9ebfae15000:0xb0010905 nfsd-1998 [002] 321.666622: svcrdma_decode_rqst: cq.id=4 cid=42 xid=0x013ca9eb vers=1 credits=128 proc=RDMA_NOMSG hdrlen=100 nfsd-1998 [002] 321.666642: svcrdma_post_read_chunk: cq.id=3 cid=112 sqecount=3 kworker/2:1H-221 [002] 321.673949: svcrdma_wc_read: cq.id=3 cid=112 status=SUCCESS (0/0x0) Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30svcrdma: Remove chunk list pointersChuck Lever
Clean up: These pointers are no longer used. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30svcrdma: Support multiple Write chunks in svc_rdma_send_reply_chunkChuck Lever
Refactor svc_rdma_send_reply_chunk() so that it Sends only the parts of rq_res that do not contain a result payload. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30svcrdma: Support multiple Write chunks in svc_rdma_map_reply_msg()Chuck Lever
Refactor: svc_rdma_map_reply_msg() is restructured to DMA map only the parts of rq_res that do not contain a result payload. This change has been tested to confirm that it does not cause a regression in the no Write chunk and single Write chunk cases. Multiple Write chunks have not been tested. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30svcrdma: Support multiple write chunks when pulling upChuck Lever
When counting the number of SGEs needed to construct a Send request, do not count result payloads. And, when copying the Reply message into the pull-up buffer, result payloads are not to be copied to the Send buffer. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30svcrdma: Use parsed chunk lists to encode Reply transport headersChuck Lever
Refactor: Instead of re-parsing the ingress RPC Call transport header when constructing the egress RPC Reply transport header, use the new parsed Write list and Reply chunk, which are version- agnostic and already XDR decoded. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30svcrdma: Use parsed chunk lists to construct RDMA WritesChuck Lever
Refactor: Instead of re-parsing the ingress RPC Call transport header when constructing RDMA Writes, use the new parsed chunk lists for the Write list and Reply chunk, which are version-agnostic and already XDR-decoded. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30svcrdma: Use parsed chunk lists to detect reverse direction repliesChuck Lever
Refactor: Don't duplicate header decoding smarts here. Instead, use the new parsed chunk lists. Note that the XID sanity test is also removed. The XID is already looked up by the cb handler, and is rejected if it's not recognized. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30svcrdma: Use parsed chunk lists to derive the inv_rkeyChuck Lever
Refactor: Don't duplicate header decoding smarts here. Instead, use the new parsed chunk lists. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30svcrdma: Add a "parsed chunk list" data structureChuck Lever
This simple data structure binds the location of each data payload inside of an RPC message to the chunk that will be used to push it to or pull it from the client. There are several benefits to this small additional overhead: * It enables support for more than one chunk in incoming Read and Write lists. * It translates the version-specific on-the-wire format into a generic in-memory structure, enabling support for multiple versions of the RPC/RDMA transport protocol. * It enables the server to re-organize a chunk list if it needs to adjust where Read chunk data lands in server memory without altering the contents of the XDR-encoded Receive buffer. Construction of these lists is done while sanity checking each incoming RPC/RDMA header. Subsequent patches will make use of the generated data structures. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30svcrdma: Clean up svc_rdma_encode_reply_chunk()Chuck Lever
Refactor: Match the control flow of svc_rdma_encode_write_list(). Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30svcrdma: Post RDMA Writes while XDR encoding repliesChuck Lever
The only RPC/RDMA ordering requirement between RDMA Writes and RDMA Sends is that the responder must post the Writes on the Send queue before posting the Send that conveys the RPC Reply for that Write payload. The Linux NFS server implementation now has a transport method that can post result Payload Writes earlier than svc_rdma_sendto: ->xpo_result_payload() This gets RDMA Writes going earlier so they are more likely to be complete at the remote end before the Send completes. Some care must be taken with pulled-up Replies. We don't want to push the Write chunk and then send the same payload data via Send. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30NFSD: Invoke svc_encode_result_payload() in "read" NFSD encodersChuck Lever
Have the NFSD encoders annotate the boundaries of every direct-data-placement eligible result data payload. Then change svcrdma to use that annotation instead of the xdr->page_len when handling Write chunks. For NFSv4 on RDMA, that enables the ability to recognize multiple result payloads per compound. This is a pre-requisite for supporting multiple Write chunks per RPC transaction. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30SUNRPC: Rename svc_encode_read_payload()Chuck Lever
Clean up: "result payload" is a less confusing name for these payloads. "READ payload" reflects only the NFS usage. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30svcrdma: Refactor the RDMA Write pathChuck Lever
Refactor for subsequent changes. Constify the xdr_buf argument to ensure the code here does not modify it, and to enable callers to pass in a "const struct xdr_buf *". At the same time, rename the helper functions, which emit RDMA Writes, not RDMA Sends, and add documenting comments. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30svcrdma: Const-ify the xdr_buf argumentsChuck Lever
Clean up: Ensure the code in rw.c does not modify the argument, and enable callers to also use "const struct xdr_buf *". Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30SUNRPC: Adjust synopsis of xdr_buf_subsegment()Chuck Lever
Clean up: This enables xdr_buf_subsegment()'s callers to pass in a const pointer to that buffer. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30svcrdma: Catch another Reply chunk overflow caseChuck Lever
When space in the Reply chunk runs out in the middle of a segment, we end up passing a zero-length SGL to rdma_rw_ctx_init(), and it oopses. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-28Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfJakub Kicinski
Pablo Neira Ayuso says: ==================== Netfilter fixes for net 1) Fix insufficient validation of IPSET_ATTR_IPADDR_IPV6 reported by syzbot. 2) Remove spurious reports on nf_tables when lockdep gets disabled, from Florian Westphal. 3) Fix memleak in the error path of error path of ip_vs_control_net_init(), from Wang Hai. 4) Fix missing control data in flow dissector, otherwise IP address matching in hardware offload infra does not work. 5) Fix hardware offload match on prefix IP address when userspace does not send a bitwise expression to represent the prefix. * git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf: netfilter: nftables_offload: build mask based from the matching bytes netfilter: nftables_offload: set address type in control dissector ipvs: fix possible memory leak in ip_vs_control_net_init netfilter: nf_tables: avoid false-postive lockdep splat netfilter: ipset: prevent uninit-value in hash_ip6_add ==================== Link: https://lore.kernel.org/r/20201127190313.24947-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-28ipv4: Fix tos mask in inet_rtm_getroute()Guillaume Nault
When inet_rtm_getroute() was converted to use the RCU variants of ip_route_input() and ip_route_output_key(), the TOS parameters stopped being masked with IPTOS_RT_MASK before doing the route lookup. As a result, "ip route get" can return a different route than what would be used when sending real packets. For example: $ ip route add 192.0.2.11/32 dev eth0 $ ip route add unreachable 192.0.2.11/32 tos 2 $ ip route get 192.0.2.11 tos 2 RTNETLINK answers: No route to host But, packets with TOS 2 (ECT(0) if interpreted as an ECN bit) would actually be routed using the first route: $ ping -c 1 -Q 2 192.0.2.11 PING 192.0.2.11 (192.0.2.11) 56(84) bytes of data. 64 bytes from 192.0.2.11: icmp_seq=1 ttl=64 time=0.173 ms --- 192.0.2.11 ping statistics --- 1 packets transmitted, 1 received, 0% packet loss, time 0ms rtt min/avg/max/mdev = 0.173/0.173/0.173/0.000 ms This patch re-applies IPTOS_RT_MASK in inet_rtm_getroute(), to return results consistent with real route lookups. Fixes: 3765d35ed8b9 ("net: ipv4: Convert inet_rtm_getroute to rcu versions of route lookup") Signed-off-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/b2d237d08317ca55926add9654a48409ac1b8f5b.1606412894.git.gnault@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-28Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfJakub Kicinski
Daniel Borkmann says: ==================== pull-request: bpf 2020-11-28 1) Do not reference the skb for xsk's generic TX side since when looped back into RX it might crash in generic XDP, from Björn Töpel. 2) Fix umem cleanup on a partially set up xsk socket when being destroyed, from Magnus Karlsson. 3) Fix an incorrect netdev reference count when failing xsk_bind() operation, from Marek Majtyka. 4) Fix bpftool to set an error code on failed calloc() in build_btf_type_table(), from Zhen Lei. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: bpf: Add MAINTAINERS entry for BPF LSM bpftool: Fix error return value in build_btf_type_table net, xsk: Avoid taking multiple skbuff references xsk: Fix incorrect netdev reference count xsk: Fix umem cleanup bug at socket destruct MAINTAINERS: Update XDP and AF_XDP entries ==================== Link: https://lore.kernel.org/r/20201128005104.1205-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-28Merge tag 'batadv-net-pullrequest-20201127' of ↵Jakub Kicinski
git://git.open-mesh.org/linux-merge Simon Wunderlich says: ==================== Here are some batman-adv bugfixes: - Fix head/tailroom issues for fragments, by Sven Eckelmann (3 patches) * tag 'batadv-net-pullrequest-20201127' of git://git.open-mesh.org/linux-merge: batman-adv: Don't always reallocate the fragmentation skb head batman-adv: Reserve needed_*room for fragments batman-adv: Consider fragmentation for needed_headroom ==================== Link: https://lore.kernel.org/r/20201127173849.19208-1-sw@simonwunderlich.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-28netfilter: bridge: reset skb->pkt_type after NF_INET_POST_ROUTING traversalAntoine Tenart
Netfilter changes PACKET_OTHERHOST to PACKET_HOST before invoking the hooks as, while it's an expected value for a bridge, routing expects PACKET_HOST. The change is undone later on after hook traversal. This can be seen with pairs of functions updating skb>pkt_type and then reverting it to its original value: For hook NF_INET_PRE_ROUTING: setup_pre_routing / br_nf_pre_routing_finish For hook NF_INET_FORWARD: br_nf_forward_ip / br_nf_forward_finish But the third case where netfilter does this, for hook NF_INET_POST_ROUTING, the packet type is changed in br_nf_post_routing but never reverted. A comment says: /* We assume any code from br_dev_queue_push_xmit onwards doesn't care * about the value of skb->pkt_type. */ But when having a tunnel (say vxlan) attached to a bridge we have the following call trace: br_nf_pre_routing br_nf_pre_routing_ipv6 br_nf_pre_routing_finish br_nf_forward_ip br_nf_forward_finish br_nf_post_routing <- pkt_type is updated to PACKET_HOST br_nf_dev_queue_xmit <- but not reverted to its original value vxlan_xmit vxlan_xmit_one skb_tunnel_check_pmtu <- a check on pkt_type is performed In this specific case, this creates issues such as when an ICMPv6 PTB should be sent back. When CONFIG_BRIDGE_NETFILTER is enabled, the PTB isn't sent (as skb_tunnel_check_pmtu checks if pkt_type is PACKET_HOST and returns early). If the comment is right and no one cares about the value of skb->pkt_type after br_dev_queue_push_xmit (which isn't true), resetting it to its original value should be safe. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Antoine Tenart <atenart@kernel.org> Reviewed-by: Florian Westphal <fw@strlen.de> Link: https://lore.kernel.org/r/20201123174902.622102-1-atenart@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-28net/sched: act_ct: enable stats for HW offloaded entriesMarcelo Ricardo Leitner
By setting NF_FLOWTABLE_COUNTER. Otherwise, the updates added by commit ef803b3cf96a ("netfilter: flowtable: add counter support in HW offload") are not effective when using act_ct. While at it, now that we have the flag set, protect the call to nf_ct_acct_update() by commit beb97d3a3192 ("net/sched: act_ct: update nf_conn_acct for act_ct SW offload in flowtable") with the check on NF_FLOWTABLE_COUNTER, as also done on other places. Note that this shouldn't impact performance as these stats are only enabled when net.netfilter.nf_conntrack_acct is enabled. Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: wenxu <wenxu@ucloud.cn> Acked-by: Pablo Neira Ayuso <pablo@netfilter.org> Link: https://lore.kernel.org/r/481a65741261fd81b0a0813e698af163477467ec.1606415787.git.marcelo.leitner@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-27Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Trivial conflict in CAN, keep the net-next + the byteswap wrapper. Conflicts: drivers/net/can/usb/gs_usb.c Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-27tipc: update address terminology in codeJon Maloy
We update the terminology in the code so that deprecated structure names and macros are replaced with those currently recommended in the user API. struct tipc_portid -> struct tipc_socket_addr struct tipc_name -> struct tipc_service_addr struct tipc_name_seq -> struct tipc_service_range TIPC_ADDR_ID -> TIPC_SOCKET_ADDR TIPC_ADDR_NAME -> TIPC_SERVICE_ADDR TIPC_ADDR_NAMESEQ -> TIPC_SERVICE_RANGE TIPC_CFG_SRV -> TIPC_NODE_STATE Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-27tipc: make node number calculation reproducibleJon Maloy
The 32-bit node number, aka node hash or node address, is calculated based on the 128-bit node identity when it is not set explicitly by the user. In future commits we will need to perform this hash operation on peer nodes while feeling safe that we obtain the same result. We do this by interpreting the initial hash as a network byte order number. Whenever we need to use the number locally on a node we must therefore translate it to host byte order to obtain an architecure independent result. Furthermore, given the context where we use this number, we must not allow it to be zero unless the node identity also is zero. Hence, in the rare cases when the xor-ed hash value may end up as zero we replace it with a fix number, knowing that the code anyway is capable of handling hash collisions. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-27tipc: refactor tipc_sk_bind() functionJon Maloy
We refactor the tipc_sk_bind() function, so that the lock handling is handled separately from the logics. We also move some sanity tests to earlier in the call chain, to the function tipc_bind(). Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-27net/x25: remove x25_kill_by_device()Martin Schiller
Remove obsolete function x25_kill_by_device(). It's not used any more. Signed-off-by: Martin Schiller <ms@dev.tdt.de> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-27net/x25: fix restart request/confirm handlingMartin Schiller
We have to take the actual link state into account to handle restart requests/confirms well. Signed-off-by: Martin Schiller <ms@dev.tdt.de> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-27net/lapb: fix t1 timer handling for LAPB_STATE_0Martin Schiller
1. DTE interface changes immediately to LAPB_STATE_1 and start sending SABM(E). 2. DCE interface sends N2-times DM and changes to LAPB_STATE_1 afterwards if there is no response in the meantime. Signed-off-by: Martin Schiller <ms@dev.tdt.de> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-27net/lapb: support netdev eventsMartin Schiller
This patch allows layer2 (LAPB) to react to netdev events itself and avoids the detour via layer3 (X.25). 1. Establish layer2 on NETDEV_UP events, if the carrier is already up. 2. Call lapb_disconnect_request() on NETDEV_GOING_DOWN events to signal the peer that the connection will go down. (Only when the carrier is up.) 3. When a NETDEV_DOWN event occur, clear all queues, enter state LAPB_STATE_0 and stop all timers. 4. The NETDEV_CHANGE event makes it possible to handle carrier loss and detection. In case of Carrier Loss, clear all queues, enter state LAPB_STATE_0 and stop all timers. In case of Carrier Detection, we start timer t1 on a DCE interface, and on a DTE interface we change to state LAPB_STATE_1 and start sending SABM(E). Signed-off-by: Martin Schiller <ms@dev.tdt.de> Acked-by: Xie He <xie.he.0141@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-27net/x25: handle additional netdev eventsMartin Schiller
1. Add / remove x25_link_device by NETDEV_REGISTER/UNREGISTER and also by NETDEV_POST_TYPE_CHANGE/NETDEV_PRE_TYPE_CHANGE. This change is needed so that the x25_neigh struct for an interface is already created when it shows up and is kept independently if the interface goes UP or DOWN. This is used in an upcomming commit, where x25 params of an neighbour will get configurable through ioctls. 2. NETDEV_CHANGE event makes it possible to handle carrier loss and detection. If carrier is lost, clean up everything related to this neighbour by calling x25_link_terminated(). 3. Also call x25_link_terminated() for NETDEV_DOWN events and remove the call to x25_clear_forward_by_dev() in x25_route_device_down(), as this is already called by x25_kill_by_neigh() which gets called by x25_link_terminated(). 4. Do nothing for NETDEV_UP and NETDEV_GOING_DOWN events, as these will be handled in layer 2 (LAPB) and layer3 (X.25) will be informed by layer2 when layer2 link is established and layer3 link should be initiated. Signed-off-by: Martin Schiller <ms@dev.tdt.de> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-27net/sched: sch_frag: add generic packet fragment support.wenxu
Currently kernel tc subsystem can do conntrack in cat_ct. But when several fragment packets go through the act_ct, function tcf_ct_handle_fragments will defrag the packets to a big one. But the last action will redirect mirred to a device which maybe lead the reassembly big packet over the mtu of target device. This patch add support for a xmit hook to mirred, that gets executed before xmiting the packet. Then, when act_ct gets loaded, it configs that hook. The frag xmit hook maybe reused by other modules. Signed-off-by: wenxu <wenxu@ucloud.cn> Acked-by: Cong Wang <cong.wang@bytedance.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-27net/sched: act_mirred: refactor the handle of xmitwenxu
This one is prepare for the next patch. Signed-off-by: wenxu <wenxu@ucloud.cn> Signed-off-by: Jakub Kicinski <kuba@kernel.org>