linux.git - Linus' kernel tree

Age	Commit message (Collapse)	Author
2020-07-04	netfilter: nf_tables: expose enum nft_chain_flags through UAPI	Pablo Neira Ayuso
	This enum definition was never exposed through UAPI. Rename NFT_BASE_CHAIN to NFT_CHAIN_BASE for consistency. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-07-04	netfilter: nf_tables: add NFTA_VERDICT_CHAIN_ID attribute	Pablo Neira Ayuso
	This netlink attribute allows you to identify the chain to jump/goto by means of the chain ID. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-07-04	netfilter: nf_tables: add NFTA_RULE_CHAIN_ID attribute	Pablo Neira Ayuso
	This new netlink attribute allows you to add rules to chains by the chain ID. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-07-04	netfilter: nf_tables: add NFTA_CHAIN_ID attribute	Pablo Neira Ayuso
	This netlink attribute allows you to refer to chains inside a transaction as an alternative to the name and the handle. The chain binding support requires this new chain ID approach. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-07-04	ipvs: allow connection reuse for unconfirmed conntrack	Julian Anastasov
	YangYuxi is reporting that connection reuse is causing one-second delay when SYN hits existing connection in TIME_WAIT state. Such delay was added to give time to expire both the IPVS connection and the corresponding conntrack. This was considered a rare case at that time but it is causing problem for some environments such as Kubernetes. As nf_conntrack_tcp_packet() can decide to release the conntrack in TIME_WAIT state and to replace it with a fresh NEW conntrack, we can use this to allow rescheduling just by tuning our check: if the conntrack is confirmed we can not schedule it to different real server and the one-second delay still applies but if new conntrack was created, we are free to select new real server without any delays. YangYuxi lists some of the problem reports: - One second connection delay in masquerading mode: https://marc.info/?t=151683118100004&r=1&w=2 - IPVS low throughput #70747 https://github.com/kubernetes/kubernetes/issues/70747 - Apache Bench can fill up ipvs service proxy in seconds #544 https://github.com/cloudnativelabs/kube-router/issues/544 - Additional 1s latency in `host -> service IP -> pod` https://github.com/kubernetes/kubernetes/issues/90854 Fixes: f719e3754ee2 ("ipvs: drop first packet to redirect conntrack") Co-developed-by: YangYuxi <yx.atom1@gmail.com> Signed-off-by: YangYuxi <yx.atom1@gmail.com> Signed-off-by: Julian Anastasov <ja@ssi.bg> Reviewed-by: Simon Horman <horms@verge.net.au> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-07-01	ipvs: avoid expiring many connections from timer	Julian Anastasov
	Add new functions ip_vs_conn_del() and ip_vs_conn_del_put() to release many IPVS connections in process context. They are suitable for connections found in table when we do not want to overload the timers. Currently, the change is useful for the dropentry delayed work but it will be used also in following patch when flushing connections to failed destinations. Signed-off-by: Julian Anastasov <ja@ssi.bg> Reviewed-by: Simon Horman <horms@verge.net.au> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-06-30	ipvs: register hooks only with services	Julian Anastasov
	Keep the IPVS hooks registered in Netfilter only while there are configured virtual services. This saves CPU cycles while IPVS is loaded but not used. Signed-off-by: Julian Anastasov <ja@ssi.bg> Reviewed-by: Simon Horman <horms@verge.net.au> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-06-30	netfilter: nft_set_pipapo: Drop useless assignment of scratch map index on ↵	Stefano Brivio
	insert In nft_pipapo_insert(), we need to reallocate scratch maps that will be used for matching by lookup functions, if they have never been allocated or if the bucket size changes as a result of the insertion. As pipapo_realloc_scratch() provides a pair of fresh, zeroed out maps, there's no need to select a particular one after reallocation. Other than being useless, the existing assignment was also troubled by the fact that the index was set only on the CPU performing the actual insertion, as spotted by Florian. Simply drop the assignment. Reported-by: Florian Westphal <fw@strlen.de> Fixes: 3c4287f62044 ("nf_tables: Add set type for arbitrary concatenation of ranges") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-06-30	netfilter: introduce support for reject at prerouting stage	Laura Garcia Liebana
	REJECT statement can be only used in INPUT, FORWARD and OUTPUT chains. This patch adds support of REJECT, both icmp and tcp reset, at PREROUTING stage. The need for this patch comes from the requirement of some forwarding devices to reject traffic before the natting and routing decisions. The main use case is to be able to send a graceful termination to legitimate clients that, under any circumstances, the NATed endpoints are not available. This option allows clients to decide either to perform a reconnection or manage the error in their side, instead of just dropping the connection and let them die due to timeout. It is supported ipv4, ipv6 and inet families for nft infrastructure. Signed-off-by: Laura Garcia Liebana <nevola@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-06-29	Merge branch 'dpaa2-eth-send-a-scatter-gather-FD-instead-of-realloc-ing'	David S. Miller
	Ioana Ciornei says: ==================== dpaa2-eth: send a scatter-gather FD instead of realloc-ing This patch set changes the behaviour in case the Tx path is confroted with an SKB with insufficient headroom for our hardware necessities (SW annotation area). In the first patch, instead of realloc-ing the SKB we now send a S/G frames descriptor while the second one adds a new software held counter to account for for these types of frames. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-29	dpaa2-eth: add software counter for Tx frames converted to S/G	Ioana Ciornei
	With the previous commit, in case of insufficient SKB headroom on the Tx path instead of reallocing the SKB we now send a S/G frame descriptor. Export the number of occurences of this case as a per CPU counter (in debugfs) and a total number in the ethtool statistics - "tx converted sg frames'. Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-29	dpaa2-eth: send a scatter-gather FD instead of realloc-ing	Ioana Ciornei
	Instead of realloc-ing the skb on the Tx path when the provided headroom is smaller than the HW requirements, create a Scatter/Gather frame descriptor with only one entry. Remove the '[drv] tx realloc frames' counter exposed previously through ethtool since it is no longer used. Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-29	Merge branch 'sfc-prerequisites-for-EF100-driver-part-1'	David S. Miller
	Edward Cree says: ==================== sfc: prerequisites for EF100 driver, part 1 This continues the work started by Alex Maftei <amaftei@solarflare.com> in the series "sfc: code refactoring", "sfc: more code refactoring", "sfc: even more code refactoring" and "sfc: refactor mcdi filtering code", to prepare for a new driver which will share much of the code to support the new EF100 family of Solarflare/Xilinx NICs. After this series, there will be approximately two more of these 'prerequisites' series, followed by the sfc_ef100 driver itself. v2: fix reverse xmas tree in patch 5. (Left the cases in patches 7, 9 and 14 alone as those are all in pure movement of existing code.) ==================== Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-29	sfc: extend common GRO interface to support CHECKSUM_COMPLETE	Edward Cree
	EF100 will use CHECKSUM_COMPLETE, but will also make use of efx_rx_packet_gro(), thus needs to be able to pass the checksum value into that function. Drivers for older NICs pass in a csum of 0 to get the old semantics (use the RX flags for CHECKSUM_UNNECESSARY marking). Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-29	sfc: commonise ARFS handling	Edward Cree
	EF100 will use the same approach to ARFS as EF10. Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-29	sfc: commonise drain event handling	Edward Cree
	Avoids a call from generic MCDI code into ef10.c. Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-29	sfc: commonise PCI error handlers	Edward Cree
	EF100 will use the same mechanisms for PCI error recovery. Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-29	sfc: track which BAR is mapped	Edward Cree
	EF100 needs to map multiple BARs (sequentially, not concurrently) in order to read the Function Control Window during probe. Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-29	sfc: commonise FC advertising	Edward Cree
	Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-29	sfc: commonise other ethtool bits	Edward Cree
	A few more ethtool handlers which EF100 will share. Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-29	sfc: commonise ethtool NFC and RXFH/RSS functions	Edward Cree
	EF100 will share EF10's model of filtering, hashing and spreading. Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-29	sfc: commonise ethtool link handling functions	Edward Cree
	Link speeds, FEC, and autonegotiation are all things EF100 will share. Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-29	sfc: split up nic.h	Edward Cree
	The new nic_common.h contains the inlines for NIC-type function dispatch, declarations for NIC-generic functions in nic.c, and other similar NIC- generic functionality. Retained in nic.h are NIC-specific declarations such as the siena and ef10 nic_data structs and various farch functions. The EF100 driver will thus include nic_common.h but not nic.h. Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-29	sfc: refactor EF10 stats handling	Edward Cree
	Separate the generation-count handling from the format conversion, to make it easier to re-use both for EF100. Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-29	sfc: don't try to create more channels than we can have VIs	Edward Cree
	Calculate efx->max_vis at probe time, and check against it in efx_allocate_msix_channels() when considering whether to create XDP TX channels. Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-29	sfc: extend bitfield macros up to POPULATE_DWORD_13	Edward Cree
	Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-29	sfc: determine flag word automatically in efx_has_cap()	Edward Cree
	Now that we have an _OFST definition for each individual flag bit, callers of efx_has_cap() don't need to specify which flag word it's in; we can just use the flag name directly in MCDI_CAPABILITY_OFST. Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-29	sfc: update MCDI protocol headers	Edward Cree
	The script used to generate these now includes _OFST definitions for flags, to identify the containing flag word. Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-29	net:qos: police action offloading parameter 'burst' change to the original value	Po Liu
	Since 'tcfp_burst' with TICK factor, driver side always need to recover it to the original value, this patch moves the generic calculation and recover to the 'burst' original value before offloading to device driver. Signed-off-by: Po Liu <po.liu@nxp.com> Acked-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-29	Merge branch 'MPTCP-improve-fallback-to-TCP'	David S. Miller
	Davide Caratti says: ==================== MPTCP: improve fallback to TCP there are situations where MPTCP sockets should fall-back to regular TCP: this series reworks the fallback code to pursue the following goals: 1) cleanup the non fallback code, removing most of 'if (<fallback>)' in the data path 2) improve performance for non-fallback sockets, avoiding locks in poll() further work will also leverage on this changes to achieve: a) more consistent behavior of gestockopt()/setsockopt() on passive sockets after fallback b) support for "infinite maps" as per RFC8684, section 3.7 the series is made of the following items: - patch 1 lets sendmsg() / recvmsg() / poll() use the main socket also after fallback - patch 2 fixes 'simultaneous connect' scenario after fallback. The problem was present also before the rework, but the fix is much easier to implement after patch 1 - patch 3, 4, 5 are clean-ups for code that is no more needed after the fallback rework - patch 6 fixes a race condition between close() and poll(). The problem was theoretically present before the rework, but it became almost systematic after patch 1 ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-29	mptcp: close poll() races	Paolo Abeni
	mptcp_poll always return POLLOUT for unblocking connect(), ensure that the socket is a suitable state. The MPTCP_DATA_READY bit is never cleared on accept: ensure we don't leave mptcp_accept() with an empty accept queue and such bit set. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-29	mptcp: __mptcp_tcp_fallback() returns a struct sock	Paolo Abeni
	Currently __mptcp_tcp_fallback() always return NULL on incoming connections, because MPTCP does not create the additional socket for the first subflow. Since the previous commit no __mptcp_tcp_fallback() caller needs a struct socket, so let __mptcp_tcp_fallback() return the first subflow sock and cope correctly even with incoming connections. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-29	mptcp: create first subflow at msk creation time	Paolo Abeni
	This cleans the code a bit and makes the behavior more consistent. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-29	mptcp: check for plain TCP sock at accept time	Paolo Abeni
	This cleanup the code a bit and avoid corrupted states on weird syscall sequence (accept(), connect()). Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-29	mptcp: fallback in case of simultaneous connect	Davide Caratti
	when a MPTCP client tries to connect to itself, tcp_finish_connect() is never reached. Because of this, depending on the socket current state, multiple faulty behaviours can be observed: 1) a WARN_ON() in subflow_data_ready() is hit WARNING: CPU: 2 PID: 882 at net/mptcp/subflow.c:911 subflow_data_ready+0x18b/0x230 [...] CPU: 2 PID: 882 Comm: gh35 Not tainted 5.7.0+ #187 [...] RIP: 0010:subflow_data_ready+0x18b/0x230 [...] Call Trace: tcp_data_queue+0xd2f/0x4250 tcp_rcv_state_process+0xb1c/0x49d3 tcp_v4_do_rcv+0x2bc/0x790 __release_sock+0x153/0x2d0 release_sock+0x4f/0x170 mptcp_shutdown+0x167/0x4e0 __sys_shutdown+0xe6/0x180 __x64_sys_shutdown+0x50/0x70 do_syscall_64+0x9a/0x370 entry_SYSCALL_64_after_hwframe+0x44/0xa9 2) client is stuck forever in mptcp_sendmsg() because the socket is not TCP_ESTABLISHED crash> bt 4847 PID: 4847 TASK: ffff88814b2fb100 CPU: 1 COMMAND: "gh35" #0 [ffff8881376ff680] __schedule at ffffffff97248da4 #1 [ffff8881376ff778] schedule at ffffffff9724a34f #2 [ffff8881376ff7a0] schedule_timeout at ffffffff97252ba0 #3 [ffff8881376ff8a8] wait_woken at ffffffff958ab4ba #4 [ffff8881376ff940] sk_stream_wait_connect at ffffffff96c2d859 #5 [ffff8881376ffa28] mptcp_sendmsg at ffffffff97207fca #6 [ffff8881376ffbc0] sock_sendmsg at ffffffff96be1b5b #7 [ffff8881376ffbe8] sock_write_iter at ffffffff96be1daa #8 [ffff8881376ffce8] new_sync_write at ffffffff95e5cb52 #9 [ffff8881376ffe50] vfs_write at ffffffff95e6547f #10 [ffff8881376ffe90] ksys_write at ffffffff95e65d26 #11 [ffff8881376fff28] do_syscall_64 at ffffffff956088ba #12 [ffff8881376fff50] entry_SYSCALL_64_after_hwframe at ffffffff9740008c RIP: 00007f126f6956ed RSP: 00007ffc2a320278 RFLAGS: 00000217 RAX: ffffffffffffffda RBX: 0000000020000044 RCX: 00007f126f6956ed RDX: 0000000000000004 RSI: 00000000004007b8 RDI: 0000000000000003 RBP: 00007ffc2a3202a0 R8: 0000000000400720 R9: 0000000000400720 R10: 0000000000400720 R11: 0000000000000217 R12: 00000000004004b0 R13: 00007ffc2a320380 R14: 0000000000000000 R15: 0000000000000000 ORIG_RAX: 0000000000000001 CS: 0033 SS: 002b 3) tcpdump captures show that DSS is exchanged even when MP_CAPABLE handshake didn't complete. $ tcpdump -tnnr bad.pcap IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [S], seq 3208913911, win 65483, options [mss 65495,sackOK,TS val 3291706876 ecr 3291694721,nop,wscale 7,mptcp capable v1], length 0 IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [S.], seq 3208913911, ack 3208913912, win 65483, options [mss 65495,sackOK,TS val 3291706876 ecr 3291706876,nop,wscale 7,mptcp capable v1], length 0 IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [.], ack 1, win 512, options [nop,nop,TS val 3291706876 ecr 3291706876], length 0 IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [F.], seq 1, ack 1, win 512, options [nop,nop,TS val 3291707876 ecr 3291706876,mptcp dss fin seq 0 subseq 0 len 1,nop,nop], length 0 IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [.], ack 2, win 512, options [nop,nop,TS val 3291707876 ecr 3291707876], length 0 force a fallback to TCP in these cases, and adjust the main socket state to avoid hanging in mptcp_sendmsg(). Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/35 Reported-by: Christoph Paasch <cpaasch@apple.com> Suggested-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-29	net: mptcp: improve fallback to TCP	Davide Caratti
	Keep using MPTCP sockets and a use "dummy mapping" in case of fallback to regular TCP. When fallback is triggered, skip addition of the MPTCP option on send. Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/11 Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/22 Co-developed-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-29	net: phy: marvell10g: support XFI rate matching mode	Baruch Siach
	When the hardware MACTYPE hardware configuration pins are set to "XFI with Rate Matching" the PHY interface operate at fixed 10Gbps speed. The MAC buffer packets in both directions to match various wire speeds. Read the MAC Type field in the Port Control register, and set the MAC interface speed accordingly. Signed-off-by: Baruch Siach <baruch@tkos.co.il> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-29	Merge tag 'mlx5-tls-2020-06-26' of ↵	David S. Miller
	git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux Saeed Mahameed says: ==================== mlx5-tls-2020-06-26 1) Improve hardware layouts and structure for kTLS support 2) Generalize ICOSQ (Internal Channel Operations Send Queue) Due to the asynchronous nature of adding new kTLS flows and handling HW asynchronous kTLS resync requests, the XSK ICOSQ was extended to support generic async operations, such as kTLS add flow and resync, in addition to the existing XSK usages. 3) kTLS hardware flow steering and classification: The driver already has the means to classify TCP ipv4/6 flows to send them to the corresponding RSS HW engine, as reflected in patches 3 through 5, the series will add a steering layer that will hook to the driver's TCP classifiers and will match on well known kTLS connection, in case of a match traffic will be redirected to the kTLS decryption engine, otherwise traffic will continue flowing normally to the TCP RSS engine. 3) kTLS add flow RX HW offload support New offload contexts post their static/progress params WQEs (Work Queue Element) to communicate the newly added kTLS contexts over the per-channel async ICOSQ. The Channel/RQ is selected according to the socket's rxq index. A new TLS-RX workqueue is used to allow asynchronous addition of steering rules, out of the NAPI context. It will be also used in a downstream patch in the resync procedure. Feature is OFF by default. Can be turned on by: $ ethtool -K <if> tls-hw-rx-offload on 4) Added mlx5 kTLS sw stats and new counters are documented in Documentation/networking/tls-offload.rst rx_tls_ctx - number of TLS RX HW offload contexts added to device for decryption. rx_tls_ooo - number of RX packets which were part of a TLS stream but did not arrive in the expected order and triggered the resync procedure. rx_tls_del - number of TLS RX HW offload contexts deleted from device (connection has finished). rx_tls_err - number of RX packets which were part of a TLS stream but were not decrypted due to unexpected error in the state machine. 5) Asynchronous RX resync a. The NIC driver indicates that it would like to resync on some TLS record within the received packet (P), but the driver does not know (yet) which of the TLS records within the packet. At this stage, the NIC driver will query the device to find the exact TCP sequence for resync (tcpsn), however, the driver does not wait for the device to provide the response. b. Eventually, the device responds, and the driver provides the tcpsn within the resync packet to KTLS. Now, KTLS can check the tcpsn against any processed TLS records within packet P, and also against any record that is processed in the future within packet P. The asynchronous resync path simplifies the device driver, as it can save bits on the packet completion (32-bit TCP sequence), and pass this information on an asynchronous command instead. Performance: CPU: Intel(R) Xeon(R) CPU E5-2687W v4 @ 3.00GHz, 24 cores, HT off NIC: ConnectX-6 Dx 100GbE dual port Goodput (app-layer throughput) comparison: +---------------+-------+-------+---------+ \| # connections \| 1 \| 4 \| 8 \| +---------------+-------+-------+---------+ \| SW (Gbps) \| 7.26 \| 24.70 \| 50.30 \| +---------------+-------+-------+---------+ \| HW (Gbps) \| 18.50 \| 64.30 \| 92.90 \| +---------------+-------+-------+---------+ \| Speedup \| 2.55x \| 2.56x \| 1.85x * \| +---------------+-------+-------+---------+ * After linerate is reached, diff is observed in CPU util ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-29	Merge branch 'TC-Introduce-qevents'	David S. Miller
	Petr Machata says: ==================== TC: Introduce qevents The Spectrum hardware allows execution of one of several actions as a result of queue management decisions: tail-dropping, early-dropping, marking a packet, or passing a configured latency threshold or buffer size. Such packets can be mirrored, trapped, or sampled. Modeling the action to be taken as simply a TC action is very attractive, but it is not obvious where to put these actions. At least with ECN marking one could imagine a tree of qdiscs and classifiers that effectively accomplishes this task, albeit in an impractically complex manner. But there is just no way to match on dropped-ness of a packet, let alone dropped-ness due to a particular reason. To allow configuring user-defined actions as a result of inner workings of a qdisc, this patch set introduces a concept of qevents. Those are attach points for TC blocks, where filters can be put that are executed as the packet hits well-defined points in the qdisc algorithms. The attached blocks can be shared, in a manner similar to clsact ingress and egress blocks, arbitrary classifiers with arbitrary actions can be put on them, etc. For example: red limit 500K avpkt 1K qevent early_drop block 10 matchall action mirred egress mirror dev eth1 The central patch #2 introduces several helpers to allow easy and uniform addition of qevents to qdiscs: initialization, destruction, qevent block number change validation, and qevent handling, i.e. dispatch of the filters attached to the block bound to a qevent. Patch #1 adds root_lock argument to qdisc enqueue op. The problem this is tackling is that if a qevent filter pushes packets to the same qdisc tree that holds the qevent in the first place, attempt to take qdisc root lock for the second time will lead to a deadlock. To solve the issue, qevent handler needs to unlock and relock the root lock around the filter processing. Passing root_lock around makes it possible to get the lock where it is needed, and visibly so, such that it is obvious the lock will be used when invoking a qevent. The following two patches, #3 and #4, then add two qevents to the RED qdisc: "early_drop" qevent fires when a packet is early-dropped; "mark" qevent, when it is ECN-marked. Patch #5 contains a selftest. I have mentioned this test when pushing the RED ECN nodrop mode and said that "I have no confidence in its portability to [...] different configurations". That still holds. The backlog and packet size are tuned to make the test deterministic. But it is better than nothing, and on the boxes that I ran it on it does work and shows that qevents work the way they are supposed to, and that their addition has not broken the other tested features. This patch set does not deal with offloading. The idea there is that a driver will be able to figure out that a given block is used in qevent context by looking at binder type. A future patch-set will add a qdisc pointer to struct flow_block_offload, which a driver will be able to consult to glean the TC or other relevant attributes. Changes from RFC to v1: - Move a "q = qdisc_priv(sch)" from patch #3 to patch #4 - Fix deadlock caused by mirroring packet back to the same qdisc tree. - Rename "tail" qevent to "tail_drop". - Adapt to the new 100-column standard. - Add a selftest ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-29	selftests: forwarding: Add a RED test for SW datapath	Petr Machata
	This test is inspired by the mlxsw RED selftest. It is much simpler to set up (also because there is no point in testing PRIO / RED encapsulation). It tests bare RED, ECN and ECN+nodrop modes of operation. On top of that it tests RED early_drop and mark qevents. Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-29	net: sched: sch_red: Add qevents "early_drop" and "mark"	Petr Machata
	In order to allow acting on dropped and/or ECN-marked packets, add two new qevents to the RED qdisc: "early_drop" and "mark". Filters attached at "early_drop" block are executed as packets are early-dropped, those attached at the "mark" block are executed as packets are ECN-marked. Two new attributes are introduced: TCA_RED_EARLY_DROP_BLOCK with the block index for the "early_drop" qevent, and TCA_RED_MARK_BLOCK for the "mark" qevent. Absence of these attributes signifies "don't care": no block is allocated in that case, or the existing blocks are left intact in case of the change callback. For purposes of offloading, blocks attached to these qevents appear with newly-introduced binder types, FLOW_BLOCK_BINDER_TYPE_RED_EARLY_DROP and FLOW_BLOCK_BINDER_TYPE_RED_MARK. Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-29	net: sched: sch_red: Split init and change callbacks	Petr Machata
	In the following patches, RED will get two qevents. The implementation will be clearer if the callback for change is not a pure subset of the callback for init. Split the two and promote attribute parsing to the callbacks themselves from the common code, because it will be handy there. Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-29	net: sched: Introduce helpers for qevent blocks	Petr Machata
	Qevents are attach points for TC blocks, where filters can be put that are executed when "interesting events" take place in a qdisc. The data to keep and the functions to invoke to maintain a qevent will be largely the same between qevents. Therefore introduce sched-wide helpers for qevent management. Currently, similarly to ingress and egress blocks of clsact pseudo-qdisc, blocks attachment cannot be changed after the qdisc is created. To that end, add a helper tcf_qevent_validate_change(), which verifies whether block index attribute is not attached, or if it is, whether its value matches the current one (i.e. there is no material change). The function tcf_qevent_handle() should be invoked when qdisc hits the "interesting event" corresponding to a block. This function releases root lock for the duration of executing the attached filters, to allow packets generated through user actions (notably mirred) to be reinserted to the same qdisc tree. Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-29	net: sched: Pass root lock to Qdisc_ops.enqueue	Petr Machata
	A following patch introduces qevents, points in qdisc algorithm where packet can be processed by user-defined filters. Should this processing lead to a situation where a new packet is to be enqueued on the same port, holding the root lock would lead to deadlocks. To solve the issue, qevent handler needs to unlock and relock the root lock when necessary. To that end, add the root lock argument to the qdisc op enqueue, and propagate throughout. Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-29	Merge branch 'net-ethernet-ti-am65-cpsw-update-and-enable-sr2-0-soc'	David S. Miller
	Grygorii Strashko says: ==================== net: ethernet: ti: am65-cpsw: update and enable sr2.0 soc This series contains set of improvements for TI AM654x/J721E CPSW2G driver and adds support for TI AM654x SR2.0 SoC. Patch 1: adds vlans restoration after "if down/up" Patches 2-5: improvments Patch 6: adds support for TI AM654x SR2.0 SoC which allows to disable errata i2027 W/A. By default, errata i2027 W/A (TX csum offload disabled) is enabled on AM654x SoC for backward compatibility, unless SR2.0 SoC is identified using SOC BUS framework. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-29	net: ethernet: ti: am65-cpsw-nuss: enable am65x sr2.0 support	Grygorii Strashko
	The AM65x SR2.0 MCU CPSW has fixed errata i2027 "CPSW: CPSW Does Not Support CPPI Receive Checksum (Host to Ethernet) Offload Feature". This errata also fixed for J271E SoC. Use SOC bus data for K3 SoC identification and apply i2027 errata w/a only for the AM65x SR1.0 SoC. Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-29	net: ethernet: ti: am65-cpsw-ethtool: configured critical setting only when ↵	Grygorii Strashko
	no running netdevs Ensure that critical setting can only be configured when there are no running netdevs - all ports are down. Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-29	net: ethernet: ti: am65-cpsw-ethtool: skip hw cfg when change p0-rx-ptype-rrobin	Grygorii Strashko
	Skip HW configuration when p0-rx-ptype-rrobin is changed as it will be done by .ndev_open(), Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-29	net: ethernet: ti: am65-cpsw-nuss: fix ports mac sl initialization	Grygorii Strashko
	The MAC SL has to be initialized for each port otherwise am65_cpsw_nuss_slave_disable_unused() will crash for disabled ports. Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-29	net: ethernet: ti: am65-cpsw: move to pf_p0_rx_ptype_rrobin init in probe	Grygorii Strashko
	The pf_p0_rx_ptype_rrobin is global parameter so move its initialization in probe. Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>