tcp: get rid of sysctl_tcp_adv_win_scale

With modern NIC drivers shifting to full page allocations per received frame, we face the following issue: TCP has one per-netns sysctl used to tweak how to translate a memory use into an expected payload (RWIN), in RX path. tcp_win_from_space() implementation is limited to few cases. For hosts dealing with various MSS, we either under estimate or over estimate the RWIN we send to the remote peers. For instance with the default sysctl_tcp_adv_win_scale value, we expect to store 50% of payload per allocated chunk of memory. For the typical use of MTU=1500 traffic, and order-0 pages allocations by NIC drivers, we are sending too big RWIN, leading to potential tcp collapse operations, which are extremely expensive and source of latency spikes. This patch makes sysctl_tcp_adv_win_scale obsolete, and instead uses a per socket scaling factor, so that we can precisely adjust the RWIN based on effective skb->len/skb->truesize ratio. This patch alone can double TCP receive performance when receivers are too slow to drain their receive queue, or by allowing a bigger RWIN when MSS is close to PAGE_SIZE. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Link: https://lore.kernel.org/r/20230717152917.751987-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
author: Eric Dumazet <edumazet@google.com> 2023-07-17 15:29:17 +0000
committer: Jakub Kicinski <kuba@kernel.org> 2023-07-18 18:41:18 -0700
commit: dfa2f0483360d4d6f2324405464c9f281156bd87 (patch)
tree: 5aa4cb885109efde347b21e9669be75ff8175bf9 /net/ipv4/tcp_input.c
parent: 63c8778d9149d5df86e3a5dd192103d42d74cc93 (diff)
1 files changed, 12 insertions, 7 deletions
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 57c8af1859c1..3cd92035e090 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -237,6 +237,16 @@ static void tcp_measure_rcv_mss(struct sock *sk, const struct sk_buff *skb)
 	 */
 	len = skb_shinfo(skb)->gso_size ? : skb->len;
 	if (len >= icsk->icsk_ack.rcv_mss) {
+		/* Note: divides are still a bit expensive.
+		 * For the moment, only adjust scaling_ratio
+		 * when we update icsk_ack.rcv_mss.
+		 */
+		if (unlikely(len != icsk->icsk_ack.rcv_mss)) {
+			u64 val = (u64)skb->len << TCP_RMEM_TO_WIN_SCALE;
+
+			do_div(val, skb->truesize);
+			tcp_sk(sk)->scaling_ratio = val ? val : 1;
+		}
 		icsk->icsk_ack.rcv_mss = min_t(unsigned int, len,
 					       tcp_sk(sk)->advmss);
 		/* Account for possibly-removed options */
@@ -727,8 +737,8 @@ void tcp_rcv_space_adjust(struct sock *sk)
 
 	if (READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_moderate_rcvbuf) &&
 	    !(sk->sk_userlocks & SOCK_RCVBUF_LOCK)) {
-		int rcvmem, rcvbuf;
 		u64 rcvwin, grow;
+		int rcvbuf;
 
 		/* minimal window to cope with packet losses, assuming
 		 * steady state. Add some cushion because of small variations.
@@ -740,12 +750,7 @@ void tcp_rcv_space_adjust(struct sock *sk)
 		do_div(grow, tp->rcvq_space.space);
 		rcvwin += (grow << 1);
 
-		rcvmem = SKB_TRUESIZE(tp->advmss + MAX_TCP_HEADER);
-		while (tcp_win_from_space(sk, rcvmem) < tp->advmss)
-			rcvmem += 128;
-
-		do_div(rcvwin, tp->advmss);
-		rcvbuf = min_t(u64, rcvwin * rcvmem,
+		rcvbuf = min_t(u64, tcp_space_from_win(sk, rcvwin),
 			       READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_rmem[2]));
 		if (rcvbuf > sk->sk_rcvbuf) {
 			WRITE_ONCE(sk->sk_rcvbuf, rcvbuf);
author	Eric Dumazet <edumazet@google.com>	2023-07-17 15:29:17 +0000
committer	Jakub Kicinski <kuba@kernel.org>	2023-07-18 18:41:18 -0700
commit	dfa2f0483360d4d6f2324405464c9f281156bd87 (patch)
tree	5aa4cb885109efde347b21e9669be75ff8175bf9 /net/ipv4/tcp_input.c
parent	63c8778d9149d5df86e3a5dd192103d42d74cc93 (diff)