summaryrefslogtreecommitdiff
path: root/net/mptcp
AgeCommit message (Collapse)Author
2020-05-06Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller
Conflicts were all overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-30mptcp: fix uninitialized value accessPaolo Abeni
tcp_v{4,6}_syn_recv_sock() set 'own_req' only when returning a not NULL 'child', let's check 'own_req' only if child is available to avoid an - unharmful - UBSAN splat. v1 -> v2: - reference the correct hash Fixes: 4c8941de781c ("mptcp: avoid flipping mp_capable field in syn_recv_sock()") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-30mptcp: initialize the data_fin field for mpc packetsPaolo Abeni
When parsing MPC+data packets we set the dss field, so we must also initialize the data_fin, or we can find stray value there. Fixes: 9a19371bf029 ("mptcp: fix data_fin handing in RX path") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-30mptcp: fix 'use_ack' option access.Paolo Abeni
The mentioned RX option field is initialized only for DSS packet, we must access it only if 'dss' is set too, or the subflow will end-up in a bad status, leading to RFC violations. Fixes: d22f4988ffec ("mptcp: process MP_CAPABLE data option") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-30mptcp: avoid a WARN on bad input.Paolo Abeni
Syzcaller has found a way to trigger the WARN_ON_ONCE condition in check_fully_established(). The root cause is a legit fallback to TCP scenario, so replace the WARN with a plain message on a more strict condition. Fixes: f296234c98a8 ("mptcp: Add handling of incoming MP_JOIN requests") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-30mptcp: move option parsing into mptcp_incoming_options()Paolo Abeni
The mptcp_options_received structure carries several per packet flags (mp_capable, mp_join, etc.). Such fields must be cleared on each packet, even on dropped ones or packet not carrying any MPTCP options, but the current mptcp code clears them only on TCP option reset. On several races/corner cases we end-up with stray bits in incoming options, leading to WARN_ON splats. e.g.: [ 171.164906] Bad mapping: ssn=32714 map_seq=1 map_data_len=32713 [ 171.165006] WARNING: CPU: 1 PID: 5026 at net/mptcp/subflow.c:533 warn_bad_map (linux-mptcp/net/mptcp/subflow.c:533 linux-mptcp/net/mptcp/subflow.c:531) [ 171.167632] Modules linked in: ip6_vti ip_vti ip_gre ipip sit tunnel4 ip_tunnel geneve ip6_udp_tunnel udp_tunnel macsec macvtap tap ipvlan macvlan 8021q garp mrp xfrm_interface veth netdevsim nlmon dummy team bonding vcan bridge stp llc ip6_gre gre ip6_tunnel tunnel6 tun binfmt_misc intel_rapl_msr intel_rapl_common rfkill kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev virtio_balloon pcspkr i2c_piix4 sunrpc ip_tables xfs libcrc32c crc32c_intel serio_raw virtio_console ata_generic virtio_blk virtio_net net_failover failover ata_piix libata [ 171.199464] CPU: 1 PID: 5026 Comm: repro Not tainted 5.7.0-rc1.mptcp_f227fdf5d388+ #95 [ 171.200886] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-2.fc30 04/01/2014 [ 171.202546] RIP: 0010:warn_bad_map (linux-mptcp/net/mptcp/subflow.c:533 linux-mptcp/net/mptcp/subflow.c:531) [ 171.206537] Code: c1 ea 03 0f b6 14 02 48 89 f8 83 e0 07 83 c0 03 38 d0 7c 04 84 d2 75 1d 8b 55 3c 44 89 e6 48 c7 c7 20 51 13 95 e8 37 8b 22 fe <0f> 0b 48 83 c4 08 5b 5d 41 5c c3 89 4c 24 04 e8 db d6 94 fe 8b 4c [ 171.220473] RSP: 0018:ffffc90000150560 EFLAGS: 00010282 [ 171.221639] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 [ 171.223108] RDX: 0000000000000000 RSI: 0000000000000008 RDI: fffff5200002a09e [ 171.224388] RBP: ffff8880aa6e3c00 R08: 0000000000000001 R09: fffffbfff2ec9955 [ 171.225706] R10: ffffffff9764caa7 R11: fffffbfff2ec9954 R12: 0000000000007fca [ 171.227211] R13: ffff8881066f4a7f R14: ffff8880aa6e3c00 R15: 0000000000000020 [ 171.228460] FS: 00007f8623719740(0000) GS:ffff88810be00000(0000) knlGS:0000000000000000 [ 171.230065] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 171.231303] CR2: 00007ffdab190a50 CR3: 00000001038ea006 CR4: 0000000000160ee0 [ 171.232586] Call Trace: [ 171.233109] <IRQ> [ 171.233531] get_mapping_status (linux-mptcp/net/mptcp/subflow.c:691) [ 171.234371] mptcp_subflow_data_available (linux-mptcp/net/mptcp/subflow.c:736 linux-mptcp/net/mptcp/subflow.c:832) [ 171.238181] subflow_state_change (linux-mptcp/net/mptcp/subflow.c:1085 (discriminator 1)) [ 171.239066] tcp_fin (linux-mptcp/net/ipv4/tcp_input.c:4217) [ 171.240123] tcp_data_queue (linux-mptcp/./include/linux/compiler.h:199 linux-mptcp/net/ipv4/tcp_input.c:4822) [ 171.245083] tcp_rcv_established (linux-mptcp/./include/linux/skbuff.h:1785 linux-mptcp/./include/net/tcp.h:1774 linux-mptcp/./include/net/tcp.h:1847 linux-mptcp/net/ipv4/tcp_input.c:5238 linux-mptcp/net/ipv4/tcp_input.c:5730) [ 171.254089] tcp_v4_rcv (linux-mptcp/./include/linux/spinlock.h:393 linux-mptcp/net/ipv4/tcp_ipv4.c:2009) [ 171.258969] ip_protocol_deliver_rcu (linux-mptcp/net/ipv4/ip_input.c:204 (discriminator 1)) [ 171.260214] ip_local_deliver_finish (linux-mptcp/./include/linux/rcupdate.h:651 linux-mptcp/net/ipv4/ip_input.c:232) [ 171.261389] ip_local_deliver (linux-mptcp/./include/linux/netfilter.h:307 linux-mptcp/./include/linux/netfilter.h:301 linux-mptcp/net/ipv4/ip_input.c:252) [ 171.265884] ip_rcv (linux-mptcp/./include/linux/netfilter.h:307 linux-mptcp/./include/linux/netfilter.h:301 linux-mptcp/net/ipv4/ip_input.c:539) [ 171.273666] process_backlog (linux-mptcp/./include/linux/rcupdate.h:651 linux-mptcp/net/core/dev.c:6135) [ 171.275328] net_rx_action (linux-mptcp/net/core/dev.c:6572 linux-mptcp/net/core/dev.c:6640) [ 171.280472] __do_softirq (linux-mptcp/./arch/x86/include/asm/jump_label.h:25 linux-mptcp/./include/linux/jump_label.h:200 linux-mptcp/./include/trace/events/irq.h:142 linux-mptcp/kernel/softirq.c:293) [ 171.281379] do_softirq_own_stack (linux-mptcp/arch/x86/entry/entry_64.S:1083) [ 171.282358] </IRQ> We could address the issue clearing explicitly the relevant fields in several places - tcp_parse_option, tcp_fast_parse_options, possibly others. Instead we move the MPTCP option parsing into the already existing mptcp ingress hook, so that we need to clear the fields in a single place. This allows us dropping an MPTCP hook from the TCP code and removing the quite large mptcp_options_received from the tcp_sock struct. On the flip side, the MPTCP sockets will traverse the option space twice (in tcp_parse_option() and in mptcp_incoming_options(). That looks acceptable: we already do that for syn and 3rd ack packets, plain TCP socket will benefit from it, and even MPTCP sockets will experience better code locality, reducing the jumps between TCP and MPTCP code. v1 -> v2: - rebased on current '-net' tree Fixes: 648ef4b88673 ("mptcp: Implement MPTCP receive path") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-30mptcp: consolidate synack processing.Paolo Abeni
Currently the MPTCP code uses 2 hooks to process syn-ack packets, mptcp_rcv_synsent() and the sk_rx_dst_set() callback. We can drop the first, moving the relevant code into the latter, reducing the hooking into the TCP code. This is also needed by the next patch. v1 -> v2: - use local tcp sock ptr instead of casting the sk variable several times - DaveM Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-29mptcp: replace mptcp_disconnect with a stubFlorian Westphal
Paolo points out that mptcp_disconnect is bogus: "lock_sock(sk); looks suspicious (lock should be already held by the caller) And call to: tcp_disconnect(sk, flags); too, sk is not a tcp socket". ->disconnect() gets called from e.g. inet_stream_connect when one tries to disassociate a connected socket again (to re-connect without closing the socket first). MPTCP however uses mptcp_stream_connect, not inet_stream_connect, for the mptcp-socket connect call. inet_stream_connect only gets called indirectly, for the tcp socket, so any ->disconnect() calls end up calling tcp_disconnect for that tcp subflow sk. This also explains why syzkaller has not yet reported a problem here. So for now replace this with a stub that doesn't do anything. Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/14 Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-25mptcp: fix race in msk status updatePaolo Abeni
Currently subflow_finish_connect() changes unconditionally any msk socket status other than TCP_ESTABLISHED. If an unblocking connect() races with close(), we can end-up triggering: IPv4: Attempt to release TCP socket in state 1 00000000e32b8b7e when the msk socket is disposed. Be sure to enter the established status only from SYN_SENT. Fixes: c3c123d16c0e ("net: mptcp: don't hang in mptcp_sendmsg() after TCP fallback") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-25tcp: mptcp: use mptcp receive buffer space to select rcv windowFlorian Westphal
In MPTCP, the receive window is shared across all subflows, because it refers to the mptcp-level sequence space. MPTCP receivers already place incoming packets on the mptcp socket receive queue and will charge it to the mptcp socket rcvbuf until userspace consumes the data. Update __tcp_select_window to use the occupancy of the parent/mptcp socket instead of the subflow socket in case the tcp socket is part of a logical mptcp connection. This commit doesn't change choice of initial window for passive or active connections. While it would be possible to change those as well, this adds complexity (especially when handling MP_JOIN requests). Furthermore, the MPTCP RFC specifically says that a MPTCP sender 'MUST NOT use the RCV.WND field of a TCP segment at the connection level if it does not also carry a DSS option with a Data ACK field.' SYN/SYNACK packets do not carry a DSS option with a Data ACK field. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-23mptcp/pm_netlink.c : add check for nla_put_in/6_addrBo YU
Normal there should be checked for nla_put_in6_addr like other usage in net. Detected by CoverityScan, CID# 1461639 Fixes: 01cacb00b35c ("mptcp: add netlink-based PM") Signed-off-by: Bo YU <tsu.yubo@gmail.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-22mptcp: fix data_fin handing in RX pathPaolo Abeni
The data fin flag is set only via a DSS option, but mptcp_incoming_options() copies it unconditionally from the provided RX options. Since we do not clear all the mptcp sock RX options in a socket free/alloc cycle, we can end-up with a stray data_fin value while parsing e.g. MPC packets. That would lead to mapping data corruption and will trigger a few WARN_ON() in the RX path. Instead of adding a costly memset(), fetch the data_fin flag only for DSS packets - when we always explicitly initialize such bit at option parsing time. Fixes: 648ef4b88673 ("mptcp: Implement MPTCP receive path") Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-20mptcp: drop req socket remote_key* fieldsPaolo Abeni
We don't need them, as we can use the current ingress opt data instead. Setting them in syn_recv_sock() may causes inconsistent mptcp socket status, as per previous commit. Fixes: cc7972ea1932 ("mptcp: parse and emit MP_CAPABLE option according to v1 spec") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-20mptcp: avoid flipping mp_capable field in syn_recv_sock()Paolo Abeni
If multiple CPUs races on the same req_sock in syn_recv_sock(), flipping such field can cause inconsistent child socket status. When racing, the CPU losing the req ownership may still change the mptcp request socket mp_capable flag while the CPU owning the request is cloning the socket, leaving the child socket with 'is_mptcp' set but no 'mp_capable' flag. Such socket will stay with 'conn' field cleared, heading to oops in later mptcp callback. Address the issue tracking the fallback status in a local variable. Fixes: 58b09919626b ("mptcp: create msk early") Co-developed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-20mptcp: handle mptcp listener destruction via rcuFlorian Westphal
Following splat can occur during self test: BUG: KASAN: use-after-free in subflow_data_ready+0x156/0x160 Read of size 8 at addr ffff888100c35c28 by task mptcp_connect/4808 subflow_data_ready+0x156/0x160 tcp_child_process+0x6a3/0xb30 tcp_v4_rcv+0x2231/0x3730 ip_protocol_deliver_rcu+0x5c/0x860 ip_local_deliver_finish+0x220/0x360 ip_local_deliver+0x1c8/0x4e0 ip_rcv_finish+0x1da/0x2f0 ip_rcv+0xd0/0x3c0 __netif_receive_skb_one_core+0xf5/0x160 __netif_receive_skb+0x27/0x1c0 process_backlog+0x21e/0x780 net_rx_action+0x35f/0xe90 do_softirq+0x4c/0x50 [..] This occurs when accessing subflow_ctx->conn. Problem is that tcp_child_process() calls listen sockets' sk_data_ready() notification, but it doesn't hold the listener lock. Another cpu calling close() on the listener will then cause transition of refcount to 0. Fixes: 58b09919626bf ("mptcp: create msk early") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-18mptcp: fix 'Attempt to release TCP socket in state' warningsFlorian Westphal
We need to set sk_state to CLOSED, else we will get following: IPv4: Attempt to release TCP socket in state 3 00000000b95f109e IPv4: Attempt to release TCP socket in state 10 00000000b95f109e First one is from inet_sock_destruct(), second one from mptcp_sk_clone failure handling. Setting sk_state to CLOSED isn't enough, we also need to orphan sk so it has DEAD flag set. Otherwise, a very similar warning is printed from inet_sock_destruct(). Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-18mptcp: fix splat when incoming connection is never accepted before exit/closeFlorian Westphal
Following snippet (replicated from syzkaller reproducer) generates warning: "IPv4: Attempt to release TCP socket in state 1". int main(void) { struct sockaddr_in sin1 = { .sin_family = 2, .sin_port = 0x4e20, .sin_addr.s_addr = 0x010000e0, }; struct sockaddr_in sin2 = { .sin_family = 2, .sin_addr.s_addr = 0x0100007f, }; struct sockaddr_in sin3 = { .sin_family = 2, .sin_port = 0x4e20, .sin_addr.s_addr = 0x0100007f, }; int r0 = socket(0x2, 0x1, 0x106); int r1 = socket(0x2, 0x1, 0x106); bind(r1, (void *)&sin1, sizeof(sin1)); connect(r1, (void *)&sin2, sizeof(sin2)); listen(r1, 3); return connect(r0, (void *)&sin3, 0x4d); } Reason is that the newly generated mptcp socket is closed via the ulp release of the tcp listener socket when its accept backlog gets purged. To fix this, delay setting the ESTABLISHED state until after userspace calls accept and via mptcp specific destructor. Fixes: 58b09919626bf ("mptcp: create msk early") Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/9 Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-12mptcp: fix double-unlock in mptcp_pollFlorian Westphal
mptcp_connect/28740 is trying to release lock (sk_lock-AF_INET) at: [<ffffffff82c15869>] mptcp_poll+0xb9/0x550 but there are no more locks to release! Call Trace: lock_release+0x50f/0x750 release_sock+0x171/0x1b0 mptcp_poll+0xb9/0x550 sock_poll+0x157/0x470 ? get_net_ns+0xb0/0xb0 do_sys_poll+0x63c/0xdd0 Problem is that __mptcp_tcp_fallback() releases the mptcp socket lock, but after recent change it doesn't do this in all of its return paths. To fix this, remove the unlock from __mptcp_tcp_fallback() and always do the unlock in the caller. Also add a small comment as to why we have this __mptcp_needs_tcp_fallback(). Fixes: 0b4f33def7bbde ("mptcp: fix tcp fallback crash") Reported-by: syzbot+e56606435b7bfeea8cf5@syzkaller.appspotmail.com Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-03mptcp: add some missing pr_fmt definesGeliang Tang
Some of the mptcp logs didn't print out the format string: [ 185.651493] DSS [ 185.651494] data_fin=0 dsn64=0 use_map=0 ack64=1 use_ack=1 [ 185.651494] data_ack=13792750332298763796 [ 185.651495] MPTCP: msk=00000000c4b81cfc ssk=000000009743af53 data_avail=0 skb=0000000063dc595d [ 185.651495] MPTCP: msk=00000000c4b81cfc ssk=000000009743af53 status=0 [ 185.651495] MPTCP: msk ack_seq=9bbc894565aa2f9a subflow ack_seq=9bbc894565aa2f9a [ 185.651496] MPTCP: msk=00000000c4b81cfc ssk=000000009743af53 data_avail=1 skb=0000000012e809e1 So this patch added these missing pr_fmt defines. Then we can get the same format string "MPTCP" in all mptcp logs like this: [ 142.795829] MPTCP: DSS [ 142.795829] MPTCP: data_fin=0 dsn64=0 use_map=0 ack64=1 use_ack=1 [ 142.795829] MPTCP: data_ack=8089704603109242421 [ 142.795830] MPTCP: msk=00000000133a24e0 ssk=000000002e508c64 data_avail=0 skb=00000000d5f230df [ 142.795830] MPTCP: msk=00000000133a24e0 ssk=000000002e508c64 status=0 [ 142.795831] MPTCP: msk ack_seq=66790290f1199d9b subflow ack_seq=66790290f1199d9b [ 142.795831] MPTCP: msk=00000000133a24e0 ssk=000000002e508c64 data_avail=1 skb=00000000de5aca2e Signed-off-by: Geliang Tang <geliangtang@gmail.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-02mptcp: fix "fn parameter not described" warningsMatthieu Baerts
Obtained with: $ make W=1 net/mptcp/token.o net/mptcp/token.c:53: warning: Function parameter or member 'req' not described in 'mptcp_token_new_request' net/mptcp/token.c:98: warning: Function parameter or member 'sk' not described in 'mptcp_token_new_connect' net/mptcp/token.c:133: warning: Function parameter or member 'conn' not described in 'mptcp_token_new_accept' net/mptcp/token.c:178: warning: Function parameter or member 'token' not described in 'mptcp_token_destroy_request' net/mptcp/token.c:191: warning: Function parameter or member 'token' not described in 'mptcp_token_destroy' Fixes: 79c0949e9a09 (mptcp: Add key generation and token tree) Fixes: 58b09919626b (mptcp: create msk early) Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-02mptcp: re-check dsn before reading from subflowFlorian Westphal
mptcp_subflow_data_available() is commonly called via ssk->sk_data_ready(), in this case the mptcp socket lock cannot be acquired. Therefore, while we can safely discard subflow data that was already received up to msk->ack_seq, we cannot be sure that 'subflow->data_avail' will still be valid at the time userspace wants to read the data -- a previous read on a different subflow might have carried this data already. In that (unlikely) event, msk->ack_seq will have been updated and will be ahead of the subflow dsn. We can check for this condition and skip/resync to the expected sequence number. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-02mptcp: subflow: check parent mptcp socket on subflow state changeFlorian Westphal
This is needed at least until proper MPTCP-Level fin/reset signalling gets added: We wake parent when a subflow changes, but we should do this only when all subflows have closed, not just one. Schedule the mptcp worker and tell it to check eof state on all subflows. Only flag mptcp socket as closed and wake userspace processes blocking in poll if all subflows have closed. Co-developed-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-02mptcp: fix tcp fallback crashFlorian Westphal
Christoph Paasch reports following crash: general protection fault [..] CPU: 0 PID: 2874 Comm: syz-executor072 Not tainted 5.6.0-rc5 #62 RIP: 0010:__pv_queued_spin_lock_slowpath kernel/locking/qspinlock.c:471 [..] queued_spin_lock_slowpath arch/x86/include/asm/qspinlock.h:50 [inline] do_raw_spin_lock include/linux/spinlock.h:181 [inline] spin_lock_bh include/linux/spinlock.h:343 [inline] __mptcp_flush_join_list+0x44/0xb0 net/mptcp/protocol.c:278 mptcp_shutdown+0xb3/0x230 net/mptcp/protocol.c:1882 [..] Problem is that mptcp_shutdown() socket isn't an mptcp socket, its a plain tcp_sk. Thus, trying to access mptcp_sk specific members accesses garbage. Root cause is that accept() returns a fallback (tcp) socket, not an mptcp one. There is code in getpeername to detect this and override the sockets stream_ops. But this will only run when accept() caller provided a sockaddr struct. "accept(fd, NULL, 0)" will therefore result in mptcp stream ops, but with sock->sk pointing at a tcp_sk. Update the existing fallback handling to detect this as well. Moreover, mptcp_shutdown did not have fallback handling, and mptcp_poll did it too late so add that there as well. Reported-by: Christoph Paasch <cpaasch@apple.com> Tested-by: Christoph Paasch <cpaasch@apple.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29mptcp: add netlink-based PMPaolo Abeni
Expose a new netlink family to userspace to control the PM, setting: - list of local addresses to be signalled. - list of local addresses used to created subflows. - maximum number of add_addr option to react When the msk is fully established, the PM netlink attempts to announce the 'signal' list via the ADD_ADDR option. Since we currently lack the ADD_ADDR echo (and related event) only the first addr is sent. After exhausting the 'announce' list, the PM tries to create subflow for each addr in 'local' list, waiting for each connection to be completed before attempting the next one. Idea is to add an additional PM hook for ADD_ADDR echo, to allow the PM netlink announcing multiple addresses, in sequence. Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29mptcp: add and use MIB counter infrastructureFlorian Westphal
Exported via same /proc file as the Linux TCP MIB counters, so "netstat -s" or "nstat" will show them automatically. The MPTCP MIB counters are allocated in a distinct pcpu area in order to avoid bloating/wasting TCP pcpu memory. Counters are allocated once the first MPTCP socket is created in a network namespace and free'd on exit. If no sockets have been allocated, all-zero mptcp counters are shown. The MIB counter list is taken from the multipath-tcp.org kernel, but only a few counters have been picked up so far. The counter list can be increased at any time later on. v2 -> v3: - remove 'inline' in foo.c files (David S. Miller) Co-developed-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29mptcp: allow dumping subflow context to userspaceDavide Caratti
add ulp-specific diagnostic functions, so that subflow information can be dumped to userspace programs like 'ss'. v2 -> v3: - uapi: use bit macros appropriate for userspace Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Co-developed-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29mptcp: implement and use MPTCP-level retransmissionPaolo Abeni
On timeout event, schedule a work queue to do the retransmission. Retransmission code closely resembles the sendmsg() implementation and re-uses mptcp_sendmsg_frag, providing a dummy msghdr - for flags' sake - and peeking the relevant dfrag from the rtx head. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29mptcp: rework mptcp_sendmsg_frag to accept optional dfragPaolo Abeni
This will simplify mptcp-level retransmission implementation in the next patch. If dfrag is provided by the caller, skip kernel space memory allocation and use data and metadata provided by the dfrag itself. Because a peer could ack data at TCP level but refrain from sending mptcp-level ACKs, we could grow the mptcp socket backlog indefinitely. We should thus block mptcp_sendmsg until the peer has acked some of the sent data. In order to be able to do so, increment the mptcp socket wmem_queued counter on memory allocation and decrement it when releasing the memory on mptcp-level ack reception. Because TCP performns sndbuf auto-tuning up to tcp_wmem_max[2], make this the mptcp sk_sndbuf limit. In the future we could add experiment with autotuning as TCP does in tcp_sndbuf_expand(). v2 -> v3: - remove 'inline' in foo.c files (David S. Miller) Co-developed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29mptcp: allow partial cleaning of rtx head dfragFlorian Westphal
After adding wmem accounting for the mptcp socket we could get into a situation where the mptcp socket can't transmit more data, and mptcp_clean_una doesn't reduce wmem even if snd_una has advanced because it currently will only remove entire dfrags. Allow advancing the dfrag head sequence and reduce wmem, even though this isn't correct (as we can't release the page). Because we will soon block on mptcp sk in case wmem is too large, call sk_stream_write_space() in case we reduced the backlog so userspace task blocked in sendmsg or poll will be woken up. This isn't an issue if the send buffer is large, but it is when SO_SNDBUF is used to reduce it to a lower value. Note we can still get a deadlock for low SO_SNDBUF values in case both sides of the connection write to the socket: both could be blocked due to wmem being too small -- and current mptcp stack will only increment mptcp ack_seq on recv. This doesn't happen with the selftest as it uses poll() and will always call recv if there is data to read. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29mptcp: implement memory accounting for mptcp rtx queuePaolo Abeni
Charge the data on the rtx queue to the master MPTCP socket, too. Such memory in uncharged when the data is acked/dequeued. Also account mptcp sockets inuse via a protocol specific pcpu counter. Co-developed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29mptcp: introduce MPTCP retransmission timerPaolo Abeni
The timer will be used to schedule retransmission. It's frequency is based on the current subflow RTO estimation and is reset on every una_seq update The timer is clearer for good by __mptcp_clear_xmit() Also clean MPTCP rtx queue before each transmission. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29mptcp: queue data for mptcp level retransmissionPaolo Abeni
Keep the send page fragment on an MPTCP level retransmission queue. The queue entries are allocated inside the page frag allocator, acquiring an additional reference to the page for each list entry. Also switch to a custom page frag refill function, to ensure that the current page fragment can always host an MPTCP rtx queue entry. The MPTCP rtx queue is flushed at disconnect() and close() time Note that now we need to call __mptcp_init_sock() regardless of mptcp enable status, as the destructor will try to walk the rtx_queue. v2 -> v3: - remove 'inline' in foo.c files (David S. Miller) Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29mptcp: update per unacked sequence on pkt receptionPaolo Abeni
So that we keep per unacked sequence number consistent; since we update per msk data, use an atomic64 cmpxchg() to protect against concurrent updates from multiple subflows. Initialize the snd_una at connect()/accept() time. Co-developed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29mptcp: Implement path manager interface commandsPeter Krystad
Fill in more path manager functionality by adding a worker function and modifying the related stub functions to schedule the worker. Co-developed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Florian Westphal <fw@strlen.de> Co-developed-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29mptcp: Add handling of outgoing MP_JOIN requestsPeter Krystad
Subflow creation may be initiated by the path manager when the primary connection is fully established and a remote address has been received via ADD_ADDR. Create an in-kernel sock and use kernel_connect() to initiate connection. Passive sockets can't acquire the mptcp socket lock at subflow creation time, so an additional list protected by a new spinlock is used to track the MPJ subflows. Such list is spliced into conn_list tail every time the msk socket lock is acquired, so that it will not interfere with data flow on the original connection. Data flow and connection failover not addressed by this commit. Co-developed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Florian Westphal <fw@strlen.de> Co-developed-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29mptcp: Add handling of incoming MP_JOIN requestsPeter Krystad
Process the MP_JOIN option in a SYN packet with the same flow as MP_CAPABLE but when the third ACK is received add the subflow to the MPTCP socket subflow list instead of adding it to the TCP socket accept queue. The subflow is added at the end of the subflow list so it will not interfere with the existing subflows operation and no data is expected to be transmitted on it. Co-developed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Florian Westphal <fw@strlen.de> Co-developed-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29mptcp: Add path manager interfacePeter Krystad
Add enough of a path manager interface to allow sending of ADD_ADDR when an incoming MPTCP connection is created. Capable of sending only a single IPv4 ADD_ADDR option. The 'pm_data' element of the connection sock will need to be expanded to handle multiple interfaces and IPv6. Partial processing of the incoming ADD_ADDR is included so the path manager notification of that event happens at the proper time, which involves validating the incoming address information. This is a skeleton interface definition for events generated by MPTCP. Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Co-developed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Florian Westphal <fw@strlen.de> Co-developed-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Co-developed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29mptcp: Add ADD_ADDR handlingPeter Krystad
Add handling for sending and receiving the ADD_ADDR, ADD_ADDR6, and RM_ADDR suboptions. Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Co-developed-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-23net: mptcp: don't hang in mptcp_sendmsg() after TCP fallbackDavide Caratti
it's still possible for packetdrill to hang in mptcp_sendmsg(), when the MPTCP socket falls back to regular TCP (e.g. after receiving unsupported flags/version during the three-way handshake). Adjust MPTCP socket state earlier, to ensure correct functionality of mptcp_sendmsg() even in case of TCP fallback. Fixes: 767d3ded5fb8 ("net: mptcp: don't hang before sending 'MP capable with data'") Fixes: 1954b86016cf ("mptcp: Check connection state before attempting send") Signed-off-by: Davide Caratti <dcaratti@redhat.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-21mptcp: Remove set but not used variable 'can_ack'YueHaibing
Fixes gcc '-Wunused-but-set-variable' warning: net/mptcp/options.c: In function 'mptcp_established_options_dss': net/mptcp/options.c:338:7: warning: variable 'can_ack' set but not used [-Wunused-but-set-variable] commit dc093db5cc05 ("mptcp: drop unneeded checks") leave behind this unused, remove it. Signed-off-by: YueHaibing <yuehaibing@huawei.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-19mptcp: rename fourth ack fieldPaolo Abeni
The name is misleading, it actually tracks the 'fully established' status. Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-17mptcp: move msk state update to subflow_syn_recv_sock()Paolo Abeni
After commit 58b09919626b ("mptcp: create msk early"), the msk socket is already available at subflow_syn_recv_sock() time. Let's move there the state update, to mirror more closely the first subflow state. The above will also help multiple subflow supports. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-15mptcp: drop unneeded checksPaolo Abeni
After the previous patch subflow->conn is always != NULL and is never changed. We can drop a bunch of now unneeded checks. v1 -> v2: - rebased on top of commit 2398e3991bda ("mptcp: always include dack if possible.") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-15mptcp: create msk earlyPaolo Abeni
This change moves the mptcp socket allocation from mptcp_accept() to subflow_syn_recv_sock(), so that subflow->conn is now always set for the non fallback scenario. It allows cleaning up a bit mptcp_accept() reducing the additional locking and will allow fourther cleanup in the next patch. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-12Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller
Minor overlapping changes, nothing serious. Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-11net: mptcp: don't hang before sending 'MP capable with data'Davide Caratti
the following packetdrill script socket(..., SOCK_STREAM, IPPROTO_MPTCP) = 3 fcntl(3, F_GETFL) = 0x2 (flags O_RDWR) fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress) > S 0:0(0) <mss 1460,sackOK,TS val 100 ecr 0,nop,wscale 8,mpcapable v1 flags[flag_h] nokey> < S. 0:0(0) ack 1 win 65535 <mss 1460,sackOK,TS val 700 ecr 100,nop,wscale 8,mpcapable v1 flags[flag_h] key[skey=2]> > . 1:1(0) ack 1 win 256 <nop, nop, TS val 100 ecr 700,mpcapable v1 flags[flag_h] key[ckey,skey]> getsockopt(3, SOL_SOCKET, SO_ERROR, [0], [4]) = 0 fcntl(3, F_SETFL, O_RDWR) = 0 write(3, ..., 1000) = 1000 doesn't transmit 1KB data packet after a successful three-way-handshake, using mp_capable with data as required by protocol v1, and write() hangs forever: PID: 973 TASK: ffff97dd399cae80 CPU: 1 COMMAND: "packetdrill" #0 [ffffa9b94062fb78] __schedule at ffffffff9c90a000 #1 [ffffa9b94062fc08] schedule at ffffffff9c90a4a0 #2 [ffffa9b94062fc18] schedule_timeout at ffffffff9c90e00d #3 [ffffa9b94062fc90] wait_woken at ffffffff9c120184 #4 [ffffa9b94062fcb0] sk_stream_wait_connect at ffffffff9c75b064 #5 [ffffa9b94062fd20] mptcp_sendmsg at ffffffff9c8e801c #6 [ffffa9b94062fdc0] sock_sendmsg at ffffffff9c747324 #7 [ffffa9b94062fdd8] sock_write_iter at ffffffff9c7473c7 #8 [ffffa9b94062fe48] new_sync_write at ffffffff9c302976 #9 [ffffa9b94062fed0] vfs_write at ffffffff9c305685 #10 [ffffa9b94062ff00] ksys_write at ffffffff9c305985 #11 [ffffa9b94062ff38] do_syscall_64 at ffffffff9c004475 #12 [ffffa9b94062ff50] entry_SYSCALL_64_after_hwframe at ffffffff9ca0008c RIP: 00007f959407eaf7 RSP: 00007ffe9e95a910 RFLAGS: 00000293 RAX: ffffffffffffffda RBX: 0000000000000008 RCX: 00007f959407eaf7 RDX: 00000000000003e8 RSI: 0000000001785fe0 RDI: 0000000000000008 RBP: 0000000001785fe0 R8: 0000000000000000 R9: 0000000000000003 R10: 0000000000000007 R11: 0000000000000293 R12: 00000000000003e8 R13: 00007ffe9e95ae30 R14: 0000000000000000 R15: 0000000000000000 ORIG_RAX: 0000000000000001 CS: 0033 SS: 002b Fix it ensuring that socket state is TCP_ESTABLISHED on reception of the third ack. Fixes: 1954b86016cf ("mptcp: Check connection state before attempting send") Suggested-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-09mptcp: don't grow mptcp socket receive buffer when rcvbuf is lockedFlorian Westphal
The mptcp rcvbuf size is adjusted according to the subflow rcvbuf size. This should not be done if userspace did set a fixed value. Fixes: 600911ff5f72bae ("mptcp: add rmem queue accounting") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-05mptcp: always include dack if possible.Paolo Abeni
Currently passive MPTCP socket can skip including the DACK option - if the peer sends data before accept() completes. The above happens because the msk 'can_ack' flag is set only after the accept() call. Such missing DACK option may cause - as per RFC spec - unwanted fallback to TCP. This change addresses the issue using the key material available in the current subflow, if any, to create a suitable dack option when msk ack seq is not yet available. v1 -> v2: - adavance the generated ack after the initial MPC packet Fixes: d22f4988ffec ("mptcp: process MP_CAPABLE data option") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-03mptcp: Only send DATA_FIN with final mappingMat Martineau
When a DATA_FIN is sent in a MPTCP DSS option that contains a data mapping, the DATA_FIN consumes one byte of space in the mapping. In this case, the DATA_FIN should only be included in the DSS option if its sequence number aligns with the end of the mapped data. Otherwise the subflow can send an incorrect implicit sequence number for the DATA_FIN, and the DATA_ACK for that sequence number would not close the MPTCP-level connection correctly. Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-03mptcp: Use per-subflow storage for DATA_FIN sequence numberMat Martineau
Instead of reading the MPTCP-level sequence number when sending DATA_FIN, store the data in the subflow so it can be safely accessed when the subflow TCP headers are written to the packet without the MPTCP-level lock held. This also allows the MPTCP-level socket to close individual subflows without closing the MPTCP connection. Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>