summaryrefslogtreecommitdiff
path: root/net/mptcp
AgeCommit message (Collapse)Author
2020-07-27mptcp: fix joined subflows with unblocking skMatthieu Baerts
Unblocking sockets used for outgoing connections were not containing inet info about the initial connection due to a typo there: the value of "err" variable is negative in the kernelspace. This fixes the creation of additional subflows where the remote port has to be reused if the other host didn't announce another one. This also fixes inet_diag showing blank info about MPTCP sockets from unblocking sockets doing a connect(). Fixes: 41be81a8d3d0 ("mptcp: fix unblocking connect()") Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-24net: pass a sockptr_t into ->setsockoptChristoph Hellwig
Rework the remaining setsockopt code to pass a sockptr_t instead of a plain user pointer. This removes the last remaining set_fs(KERNEL_DS) outside of architecture specific code. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Stefan Schmidt <stefan@datenfreihafen.org> [ieee802154] Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-24net: switch sock_set_timeout to sockptr_tChristoph Hellwig
Pass a sockptr_t to prepare for set_fs-less handling of the kernel pointer from bpf-cgroup. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-23subflow: introduce and use mptcp_can_accept_new_subflow()Paolo Abeni
So that we can easily perform some basic PM-related adimission checks before creating the child socket. Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Tested-by: Christoph Paasch <cpaasch@apple.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-23subflow: use rsk_ops->send_reset()Paolo Abeni
tcp_send_active_reset() is more prone to transient errors (memory allocation or xmit queue full): in stress conditions the kernel may drop the egress packet, and the client will be stuck. Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Tested-by: Christoph Paasch <cpaasch@apple.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-23subflow: explicitly check for plain tcp rskPaolo Abeni
When syncookie are in use, the TCP stack may feed into subflow_syn_recv_sock() plain TCP request sockets. We can't access mptcp_subflow_request_sock-specific fields on such sockets. Explicitly check the rsk ops to do safe accesses. Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Tested-by: Christoph Paasch <cpaasch@apple.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-23mptcp: cleanup subflow_finish_connect()Paolo Abeni
The mentioned function has several unneeded branches, handle each case - MP_CAPABLE, MP_JOIN, fallback - under a single conditional and drop quite a bit of duplicate code. Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Tested-by: Christoph Paasch <cpaasch@apple.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-23mptcp: explicitly track the fully established statusPaolo Abeni
Currently accepted msk sockets become established only after accept() returns the new sk to user-space. As MP_JOIN request are refused as per RFC spec on non fully established socket, the above causes mp_join self-tests instabilities. This change lets the msk entering the established status as soon as it receives the 3rd ack and propagates the first subflow fully established status on the msk socket. Finally we can change the subflow acceptance condition to take in account both the sock state and the msk fully established flag. Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Tested-by: Christoph Paasch <cpaasch@apple.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-23mptcp: mark as fallback even early onesPaolo Abeni
In the unlikely event of a failure at connect time, we currently clear the request_mptcp flag - so that the MPC handshake is not started at all, but the msk is not explicitly marked as fallback. This would lead to later insertion of wrong DSS options in the xmitted packets, in violation of RFC specs and possibly fooling the peer. Fixes: e1ff9e82e2ea ("net: mptcp: improve fallback to TCP") Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Tested-by: Christoph Paasch <cpaasch@apple.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-23mptcp: avoid data corruption on reinsertPaolo Abeni
When updating a partially acked data fragment, we actually corrupt it. This is irrelevant till we send data on a single subflow, as retransmitted data, if any are discarded by the peer as duplicate, but it will cause data corruption as soon as we will start creating non backup subflows. Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Tested-by: Christoph Paasch <cpaasch@apple.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-23subflow: always init 'rel_write_seq'Paolo Abeni
Currently we do not init the subflow write sequence for MP_JOIN subflows. This will cause bad mapping being generated as soon as we will use non backup subflow. Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Tested-by: Christoph Paasch <cpaasch@apple.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-22mptcp: zero token hash at creation time.Paolo Abeni
Otherwise the 'chain_len' filed will carry random values, some token creation calls will fail due to excessive chain length, causing unexpected fallback to TCP. Fixes: 2c5ebd001d4f ("mptcp: refactor token container") Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Tested-by: Christoph Paasch <cpaasch@apple.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-21mptcp: move helper to where its usedFlorian Westphal
Only used in token.c. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-19net: remove compat_sock_common_{get,set}sockoptChristoph Hellwig
Add the compat handling to sock_common_{get,set}sockopt instead, keyed of in_compat_syscall(). This allow to remove the now unused ->compat_{get,set}sockopt methods from struct proto_ops. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net> Acked-by: Stefan Schmidt <stefan@datenfreihafen.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-17mptcp: silence warning in subflow_data_ready()Davide Caratti
since commit d47a72152097 ("mptcp: fix race in subflow_data_ready()"), it is possible to observe a regression in MP_JOIN kselftests. For sockets in TCP_CLOSE state, it's not sufficient to just wake up the main socket: we also need to ensure that received data are made available to the reader. Silence the WARN_ON_ONCE() in these cases: it preserves the syzkaller fix and restores kselftests when they are ran as follows: # while true; do > make KBUILD_OUTPUT=/tmp/kselftest TARGETS=net/mptcp kselftest > done Reported-by: Florian Westphal <fw@strlen.de> Fixes: d47a72152097 ("mptcp: fix race in subflow_data_ready()") Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/47 Signed-off-by: Davide Caratti <dcaratti@redhat.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-16mptcp: use sha256() instead of open codingEric Biggers
Now that there's a function that calculates the SHA-256 digest of a buffer in one step, use it instead of sha256_init() + sha256_update() + sha256_final(). Reviewed-by: Ard Biesheuvel <ardb@kernel.org> Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net> Cc: mptcp@lists.01.org Cc: Mat Martineau <mathew.j.martineau@linux.intel.com> Cc: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-07-11Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller
All conflicts seemed rather trivial, with some guidance from Saeed Mameed on the tc_ct.c one. Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-09mptcp: add MPTCP socket diag interfacePaolo Abeni
exposes basic inet socket attribute, plus some MPTCP socket fields comprising PM status and MPTCP-level sequence numbers. Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-09mptcp: add msk interations helperPaolo Abeni
mptcp_token_iter_next() allow traversing all the MPTCP sockets inside the token container belonging to the given network namespace with a quite standard iterator semantic. That will be used by the next patch, but keep the API generic, as we plan to use this later for PM's sake. Additionally export mptcp_token_get_sock(), as it also will be used by the diag module. Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-07mptcp: fix DSS map generation on fin retransmissionPaolo Abeni
The RFC 8684 mandates that no-data DATA FIN packets should carry a DSS with 0 sequence number and data len equal to 1. Currently, on FIN retransmission we re-use the existing mapping; if the previous fin transmission was part of a partially acked data packet, we could end-up writing in the egress packet a non-compliant DSS. The above will be detected by a "Bad mapping" warning on the receiver side. This change addresses the issue explicitly checking for 0 len packet when adding the DATA_FIN option. Fixes: 6d0060f600ad ("mptcp: Write MPTCP DSS headers to outgoing data packets") Reported-by: syzbot+42a07faa5923cfaeb9c9@syzkaller.appspotmail.com Tested-by: Christoph Paasch <cpaasch@apple.com> Reviewed-by: Christoph Paasch <cpaasch@apple.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-07mptcp: use mptcp worker for path managementFlorian Westphal
We can re-use the existing work queue to handle path management instead of a dedicated work queue. Just move pm_worker to protocol.c, call it from the mptcp worker and get rid of the msk lock (already held). Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-06mptcp: fix race in subflow_data_ready()Davide Caratti
syzkaller was able to make the kernel reach subflow_data_ready() for a server subflow that was closed before subflow_finish_connect() completed. In these cases we can avoid using the path for regular/fallback MPTCP data, and just wake the main socket, to avoid the following warning: WARNING: CPU: 0 PID: 9370 at net/mptcp/subflow.c:885 subflow_data_ready+0x1e6/0x290 net/mptcp/subflow.c:885 Kernel panic - not syncing: panic_on_warn set ... CPU: 0 PID: 9370 Comm: syz-executor.0 Not tainted 5.7.0 #106 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014 Call Trace: <IRQ> __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0xb7/0xfe lib/dump_stack.c:118 panic+0x29e/0x692 kernel/panic.c:221 __warn.cold+0x2f/0x3d kernel/panic.c:582 report_bug+0x28b/0x2f0 lib/bug.c:195 fixup_bug arch/x86/kernel/traps.c:105 [inline] fixup_bug arch/x86/kernel/traps.c:100 [inline] do_error_trap+0x10f/0x180 arch/x86/kernel/traps.c:197 do_invalid_op+0x32/0x40 arch/x86/kernel/traps.c:216 invalid_op+0x1e/0x30 arch/x86/entry/entry_64.S:1027 RIP: 0010:subflow_data_ready+0x1e6/0x290 net/mptcp/subflow.c:885 Code: 04 02 84 c0 74 06 0f 8e 91 00 00 00 41 0f b6 5e 48 31 ff 83 e3 18 89 de e8 37 ec 3d fe 84 db 0f 85 65 ff ff ff e8 fa ea 3d fe <0f> 0b e9 59 ff ff ff e8 ee ea 3d fe 48 89 ee 4c 89 ef e8 f3 77 ff RSP: 0018:ffff88811b2099b0 EFLAGS: 00010206 RAX: ffff888111197000 RBX: 0000000000000000 RCX: ffffffff82fbc609 RDX: 0000000000000100 RSI: ffffffff82fbc616 RDI: 0000000000000001 RBP: ffff8881111bc800 R08: ffff888111197000 R09: ffffed10222a82af R10: ffff888111541577 R11: ffffed10222a82ae R12: 1ffff11023641336 R13: ffff888111541000 R14: ffff88810fd4ca00 R15: ffff888111541570 tcp_child_process+0x754/0x920 net/ipv4/tcp_minisocks.c:841 tcp_v4_do_rcv+0x749/0x8b0 net/ipv4/tcp_ipv4.c:1642 tcp_v4_rcv+0x2666/0x2e60 net/ipv4/tcp_ipv4.c:1999 ip_protocol_deliver_rcu+0x29/0x1f0 net/ipv4/ip_input.c:204 ip_local_deliver_finish net/ipv4/ip_input.c:231 [inline] NF_HOOK include/linux/netfilter.h:421 [inline] ip_local_deliver+0x2da/0x390 net/ipv4/ip_input.c:252 dst_input include/net/dst.h:441 [inline] ip_rcv_finish net/ipv4/ip_input.c:428 [inline] ip_rcv_finish net/ipv4/ip_input.c:414 [inline] NF_HOOK include/linux/netfilter.h:421 [inline] ip_rcv+0xef/0x140 net/ipv4/ip_input.c:539 __netif_receive_skb_one_core+0x197/0x1e0 net/core/dev.c:5268 __netif_receive_skb+0x27/0x1c0 net/core/dev.c:5382 process_backlog+0x1e5/0x6d0 net/core/dev.c:6226 napi_poll net/core/dev.c:6671 [inline] net_rx_action+0x3e3/0xd70 net/core/dev.c:6739 __do_softirq+0x18c/0x634 kernel/softirq.c:292 do_softirq_own_stack+0x2a/0x40 arch/x86/entry/entry_64.S:1082 </IRQ> do_softirq.part.0+0x26/0x30 kernel/softirq.c:337 do_softirq arch/x86/include/asm/preempt.h:26 [inline] __local_bh_enable_ip+0x46/0x50 kernel/softirq.c:189 local_bh_enable include/linux/bottom_half.h:32 [inline] rcu_read_unlock_bh include/linux/rcupdate.h:723 [inline] ip_finish_output2+0x78a/0x19c0 net/ipv4/ip_output.c:229 __ip_finish_output+0x471/0x720 net/ipv4/ip_output.c:306 dst_output include/net/dst.h:435 [inline] ip_local_out+0x181/0x1e0 net/ipv4/ip_output.c:125 __ip_queue_xmit+0x7a1/0x14e0 net/ipv4/ip_output.c:530 __tcp_transmit_skb+0x19dc/0x35e0 net/ipv4/tcp_output.c:1238 __tcp_send_ack.part.0+0x3c2/0x5b0 net/ipv4/tcp_output.c:3785 __tcp_send_ack net/ipv4/tcp_output.c:3791 [inline] tcp_send_ack+0x7d/0xa0 net/ipv4/tcp_output.c:3791 tcp_rcv_synsent_state_process net/ipv4/tcp_input.c:6040 [inline] tcp_rcv_state_process+0x36a4/0x49c2 net/ipv4/tcp_input.c:6209 tcp_v4_do_rcv+0x343/0x8b0 net/ipv4/tcp_ipv4.c:1651 sk_backlog_rcv include/net/sock.h:996 [inline] __release_sock+0x1ad/0x310 net/core/sock.c:2548 release_sock+0x54/0x1a0 net/core/sock.c:3064 inet_wait_for_connect net/ipv4/af_inet.c:594 [inline] __inet_stream_connect+0x57e/0xd50 net/ipv4/af_inet.c:686 inet_stream_connect+0x53/0xa0 net/ipv4/af_inet.c:725 mptcp_stream_connect+0x171/0x5f0 net/mptcp/protocol.c:1920 __sys_connect_file net/socket.c:1854 [inline] __sys_connect+0x267/0x2f0 net/socket.c:1871 __do_sys_connect net/socket.c:1882 [inline] __se_sys_connect net/socket.c:1879 [inline] __x64_sys_connect+0x6f/0xb0 net/socket.c:1879 do_syscall_64+0xb7/0x3d0 arch/x86/entry/common.c:295 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7fb577d06469 Code: 00 f3 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ff 49 2b 00 f7 d8 64 89 01 48 RSP: 002b:00007fb5783d5dd8 EFLAGS: 00000246 ORIG_RAX: 000000000000002a RAX: ffffffffffffffda RBX: 000000000068bfa0 RCX: 00007fb577d06469 RDX: 000000000000004d RSI: 0000000020000040 RDI: 0000000000000003 RBP: 00000000ffffffff R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 000000000041427c R14: 00007fb5783d65c0 R15: 0000000000000003 Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/39 Reported-by: Christoph Paasch <cpaasch@apple.com> Fixes: e1ff9e82e2ea ("net: mptcp: improve fallback to TCP") Suggested-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-04mptcp: support IPV6_V6ONLY setsockoptFlorian Westphal
Without this, Opensshd fails to open an ipv6 socket listening socket: error: setsockopt IPV6_V6ONLY: Operation not supported error: Bind to port 22 on :: failed: Address already in use. Opensshd opens an ipv4 and and ipv6 listening socket, but because IPV6_V6ONLY setsockopt fails, the port number is already in use. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-04mptcp: add REUSEADDR/REUSEPORT supportFlorian Westphal
This will e.g. make 'sshd restart' work when MPTCP is used, as we will now set this option on the listener socket instead of only the mptcp socket (where it has no effect). We still need to copy the setting to the master socket so that a subsequent getsockopt() returns the expected value. Reported-by: Christoph Paasch <cpaasch@apple.com> Suggested-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-04net: use mptcp setsockopt function for SOL_SOCKET on mptcp socketsFlorian Westphal
setsockopt(mptcp_fd, SOL_SOCKET, ...)... appears to work (returns 0), but it has no effect -- this is because the MPTCP layer never has a chance to copy the settings to the subflow socket. Skip the generic handling for the mptcp case and instead call the mptcp specific handler instead for SOL_SOCKET too. Next patch adds more specific handling for SOL_SOCKET to mptcp. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-01mptcp: add receive buffer auto-tuningFlorian Westphal
When mptcp is used, userspace doesn't read from the tcp (subflow) socket but from the parent (mptcp) socket receive queue. skbs are moved from the subflow socket to the mptcp rx queue either from 'data_ready' callback (if mptcp socket can be locked), a work queue, or the socket receive function. This means tcp_rcv_space_adjust() is never called and thus no receive buffer size auto-tuning is done. An earlier (not merged) patch added tcp_rcv_space_adjust() calls to the function that moves skbs from subflow to mptcp socket. While this enabled autotuning, it also meant tuning was done even if userspace was reading the mptcp socket very slowly. This adds mptcp_rcv_space_adjust() and calls it after userspace has read data from the mptcp socket rx queue. Its very similar to tcp_rcv_space_adjust, with two differences: 1. The rtt estimate is the largest one observed on a subflow 2. The rcvbuf size and window clamp of all subflows is adjusted to the mptcp-level rcvbuf. Otherwise, we get spurious drops at tcp (subflow) socket level if the skbs are not moved to the mptcp socket fast enough. Before: time mptcp_connect.sh -t -f $((4*1024*1024)) -d 300 -l 0.01% -r 0 -e "" -m mmap [..] ns4 MPTCP -> ns3 (10.0.3.2:10108 ) MPTCP (duration 40823ms) [ OK ] ns4 MPTCP -> ns3 (10.0.3.2:10109 ) TCP (duration 23119ms) [ OK ] ns4 TCP -> ns3 (10.0.3.2:10110 ) MPTCP (duration 5421ms) [ OK ] ns4 MPTCP -> ns3 (dead:beef:3::2:10111) MPTCP (duration 41446ms) [ OK ] ns4 MPTCP -> ns3 (dead:beef:3::2:10112) TCP (duration 23427ms) [ OK ] ns4 TCP -> ns3 (dead:beef:3::2:10113) MPTCP (duration 5426ms) [ OK ] Time: 1396 seconds After: ns4 MPTCP -> ns3 (10.0.3.2:10108 ) MPTCP (duration 5417ms) [ OK ] ns4 MPTCP -> ns3 (10.0.3.2:10109 ) TCP (duration 5427ms) [ OK ] ns4 TCP -> ns3 (10.0.3.2:10110 ) MPTCP (duration 5422ms) [ OK ] ns4 MPTCP -> ns3 (dead:beef:3::2:10111) MPTCP (duration 5415ms) [ OK ] ns4 MPTCP -> ns3 (dead:beef:3::2:10112) TCP (duration 5422ms) [ OK ] ns4 TCP -> ns3 (dead:beef:3::2:10113) MPTCP (duration 5423ms) [ OK ] Time: 296 seconds Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-30mptcp: do nonce initialization at subflow creation timePaolo Abeni
This clean-up the code a bit, reduces the number of used hooks and indirect call requested, and allow better error reporting from __mptcp_subflow_connect() Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-29mptcp: close poll() racesPaolo Abeni
mptcp_poll always return POLLOUT for unblocking connect(), ensure that the socket is a suitable state. The MPTCP_DATA_READY bit is never cleared on accept: ensure we don't leave mptcp_accept() with an empty accept queue and such bit set. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-29mptcp: __mptcp_tcp_fallback() returns a struct sockPaolo Abeni
Currently __mptcp_tcp_fallback() always return NULL on incoming connections, because MPTCP does not create the additional socket for the first subflow. Since the previous commit no __mptcp_tcp_fallback() caller needs a struct socket, so let __mptcp_tcp_fallback() return the first subflow sock and cope correctly even with incoming connections. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-29mptcp: create first subflow at msk creation timePaolo Abeni
This cleans the code a bit and makes the behavior more consistent. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-29mptcp: check for plain TCP sock at accept timePaolo Abeni
This cleanup the code a bit and avoid corrupted states on weird syscall sequence (accept(), connect()). Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-29mptcp: fallback in case of simultaneous connectDavide Caratti
when a MPTCP client tries to connect to itself, tcp_finish_connect() is never reached. Because of this, depending on the socket current state, multiple faulty behaviours can be observed: 1) a WARN_ON() in subflow_data_ready() is hit WARNING: CPU: 2 PID: 882 at net/mptcp/subflow.c:911 subflow_data_ready+0x18b/0x230 [...] CPU: 2 PID: 882 Comm: gh35 Not tainted 5.7.0+ #187 [...] RIP: 0010:subflow_data_ready+0x18b/0x230 [...] Call Trace: tcp_data_queue+0xd2f/0x4250 tcp_rcv_state_process+0xb1c/0x49d3 tcp_v4_do_rcv+0x2bc/0x790 __release_sock+0x153/0x2d0 release_sock+0x4f/0x170 mptcp_shutdown+0x167/0x4e0 __sys_shutdown+0xe6/0x180 __x64_sys_shutdown+0x50/0x70 do_syscall_64+0x9a/0x370 entry_SYSCALL_64_after_hwframe+0x44/0xa9 2) client is stuck forever in mptcp_sendmsg() because the socket is not TCP_ESTABLISHED crash> bt 4847 PID: 4847 TASK: ffff88814b2fb100 CPU: 1 COMMAND: "gh35" #0 [ffff8881376ff680] __schedule at ffffffff97248da4 #1 [ffff8881376ff778] schedule at ffffffff9724a34f #2 [ffff8881376ff7a0] schedule_timeout at ffffffff97252ba0 #3 [ffff8881376ff8a8] wait_woken at ffffffff958ab4ba #4 [ffff8881376ff940] sk_stream_wait_connect at ffffffff96c2d859 #5 [ffff8881376ffa28] mptcp_sendmsg at ffffffff97207fca #6 [ffff8881376ffbc0] sock_sendmsg at ffffffff96be1b5b #7 [ffff8881376ffbe8] sock_write_iter at ffffffff96be1daa #8 [ffff8881376ffce8] new_sync_write at ffffffff95e5cb52 #9 [ffff8881376ffe50] vfs_write at ffffffff95e6547f #10 [ffff8881376ffe90] ksys_write at ffffffff95e65d26 #11 [ffff8881376fff28] do_syscall_64 at ffffffff956088ba #12 [ffff8881376fff50] entry_SYSCALL_64_after_hwframe at ffffffff9740008c RIP: 00007f126f6956ed RSP: 00007ffc2a320278 RFLAGS: 00000217 RAX: ffffffffffffffda RBX: 0000000020000044 RCX: 00007f126f6956ed RDX: 0000000000000004 RSI: 00000000004007b8 RDI: 0000000000000003 RBP: 00007ffc2a3202a0 R8: 0000000000400720 R9: 0000000000400720 R10: 0000000000400720 R11: 0000000000000217 R12: 00000000004004b0 R13: 00007ffc2a320380 R14: 0000000000000000 R15: 0000000000000000 ORIG_RAX: 0000000000000001 CS: 0033 SS: 002b 3) tcpdump captures show that DSS is exchanged even when MP_CAPABLE handshake didn't complete. $ tcpdump -tnnr bad.pcap IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [S], seq 3208913911, win 65483, options [mss 65495,sackOK,TS val 3291706876 ecr 3291694721,nop,wscale 7,mptcp capable v1], length 0 IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [S.], seq 3208913911, ack 3208913912, win 65483, options [mss 65495,sackOK,TS val 3291706876 ecr 3291706876,nop,wscale 7,mptcp capable v1], length 0 IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [.], ack 1, win 512, options [nop,nop,TS val 3291706876 ecr 3291706876], length 0 IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [F.], seq 1, ack 1, win 512, options [nop,nop,TS val 3291707876 ecr 3291706876,mptcp dss fin seq 0 subseq 0 len 1,nop,nop], length 0 IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [.], ack 2, win 512, options [nop,nop,TS val 3291707876 ecr 3291707876], length 0 force a fallback to TCP in these cases, and adjust the main socket state to avoid hanging in mptcp_sendmsg(). Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/35 Reported-by: Christoph Paasch <cpaasch@apple.com> Suggested-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-29net: mptcp: improve fallback to TCPDavide Caratti
Keep using MPTCP sockets and a use "dummy mapping" in case of fallback to regular TCP. When fallback is triggered, skip addition of the MPTCP option on send. Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/11 Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/22 Co-developed-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-26mptcp: introduce token KUNIT self-testsPaolo Abeni
Unit tests for the internal MPTCP token APIs, using KUNIT v1 -> v2: - use the correct RCU annotation when initializing icsk ulp - fix a few checkpatch issues Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-26mptcp: move crypto test to KUNITPaolo Abeni
currently MPTCP uses a custom hook to executed unit tests at boot time. Let's use the KUNIT framework instead. Additionally move the relevant code to a separate file and export the function needed by the test when self-tests are build as a module. Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-26mptcp: refactor token containerPaolo Abeni
Replace the radix tree with a hash table allocated at boot time. The radix tree has some shortcoming: a single lock is contented by all the mptcp operation, the lookup currently use such lock, and traversing all the items would require a lock, too. With hash table instead we trade a little memory to address all the above - a per bucket lock is used. To hash the MPTCP sockets, we re-use the msk' sk_node entry: the MPTCP sockets are never hashed by the stack. Replace the existing hash proto callbacks with a dummy implementation, annotating the above constraint. Additionally refactor the token creation to code to: - limit the number of consecutive attempts to a fixed maximum. Hitting a hash bucket with a long chain is considered a failed attempt - accept() no longer can fail to token management. - if token creation fails at connect() time, we do fallback to TCP (before the connection was closed) v1 -> v2: - fix "no newline at end of file" - Jakub Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-26mptcp: add __init annotation on setup functionsPaolo Abeni
Add the missing annotation in some setup-only functions. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-25Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller
Minor overlapping changes in xfrm_device.c, between the double ESP trailing bug fix setting the XFRM_INIT flag and the changes in net-next preparing for bonding encryption support. Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-23tcp: move ipv4_specific to tcp include fileEric Dumazet
Declare ipv4_specific once, in tcp.h were it belongs. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-23tcp: move ipv6_specific declaration to remove a warningEric Dumazet
ipv6_specific should be declared in tcp include files, not mptcp. This removes the following warning : CHECK net/ipv6/tcp_ipv6.c net/ipv6/tcp_ipv6.c:78:42: warning: symbol 'ipv6_specific' was not declared. Should it be static? Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-22mptcp: drop sndr_key in mptcp_syn_optionsGeliang Tang
In RFC 8684, we don't need to send sndr_key in SYN package anymore, so drop it. Fixes: cc7972ea1932 ("mptcp: parse and emit MP_CAPABLE option according to v1 spec") Signed-off-by: Geliang Tang <geliangtang@gmail.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-18mptcp: drop MP_JOIN request sock on syn cookiesPaolo Abeni
Currently any MPTCP socket using syn cookies will fallback to TCP at 3rd ack time. In case of MP_JOIN requests, the RFC mandate closing the child and sockets, but the existing error paths do not handle the syncookie scenario correctly. Address the issue always forcing the child shutdown in case of MP_JOIN fallback. Fixes: ae2dd7164943 ("mptcp: handle tcp fallback when using syn cookies") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-18mptcp: cache msk on MP_JOIN init_reqPaolo Abeni
The msk ownership is transferred to the child socket at 3rd ack time, so that we avoid more lookups later. If the request does not reach the 3rd ack, the MSK reference is dropped at request sock release time. As a side effect, fallback is now tracked by a NULL msk reference instead of zeroed 'mp_join' field. This will simplify the next patch. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-15mptcp: fix memory leak in mptcp_subflow_create_socket()Wei Yongjun
socket malloced by sock_create_kern() should be release before return in the error handling, otherwise it cause memory leak. unreferenced object 0xffff88810910c000 (size 1216): comm "00000003_test_m", pid 12238, jiffies 4295050289 (age 54.237s) hex dump (first 32 bytes): 01 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 2f 30 0a 81 88 ff ff ........./0..... backtrace: [<00000000e877f89f>] sock_alloc_inode+0x18/0x1c0 [<0000000093d1dd51>] alloc_inode+0x63/0x1d0 [<000000005673fec6>] new_inode_pseudo+0x14/0xe0 [<00000000b5db6be8>] sock_alloc+0x3c/0x260 [<00000000e7e3cbb2>] __sock_create+0x89/0x620 [<0000000023e48593>] mptcp_subflow_create_socket+0xc0/0x5e0 [<00000000419795e4>] __mptcp_socket_create+0x1ad/0x3f0 [<00000000b2f942e8>] mptcp_stream_connect+0x281/0x4f0 [<00000000c80cd5cc>] __sys_connect_file+0x14d/0x190 [<00000000dc761f11>] __sys_connect+0x128/0x160 [<000000008b14e764>] __x64_sys_connect+0x6f/0xb0 [<000000007b4f93bd>] do_syscall_64+0xa1/0x530 [<00000000d3e770b6>] entry_SYSCALL_64_after_hwframe+0x49/0xb3 Fixes: 2303f994b3e1 ("mptcp: Associate MPTCP context with TCP socket") Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-15mptcp: use list_first_entry_or_nullGeliang Tang
Use list_first_entry_or_null to simplify the code. Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-15mptcp: drop MPTCP_PM_MAX_ADDRGeliang Tang
We have defined MPTCP_PM_ADDR_MAX in pm_netlink.c, so drop this duplicate macro. Fixes: 1b1c7a0ef7f3 ("mptcp: Add path manager interface") Signed-off-by: Geliang Tang <geliangtang@gmail.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-10mptcp: don't leak msk in token containerPaolo Abeni
If a listening MPTCP socket has unaccepted sockets at close time, the related msks are freed via mptcp_sock_destruct(), which in turn does not invoke the proto->destroy() method nor the mptcp_token_destroy() function. Due to the above, the child msk socket is not removed from the token container, leading to later UaF. Address the issue explicitly removing the token even in the above error path. Fixes: 79c0949e9a09 ("mptcp: Add key generation and token tree") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-10mptcp: fix races between shutdown and recvmsgPaolo Abeni
The msk sk_shutdown flag is set by a workqueue, possibly introducing some delay in user-space notification. If the last subflow carries some data with the fin packet, the user space can wake-up before RCV_SHUTDOWN is set. If it executes unblocking recvmsg(), it may return with an error instead of eof. Address the issue explicitly checking for eof in recvmsg(), when no data is found. Fixes: 59832e246515 ("mptcp: subflow: check parent mptcp socket on subflow state change") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-08mptcp: bugfix for RM_ADDR option parsingGeliang Tang
In MPTCPOPT_RM_ADDR option parsing, the pointer "ptr" pointed to the "Subtype" octet, the pointer "ptr+1" pointed to the "Address ID" octet: +-------+-------+---------------+ |Subtype|(resvd)| Address ID | +-------+-------+---------------+ | | ptr ptr+1 We should set mp_opt->rm_id to the value of "ptr+1", not "ptr". This patch will fix this bug. Fixes: 3df523ab582c ("mptcp: Add ADD_ADDR handling") Signed-off-by: Geliang Tang <geliangtang@gmail.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-nextLinus Torvalds
Pull networking updates from David Miller: 1) Allow setting bluetooth L2CAP modes via socket option, from Luiz Augusto von Dentz. 2) Add GSO partial support to igc, from Sasha Neftin. 3) Several cleanups and improvements to r8169 from Heiner Kallweit. 4) Add IF_OPER_TESTING link state and use it when ethtool triggers a device self-test. From Andrew Lunn. 5) Start moving away from custom driver versions, use the globally defined kernel version instead, from Leon Romanovsky. 6) Support GRO vis gro_cells in DSA layer, from Alexander Lobakin. 7) Allow hard IRQ deferral during NAPI, from Eric Dumazet. 8) Add sriov and vf support to hinic, from Luo bin. 9) Support Media Redundancy Protocol (MRP) in the bridging code, from Horatiu Vultur. 10) Support netmap in the nft_nat code, from Pablo Neira Ayuso. 11) Allow UDPv6 encapsulation of ESP in the ipsec code, from Sabrina Dubroca. Also add ipv6 support for espintcp. 12) Lots of ReST conversions of the networking documentation, from Mauro Carvalho Chehab. 13) Support configuration of ethtool rxnfc flows in bcmgenet driver, from Doug Berger. 14) Allow to dump cgroup id and filter by it in inet_diag code, from Dmitry Yakunin. 15) Add infrastructure to export netlink attribute policies to userspace, from Johannes Berg. 16) Several optimizations to sch_fq scheduler, from Eric Dumazet. 17) Fallback to the default qdisc if qdisc init fails because otherwise a packet scheduler init failure will make a device inoperative. From Jesper Dangaard Brouer. 18) Several RISCV bpf jit optimizations, from Luke Nelson. 19) Correct the return type of the ->ndo_start_xmit() method in several drivers, it's netdev_tx_t but many drivers were using 'int'. From Yunjian Wang. 20) Add an ethtool interface for PHY master/slave config, from Oleksij Rempel. 21) Add BPF iterators, from Yonghang Song. 22) Add cable test infrastructure, including ethool interfaces, from Andrew Lunn. Marvell PHY driver is the first to support this facility. 23) Remove zero-length arrays all over, from Gustavo A. R. Silva. 24) Calculate and maintain an explicit frame size in XDP, from Jesper Dangaard Brouer. 25) Add CAP_BPF, from Alexei Starovoitov. 26) Support terse dumps in the packet scheduler, from Vlad Buslov. 27) Support XDP_TX bulking in dpaa2 driver, from Ioana Ciornei. 28) Add devm_register_netdev(), from Bartosz Golaszewski. 29) Minimize qdisc resets, from Cong Wang. 30) Get rid of kernel_getsockopt and kernel_setsockopt in order to eliminate set_fs/get_fs calls. From Christoph Hellwig. * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2517 commits) selftests: net: ip_defrag: ignore EPERM net_failover: fixed rollback in net_failover_open() Revert "tipc: Fix potential tipc_aead refcnt leak in tipc_crypto_rcv" Revert "tipc: Fix potential tipc_node refcnt leak in tipc_rcv" vmxnet3: allow rx flow hash ops only when rss is enabled hinic: add set_channels ethtool_ops support selftests/bpf: Add a default $(CXX) value tools/bpf: Don't use $(COMPILE.c) bpf, selftests: Use bpf_probe_read_kernel s390/bpf: Use bcr 0,%0 as tail call nop filler s390/bpf: Maintain 8-byte stack alignment selftests/bpf: Fix verifier test selftests/bpf: Fix sample_cnt shared between two threads bpf, selftests: Adapt cls_redirect to call csum_level helper bpf: Add csum_level helper for fixing up csum levels bpf: Fix up bpf_skb_adjust_room helper's skb csum setting sfc: add missing annotation for efx_ef10_try_update_nic_stats_vf() crypto/chtls: IPv6 support for inline TLS Crypto/chcr: Fixes a coccinile check error Crypto/chcr: Fixes compilations warnings ...