summaryrefslogtreecommitdiff
path: root/net/mptcp/subflow.c
AgeCommit message (Collapse)Author
2023-06-01mptcp: consolidate passive msk socket initializationPaolo Abeni
When the msk socket is cloned at MPC handshake time, a few fields are initialized in a racy way outside mptcp_sk_clone() and the msk socket lock. The above is due historical reasons: before commit a88d0092b24b ("mptcp: simplify subflow_syn_recv_sock()") as the first subflow socket carrying all the needed date was not available yet at msk creation time We can now refactor the code moving the missing initialization bit under the socket lock, removing the init race and avoiding some code duplication. This will also simplify the next patch, as all msk->first write access are now under the msk socket lock. Fixes: 0397c6d85f9c ("mptcp: keep unaccepted MPC subflow into join list") Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-20Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Adjacent changes: net/mptcp/protocol.h 63740448a32e ("mptcp: fix accept vs worker race") 2a6a870e44dd ("mptcp: stops worker on unaccepted sockets at listener close") ddb1a072f858 ("mptcp: move first subflow allocation at mpc access time") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-19mptcp: fix accept vs worker racePaolo Abeni
The mptcp worker and mptcp_accept() can race, as reported by Christoph: refcount_t: addition on 0; use-after-free. WARNING: CPU: 1 PID: 14351 at lib/refcount.c:25 refcount_warn_saturate+0x105/0x1b0 lib/refcount.c:25 Modules linked in: CPU: 1 PID: 14351 Comm: syz-executor.2 Not tainted 6.3.0-rc1-gde5e8fd0123c #11 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014 RIP: 0010:refcount_warn_saturate+0x105/0x1b0 lib/refcount.c:25 Code: 02 31 ff 89 de e8 1b f0 a7 ff 84 db 0f 85 6e ff ff ff e8 3e f5 a7 ff 48 c7 c7 d8 c7 34 83 c6 05 6d 2d 0f 02 01 e8 cb 3d 90 ff <0f> 0b e9 4f ff ff ff e8 1f f5 a7 ff 0f b6 1d 54 2d 0f 02 31 ff 89 RSP: 0018:ffffc90000a47bf8 EFLAGS: 00010282 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 RDX: ffff88802eae98c0 RSI: ffffffff81097d4f RDI: 0000000000000001 RBP: ffff88802e712180 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000001 R11: ffff88802eaea148 R12: ffff88802e712100 R13: ffff88802e712a88 R14: ffff888005cb93a8 R15: ffff88802e712a88 FS: 0000000000000000(0000) GS:ffff88803ed00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f277fd89120 CR3: 0000000035486002 CR4: 0000000000370ee0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> __refcount_add include/linux/refcount.h:199 [inline] __refcount_inc include/linux/refcount.h:250 [inline] refcount_inc include/linux/refcount.h:267 [inline] sock_hold include/net/sock.h:775 [inline] __mptcp_close+0x4c6/0x4d0 net/mptcp/protocol.c:3051 mptcp_close+0x24/0xe0 net/mptcp/protocol.c:3072 inet_release+0x56/0xa0 net/ipv4/af_inet.c:429 __sock_release+0x51/0xf0 net/socket.c:653 sock_close+0x18/0x20 net/socket.c:1395 __fput+0x113/0x430 fs/file_table.c:321 task_work_run+0x96/0x100 kernel/task_work.c:179 exit_task_work include/linux/task_work.h:38 [inline] do_exit+0x4fc/0x10c0 kernel/exit.c:869 do_group_exit+0x51/0xf0 kernel/exit.c:1019 get_signal+0x12b0/0x1390 kernel/signal.c:2859 arch_do_signal_or_restart+0x25/0x260 arch/x86/kernel/signal.c:306 exit_to_user_mode_loop kernel/entry/common.c:168 [inline] exit_to_user_mode_prepare+0x131/0x1a0 kernel/entry/common.c:203 __syscall_exit_to_user_mode_work kernel/entry/common.c:285 [inline] syscall_exit_to_user_mode+0x19/0x40 kernel/entry/common.c:296 do_syscall_64+0x46/0x90 arch/x86/entry/common.c:86 entry_SYSCALL_64_after_hwframe+0x72/0xdc RIP: 0033:0x7fec4b4926a9 Code: Unable to access opcode bytes at 0x7fec4b49267f. RSP: 002b:00007fec49f9dd78 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca RAX: fffffffffffffe00 RBX: 00000000006bc058 RCX: 00007fec4b4926a9 RDX: 0000000000000000 RSI: 0000000000000080 RDI: 00000000006bc058 RBP: 00000000006bc050 R08: 00000000007df998 R09: 00000000007df998 R10: 0000000000000000 R11: 0000000000000246 R12: 00000000006bc05c R13: fffffffffffffea8 R14: 000000000000000b R15: 000000000001fe40 </TASK> The root cause is that the worker can force fallback to TCP the first mptcp subflow, actually deleting the unaccepted msk socket. We can explicitly prevent the race delaying the unaccepted msk deletion at listener shutdown time. In case the closed subflow is later accepted, just drop the mptcp context and let the user-space deal with the paired mptcp socket. Fixes: b6985b9b8295 ("mptcp: use the workqueue to destroy unaccepted sockets") Cc: stable@vger.kernel.org Reported-by: Christoph Paasch <cpaasch@apple.com> Link: https://github.com/multipath-tcp/mptcp_net-next/issues/375 Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Tested-by: Christoph Paasch <cpaasch@apple.com> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-19mptcp: stops worker on unaccepted sockets at listener closePaolo Abeni
This is a partial revert of the blamed commit, with a relevant change: mptcp_subflow_queue_clean() now just change the msk socket status and stop the worker, so that the UaF issue addressed by the blamed commit is not re-introduced. The above prevents the mptcp worker from running concurrently with inet_csk_listen_stop(), as such race would trigger a warning, as reported by Christoph: RSP: 002b:00007f784fe09cd8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e WARNING: CPU: 0 PID: 25807 at net/ipv4/inet_connection_sock.c:1387 inet_csk_listen_stop+0x664/0x870 net/ipv4/inet_connection_sock.c:1387 RAX: ffffffffffffffda RBX: 00000000006bc050 RCX: 00007f7850afd6a9 RDX: 0000000000000000 RSI: 0000000020000340 RDI: 0000000000000004 Modules linked in: RBP: 0000000000000002 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00000000006bc05c R13: fffffffffffffea8 R14: 00000000006bc050 R15: 000000000001fe40 </TASK> CPU: 0 PID: 25807 Comm: syz-executor.7 Not tainted 6.2.0-g778e54711659 #7 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014 RIP: 0010:inet_csk_listen_stop+0x664/0x870 net/ipv4/inet_connection_sock.c:1387 RAX: 0000000000000000 RBX: ffff888100dfbd40 RCX: 0000000000000000 RDX: ffff8881363aab80 RSI: ffffffff81c494f4 RDI: 0000000000000005 RBP: ffff888126dad080 R08: 0000000000000005 R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000000 R12: ffff888100dfe040 R13: 0000000000000001 R14: 0000000000000000 R15: ffff888100dfbdd8 FS: 00007f7850a2c800(0000) GS:ffff88813bc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000001b32d26000 CR3: 000000012fdd8006 CR4: 0000000000770ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <TASK> __tcp_close+0x5b2/0x620 net/ipv4/tcp.c:2875 __mptcp_close_ssk+0x145/0x3d0 net/mptcp/protocol.c:2427 mptcp_destroy_common+0x8a/0x1c0 net/mptcp/protocol.c:3277 mptcp_destroy+0x41/0x60 net/mptcp/protocol.c:3304 __mptcp_destroy_sock+0x56/0x140 net/mptcp/protocol.c:2965 __mptcp_close+0x38f/0x4a0 net/mptcp/protocol.c:3057 mptcp_close+0x24/0xe0 net/mptcp/protocol.c:3072 inet_release+0x53/0xa0 net/ipv4/af_inet.c:429 __sock_release+0x4e/0xf0 net/socket.c:651 sock_close+0x15/0x20 net/socket.c:1393 __fput+0xff/0x420 fs/file_table.c:321 task_work_run+0x8b/0xe0 kernel/task_work.c:179 resume_user_mode_work include/linux/resume_user_mode.h:49 [inline] exit_to_user_mode_loop kernel/entry/common.c:171 [inline] exit_to_user_mode_prepare+0x113/0x120 kernel/entry/common.c:203 __syscall_exit_to_user_mode_work kernel/entry/common.c:285 [inline] syscall_exit_to_user_mode+0x1d/0x40 kernel/entry/common.c:296 do_syscall_64+0x46/0x90 arch/x86/entry/common.c:86 entry_SYSCALL_64_after_hwframe+0x72/0xdc RIP: 0033:0x7f7850af70dc RAX: 0000000000000000 RBX: 0000000000000004 RCX: 00007f7850af70dc RDX: 00007f7850a2c800 RSI: 0000000000000002 RDI: 0000000000000003 RBP: 00000000006bd980 R08: 0000000000000000 R09: 00000000000018a0 R10: 00000000316338a4 R11: 0000000000000293 R12: 0000000000211e31 R13: 00000000006bc05c R14: 00007f785062c000 R15: 0000000000211af0 Fixes: 0a3f4f1f9c27 ("mptcp: fix UaF in listener shutdown") Cc: stable@vger.kernel.org Reported-by: Christoph Paasch <cpaasch@apple.com> Link: https://github.com/multipath-tcp/mptcp_net-next/issues/371 Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-17mptcp: drop unneeded argumentPaolo Abeni
After commit 3a236aef280e ("mptcp: refactor passive socket initialization"), every mptcp_pm_fully_established() call is always invoked with a GFP_ATOMIC argument. We can then drop it. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-13Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Conflicts: tools/testing/selftests/net/config 62199e3f1658 ("selftests: net: Add VXLAN MDB test") 3a0385be133e ("selftests: add the missing CONFIG_IP_SCTP in net config") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-13mptcp: use mptcp_schedule_work instead of open-coding itPaolo Abeni
Beyond reducing code duplication this also avoids scheduling the mptcp_worker on a closed socket on some edge scenarios. The addressed issue is actually older than the blamed commit below, but this fix needs it as a pre-requisite. Fixes: ba8f48f7a4d7 ("mptcp: introduce mptcp_schedule_work") Cc: stable@vger.kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-29mptcp: simplify subflow_syn_recv_sock()Paolo Abeni
Postpone the msk cloning to the child process creation so that we can avoid a bunch of conditionals. Link: https://github.com/multipath-tcp/mptcp_net-next/issues/61 Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-29mptcp: avoid unneeded address copyPaolo Abeni
In the syn_recv fallback path, the msk is unused. We can skip setting the socket address. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
net/wireless/nl80211.c b27f07c50a73 ("wifi: nl80211: fix puncturing bitmap policy") cbbaf2bb829b ("wifi: nl80211: add a command to enable/disable HW timestamping") https://lore.kernel.org/all/20230314105421.3608efae@canb.auug.org.au tools/testing/selftests/net/Makefile 62199e3f1658 ("selftests: net: Add VXLAN MDB test") 13715acf8ab5 ("selftest: Add test for bind() conflicts.") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-17mptcp: annotate lockless accesses to sk->sk_errEric Dumazet
mptcp_poll() reads sk->sk_err without socket lock held/owned. Add READ_ONCE() and WRITE_ONCE() to avoid load/store tearing. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-10mptcp: avoid setting TCP_CLOSE state twiceMatthieu Baerts
tcp_set_state() is called from tcp_done() already. There is then no need to first set the state to TCP_CLOSE, then call tcp_done(). Fixes: d582484726c4 ("mptcp: fix fallback for MP_JOIN subflows") Cc: stable@vger.kernel.org Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/362 Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-10mptcp: add ro_after_init for tcp{,v6}_prot_overrideGeliang Tang
Add __ro_after_init labels for the variables tcp_prot_override and tcpv6_prot_override, just like other variables adjacent to them, to indicate that they are initialised from the init hooks and no writes occur afterwards. Fixes: b19bc2945b40 ("mptcp: implement delegated actions") Cc: stable@vger.kernel.org Fixes: 51fa7f8ebf0e ("mptcp: mark ops structures as ro_after_init") Signed-off-by: Geliang Tang <geliang.tang@suse.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-10mptcp: fix UaF in listener shutdownPaolo Abeni
As reported by Christoph after having refactored the passive socket initialization, the mptcp listener shutdown path is prone to an UaF issue. BUG: KASAN: use-after-free in _raw_spin_lock_bh+0x73/0xe0 Write of size 4 at addr ffff88810cb23098 by task syz-executor731/1266 CPU: 1 PID: 1266 Comm: syz-executor731 Not tainted 6.2.0-rc59af4eaa31c1f6c00c8f1e448ed99a45c66340dd5 #6 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x6e/0x91 print_report+0x16a/0x46f kasan_report+0xad/0x130 kasan_check_range+0x14a/0x1a0 _raw_spin_lock_bh+0x73/0xe0 subflow_error_report+0x6d/0x110 sk_error_report+0x3b/0x190 tcp_disconnect+0x138c/0x1aa0 inet_child_forget+0x6f/0x2e0 inet_csk_listen_stop+0x209/0x1060 __mptcp_close_ssk+0x52d/0x610 mptcp_destroy_common+0x165/0x640 mptcp_destroy+0x13/0x80 __mptcp_destroy_sock+0xe7/0x270 __mptcp_close+0x70e/0x9b0 mptcp_close+0x2b/0x150 inet_release+0xe9/0x1f0 __sock_release+0xd2/0x280 sock_close+0x15/0x20 __fput+0x252/0xa20 task_work_run+0x169/0x250 exit_to_user_mode_prepare+0x113/0x120 syscall_exit_to_user_mode+0x1d/0x40 do_syscall_64+0x48/0x90 entry_SYSCALL_64_after_hwframe+0x72/0xdc The msk grace period can legitly expire in between the last reference count dropped in mptcp_subflow_queue_clean() and the later eventual access in inet_csk_listen_stop() After the previous patch we don't need anymore special-casing msk listener socket cleanup: the mptcp worker will process each of the unaccepted msk sockets. Just drop the now unnecessary code. Please note this commit depends on the two parent ones: mptcp: refactor passive socket initialization mptcp: use the workqueue to destroy unaccepted sockets Fixes: 6aeed9045071 ("mptcp: fix race on unaccepted mptcp sockets") Cc: stable@vger.kernel.org Reported-and-tested-by: Christoph Paasch <cpaasch@apple.com> Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/346 Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-10mptcp: use the workqueue to destroy unaccepted socketsPaolo Abeni
Christoph reported a UaF at token lookup time after having refactored the passive socket initialization part: BUG: KASAN: use-after-free in __token_bucket_busy+0x253/0x260 Read of size 4 at addr ffff88810698d5b0 by task syz-executor653/3198 CPU: 1 PID: 3198 Comm: syz-executor653 Not tainted 6.2.0-rc59af4eaa31c1f6c00c8f1e448ed99a45c66340dd5 #6 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x6e/0x91 print_report+0x16a/0x46f kasan_report+0xad/0x130 __token_bucket_busy+0x253/0x260 mptcp_token_new_connect+0x13d/0x490 mptcp_connect+0x4ed/0x860 __inet_stream_connect+0x80e/0xd90 tcp_sendmsg_fastopen+0x3ce/0x710 mptcp_sendmsg+0xff1/0x1a20 inet_sendmsg+0x11d/0x140 __sys_sendto+0x405/0x490 __x64_sys_sendto+0xdc/0x1b0 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x72/0xdc We need to properly clean-up all the paired MPTCP-level resources and be sure to release the msk last, even when the unaccepted subflow is destroyed by the TCP internals via inet_child_forget(). We can re-use the existing MPTCP_WORK_CLOSE_SUBFLOW infra, explicitly checking that for the critical scenario: the closed subflow is the MPC one, the msk is not accepted and eventually going through full cleanup. With such change, __mptcp_destroy_sock() is always called on msk sockets, even on accepted ones. We don't need anymore to transiently drop one sk reference at msk clone time. Please note this commit depends on the parent one: mptcp: refactor passive socket initialization Fixes: 58b09919626b ("mptcp: create msk early") Cc: stable@vger.kernel.org Reported-and-tested-by: Christoph Paasch <cpaasch@apple.com> Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/347 Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-10mptcp: refactor passive socket initializationPaolo Abeni
After commit 30e51b923e43 ("mptcp: fix unreleased socket in accept queue") unaccepted msk sockets go throu complete shutdown, we don't need anymore to delay inserting the first subflow into the subflow lists. The reference counting deserve some extra care, as __mptcp_close() is unaware of the request socket linkage to the first subflow. Please note that this is more a refactoring than a fix but because this modification is needed to include other corrections, see the following commits. Then a Fixes tag has been added here to help the stable team. Fixes: 30e51b923e43 ("mptcp: fix unreleased socket in accept queue") Cc: stable@vger.kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Tested-by: Christoph Paasch <cpaasch@apple.com> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-10mptcp: fix possible deadlock in subflow_error_reportPaolo Abeni
Christoph reported a possible deadlock while the TCP stack destroys an unaccepted subflow due to an incoming reset: the MPTCP socket error path tries to acquire the msk-level socket lock while TCP still owns the listener socket accept queue spinlock, and the reverse dependency already exists in the TCP stack. Note that the above is actually a lockdep false positive, as the chain involves two separate sockets. A different per-socket lockdep key will address the issue, but such a change will be quite invasive. Instead, we can simply stop earlier the socket error handling for orphaned or unaccepted subflows, breaking the critical lockdep chain. Error handling in such a scenario is a no-op. Reported-and-tested-by: Christoph Paasch <cpaasch@apple.com> Fixes: 15cc10453398 ("mptcp: deliver ssk errors to msk") Cc: stable@vger.kernel.org Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/355 Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-09Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
net/devlink/leftover.c / net/core/devlink.c: 565b4824c39f ("devlink: change port event netdev notifier from per-net to global") f05bd8ebeb69 ("devlink: move code to a dedicated directory") 687125b5799c ("devlink: split out core code") https://lore.kernel.org/all/20230208094657.379f2b1a@canb.auug.org.au/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-08mptcp: be careful on subflow status propagation on errorsPaolo Abeni
Currently the subflow error report callback unconditionally propagates the fallback subflow status to the owning msk. If the msk is already orphaned, the above prevents the code from correctly tracking the msk moving to the TCP_CLOSE state and doing the appropriate cleanup. All the above causes increasing memory usage over time and sporadic self-tests failures. There is a great deal of infrastructure trying to propagate correctly the fallback subflow status to the owning mptcp socket, e.g. via mptcp_subflow_eof() and subflow_sched_work_if_closed(): in the error propagation path we need only to cope with unorphaned sockets. Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/339 Fixes: 15cc10453398 ("mptcp: deliver ssk errors to msk") Cc: stable@vger.kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-08mptcp: fix locking for in-kernel listener creationPaolo Abeni
For consistency, in mptcp_pm_nl_create_listen_socket(), we need to call the __mptcp_nmpc_socket() under the msk socket lock. Note that as a side effect, mptcp_subflow_create_socket() needs a 'nested' lockdep annotation, as it will acquire the subflow (kernel) socket lock under the in-kernel listener msk socket lock. The current lack of locking is almost harmless, because the relevant socket is not exposed to the user space, but in future we will add more complexity to the mentioned helper, let's play safe. Fixes: 1729cf186d8a ("mptcp: create the listening socket for new port") Cc: stable@vger.kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-01-23net/sock: Introduce trace_sk_data_ready()Peilin Ye
As suggested by Cong, introduce a tracepoint for all ->sk_data_ready() callback implementations. For example: <...> iperf-609 [002] ..... 70.660425: sk_data_ready: family=2 protocol=6 func=sock_def_readable iperf-609 [002] ..... 70.660436: sk_data_ready: family=2 protocol=6 func=sock_def_readable <...> Suggested-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Peilin Ye <peilin.ye@bytedance.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-01-13mptcp: explicitly specify sock family at subflow creation timePaolo Abeni
Let the caller specify the to-be-created subflow family. For a given MPTCP socket created with the AF_INET6 family, the current userspace PM can already ask the kernel to create subflows in v4 and v6. If "plain" IPv4 addresses are passed to the kernel, they are automatically mapped in v6 addresses "by accident". This can be problematic because the userspace will need to pass different addresses, now the v4-mapped-v6 addresses to destroy this new subflow. On the other hand, if the MPTCP socket has been created with the AF_INET family, the command to create a subflow in v6 will be accepted but the result will not be the one as expected as new subflow will be created in IPv4 using part of the v6 addresses passed to the kernel: not creating the expected subflow then. No functional change intended for the in-kernel PM where an explicit enforcement is currently in place. This arbitrary enforcement will be leveraged by other patches in a future version. Fixes: 702c2f646d42 ("mptcp: netlink: allow userspace-driven subflow establishment") Cc: stable@vger.kernel.org Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-21mptcp: fix lockdep false positivePaolo Abeni
MattB reported a lockdep splat in the mptcp listener code cleanup: WARNING: possible circular locking dependency detected packetdrill/14278 is trying to acquire lock: ffff888017d868f0 ((work_completion)(&msk->work)){+.+.}-{0:0}, at: __flush_work (kernel/workqueue.c:3069) but task is already holding lock: ffff888017d84130 (sk_lock-AF_INET){+.+.}-{0:0}, at: mptcp_close (net/mptcp/protocol.c:2973) which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (sk_lock-AF_INET){+.+.}-{0:0}: __lock_acquire (kernel/locking/lockdep.c:5055) lock_acquire (kernel/locking/lockdep.c:466) lock_sock_nested (net/core/sock.c:3463) mptcp_worker (net/mptcp/protocol.c:2614) process_one_work (kernel/workqueue.c:2294) worker_thread (include/linux/list.h:292) kthread (kernel/kthread.c:376) ret_from_fork (arch/x86/entry/entry_64.S:312) -> #0 ((work_completion)(&msk->work)){+.+.}-{0:0}: check_prev_add (kernel/locking/lockdep.c:3098) validate_chain (kernel/locking/lockdep.c:3217) __lock_acquire (kernel/locking/lockdep.c:5055) lock_acquire (kernel/locking/lockdep.c:466) __flush_work (kernel/workqueue.c:3070) __cancel_work_timer (kernel/workqueue.c:3160) mptcp_cancel_work (net/mptcp/protocol.c:2758) mptcp_subflow_queue_clean (net/mptcp/subflow.c:1817) __mptcp_close_ssk (net/mptcp/protocol.c:2363) mptcp_destroy_common (net/mptcp/protocol.c:3170) mptcp_destroy (include/net/sock.h:1495) __mptcp_destroy_sock (net/mptcp/protocol.c:2886) __mptcp_close (net/mptcp/protocol.c:2959) mptcp_close (net/mptcp/protocol.c:2974) inet_release (net/ipv4/af_inet.c:432) __sock_release (net/socket.c:651) sock_close (net/socket.c:1367) __fput (fs/file_table.c:320) task_work_run (kernel/task_work.c:181 (discriminator 1)) exit_to_user_mode_prepare (include/linux/resume_user_mode.h:49) syscall_exit_to_user_mode (kernel/entry/common.c:130) do_syscall_64 (arch/x86/entry/common.c:87) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120) other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(sk_lock-AF_INET); lock((work_completion)(&msk->work)); lock(sk_lock-AF_INET); lock((work_completion)(&msk->work)); *** DEADLOCK *** The report is actually a false positive, since the only existing lock nesting is the msk socket lock acquired by the mptcp work. cancel_work_sync() is invoked without the relevant socket lock being held, but under a different (the msk listener) socket lock. We could silence the splat adding a per workqueue dynamic lockdep key, but that looks overkill. Instead just tell lockdep the msk socket lock is not held around cancel_work_sync(). Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/322 Fixes: 30e51b923e43 ("mptcp: fix unreleased socket in accept queue") Reported-by: Matthieu Baerts <matthieu.baerts@tessares.net> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-13Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netPaolo Abeni
Merge in the left-over fixes before the net-next pull-request. net/mptcp/subflow.c d3295fee3c75 ("mptcp: use proper req destructor for IPv6") 36b122baf6a8 ("mptcp: add subflow_v(4,6)_send_synack()") Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-12-12mptcp: use proper req destructor for IPv6Matthieu Baerts
Before, only the destructor from TCP request sock in IPv4 was called even if the subflow was IPv6. It is important to use the right destructor to avoid memory leaks with some advanced IPv6 features, e.g. when the request socks contain specific IPv6 options. Fixes: 79c0949e9a09 ("mptcp: Add key generation and token tree") Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Cc: stable@vger.kernel.org Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-12mptcp: dedicated request sock for subflow in v6Matthieu Baerts
tcp_request_sock_ops structure is specific to IPv4. It should then not be used with MPTCP subflows on top of IPv6. For example, it contains the 'family' field, initialised to AF_INET. This 'family' field is used by TCP FastOpen code to generate the cookie but also by TCP Metrics, SELinux and SYN Cookies. Using the wrong family will not lead to crashes but displaying/using/checking wrong things. Note that 'send_reset' callback from request_sock_ops structure is used in some error paths. It is then also important to use the correct one for IPv4 or IPv6. The slab name can also be different in IPv4 and IPv6, it will be used when printing some log messages. The slab pointer will anyway be the same because the object size is the same for both v4 and v6. A BUILD_BUG_ON() has also been added to make sure this size is the same. Fixes: cec37a6e41aa ("mptcp: Handle MP_CAPABLE options for outgoing connections") Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Cc: stable@vger.kernel.org Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-12mptcp: remove MPTCP 'ifdef' in TCP SYN cookiesMatthieu Baerts
To ease the maintenance, it is often recommended to avoid having #ifdef preprocessor conditions. Here the section related to CONFIG_MPTCP was quite short but the next commit needs to add more code around. It is then cleaner to move specific MPTCP code to functions located in net/mptcp directory. Now that mptcp_subflow_request_sock_ops structure can be static, it can also be marked as "read only after init". Suggested-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Cc: stable@vger.kernel.org Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-29mptcp: add subflow_v(4,6)_send_synack()Dmytro Shytyi
The send_synack() needs to be overridden for MPTCP to support TFO for two reasons: - There is not be enough space in the TCP options if the TFO cookie has to be added in the SYN+ACK with other options: MSS (4), SACK OK (2), Timestamps (10), Window Scale (3+1), TFO (10+2), MP_CAPABLE (12). MPTCPv1 specs -- RFC 8684, section B.1 [1] -- suggest to drop the TCP timestamps option in this case. - The data received in the SYN has to be handled: the SKB can be dequeued from the subflow sk and transferred to the MPTCP sk. Counters need to be updated accordingly and the application can be notified at the end because some bytes have been received. [1] https://www.rfc-editor.org/rfc/rfc8684.html#section-b.1 Co-developed-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Dmytro Shytyi <dmytro@shytyi.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-29mptcp: implement delayed seq generation for passive fastopenDmytro Shytyi
With fastopen in place, the first subflow socket is created before the MPC handshake completes, and we need to properly initialize the sequence numbers at MPC ACK reception. Co-developed-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Dmytro Shytyi <dmytro@shytyi.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-29mptcp: consolidate initial ack seq generationPaolo Abeni
Currently the initial ack sequence is generated on demand whenever it's requested and the remote key is handy. The relevant code is scattered in different places and can lead to multiple, unneeded, crypto operations. This change consolidates the ack sequence generation code in a single helper, storing the sequence number at the subflow level. The above additionally saves a few conditional in fast-path and will simplify the upcoming fast-open implementation. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-29Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
tools/lib/bpf/ringbuf.c 927cbb478adf ("libbpf: Handle size overflow for ringbuf mmap") b486d19a0ab0 ("libbpf: checkpatch: Fixed code alignments in ringbuf.c") https://lore.kernel.org/all/20221121122707.44d1446a@canb.auug.org.au/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-28mptcp: fix sleep in atomic at close timePaolo Abeni
Matt reported a splat at msk close time: BUG: sleeping function called from invalid context at net/mptcp/protocol.c:2877 in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 155, name: packetdrill preempt_count: 201, expected: 0 RCU nest depth: 0, expected: 0 4 locks held by packetdrill/155: #0: ffff888001536990 (&sb->s_type->i_mutex_key#6){+.+.}-{3:3}, at: __sock_release (net/socket.c:650) #1: ffff88800b498130 (sk_lock-AF_INET){+.+.}-{0:0}, at: mptcp_close (net/mptcp/protocol.c:2973) #2: ffff88800b49a130 (sk_lock-AF_INET/1){+.+.}-{0:0}, at: __mptcp_close_ssk (net/mptcp/protocol.c:2363) #3: ffff88800b49a0b0 (slock-AF_INET){+...}-{2:2}, at: __lock_sock_fast (include/net/sock.h:1820) Preemption disabled at: 0x0 CPU: 1 PID: 155 Comm: packetdrill Not tainted 6.1.0-rc5 #365 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014 Call Trace: <TASK> dump_stack_lvl (lib/dump_stack.c:107 (discriminator 4)) __might_resched.cold (kernel/sched/core.c:9891) __mptcp_destroy_sock (include/linux/kernel.h:110) __mptcp_close (net/mptcp/protocol.c:2959) mptcp_subflow_queue_clean (include/net/sock.h:1777) __mptcp_close_ssk (net/mptcp/protocol.c:2363) mptcp_destroy_common (net/mptcp/protocol.c:3170) mptcp_destroy (include/net/sock.h:1495) __mptcp_destroy_sock (net/mptcp/protocol.c:2886) __mptcp_close (net/mptcp/protocol.c:2959) mptcp_close (net/mptcp/protocol.c:2974) inet_release (net/ipv4/af_inet.c:432) __sock_release (net/socket.c:651) sock_close (net/socket.c:1367) __fput (fs/file_table.c:320) task_work_run (kernel/task_work.c:181 (discriminator 1)) exit_to_user_mode_prepare (include/linux/resume_user_mode.h:49) syscall_exit_to_user_mode (kernel/entry/common.c:130) do_syscall_64 (arch/x86/entry/common.c:87) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120) We can't call mptcp_close under the 'fast' socket lock variant, replace it with a sock_lock_nested() as the relevant code is already under the listening msk socket lock protection. Reported-by: Matthieu Baerts <matthieu.baerts@tessares.net> Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/316 Fixes: 30e51b923e43 ("mptcp: fix unreleased socket in accept queue") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-27Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
drivers/net/can/usb/kvaser_usb/kvaser_usb_leaf.c 2871edb32f46 ("can: kvaser_usb: Fix possible completions during init_completion") abb8670938b2 ("can: kvaser_usb_leaf: Ignore stale bus-off after start") 8d21f5927ae6 ("can: kvaser_usb_leaf: Fix improved state not being reported") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-26mptcp: fix tracking issue in mptcp_subflow_create_socket()Eric Dumazet
My recent patch missed that mptcp_subflow_create_socket() was creating a 'kernel' socket, then converted it to 'user' socket. Fixes: 0cafd77dcd03 ("net: add a refcount tracker for kernel sockets") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Matthieu Baerts <matthieu.baerts@tessares.net> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Link: https://lore.kernel.org/r/20221025180546.652251-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-24mptcp: set msk local address earlierPaolo Abeni
The mptcp_pm_nl_get_local_id() code assumes that the msk local address is available at that point. For passive sockets, we initialize such address at accept() time. Depending on the running configuration and the user-space timing, a passive MPJ subflow can join the msk socket before accept() completes. In such case, the PM assigns a wrong local id to the MPJ subflow and later PM netlink operations will end-up touching the wrong/unexpected subflow. All the above causes sporadic self-tests failures, especially when the host is heavy loaded. Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/308 Fixes: 01cacb00b35c ("mptcp: add netlink-based PM") Fixes: d045b9eb95a9 ("mptcp: introduce implicit endpoints") Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-28mptcp: fix unreleased socket in accept queueMenglong Dong
The mptcp socket and its subflow sockets in accept queue can't be released after the process exit. While the release of a mptcp socket in listening state, the corresponding tcp socket will be released too. Meanwhile, the tcp socket in the unaccept queue will be released too. However, only init subflow is in the unaccept queue, and the joined subflow is not in the unaccept queue, which makes the joined subflow won't be released, and therefore the corresponding unaccepted mptcp socket will not be released to. This can be reproduced easily with following steps: 1. create 2 namespace and veth: $ ip netns add mptcp-client $ ip netns add mptcp-server $ sysctl -w net.ipv4.conf.all.rp_filter=0 $ ip netns exec mptcp-client sysctl -w net.mptcp.enabled=1 $ ip netns exec mptcp-server sysctl -w net.mptcp.enabled=1 $ ip link add red-client netns mptcp-client type veth peer red-server \ netns mptcp-server $ ip -n mptcp-server address add 10.0.0.1/24 dev red-server $ ip -n mptcp-server address add 192.168.0.1/24 dev red-server $ ip -n mptcp-client address add 10.0.0.2/24 dev red-client $ ip -n mptcp-client address add 192.168.0.2/24 dev red-client $ ip -n mptcp-server link set red-server up $ ip -n mptcp-client link set red-client up 2. configure the endpoint and limit for client and server: $ ip -n mptcp-server mptcp endpoint flush $ ip -n mptcp-server mptcp limits set subflow 2 add_addr_accepted 2 $ ip -n mptcp-client mptcp endpoint flush $ ip -n mptcp-client mptcp limits set subflow 2 add_addr_accepted 2 $ ip -n mptcp-client mptcp endpoint add 192.168.0.2 dev red-client id \ 1 subflow 3. listen and accept on a port, such as 9999. The nc command we used here is modified, which makes it use mptcp protocol by default. $ ip netns exec mptcp-server nc -l -k -p 9999 4. open another *two* terminal and use each of them to connect to the server with the following command: $ ip netns exec mptcp-client nc 10.0.0.1 9999 Input something after connect to trigger the connection of the second subflow. So that there are two established mptcp connections, with the second one still unaccepted. 5. exit all the nc command, and check the tcp socket in server namespace. And you will find that there is one tcp socket in CLOSE_WAIT state and can't release forever. Fix this by closing all of the unaccepted mptcp socket in mptcp_subflow_queue_clean() with __mptcp_close(). Now, we can ensure that all unaccepted mptcp sockets will be cleaned by __mptcp_close() before they are released, so mptcp_sock_destruct(), which is used to clean the unaccepted mptcp socket, is not needed anymore. The selftests for mptcp is ran for this commit, and no new failures. Fixes: f296234c98a8 ("mptcp: Add handling of incoming MP_JOIN requests") Fixes: 6aeed9045071 ("mptcp: fix race on unaccepted mptcp sockets") Cc: stable@vger.kernel.org Reviewed-by: Jiang Biao <benbjiang@tencent.com> Reviewed-by: Mengen Sun <mengensun@tencent.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Menglong Dong <imagedong@tencent.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-08-05mptcp: move subflow cleanup in mptcp_destroy_common()Paolo Abeni
If the mptcp socket creation fails due to a CGROUP_INET_SOCK_CREATE eBPF program, the MPTCP protocol ends-up leaking all the subflows: the related cleanup happens in __mptcp_destroy_sock() that is not invoked in such code path. Address the issue moving the subflow sockets cleanup in the mptcp_destroy_common() helper, which is invoked in every msk cleanup path. Additionally get rid of the intermediate list_splice_init step, which is an unneeded relic from the past. The issue is present since before the reported root cause commit, but any attempt to backport the fix before that hash will require a complete rewrite. Fixes: e16163b6e2 ("mptcp: refactor shutdown and close") Reported-by: Nguyen Dinh Phi <phind.uet@gmail.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Co-developed-by: Nguyen Dinh Phi <phind.uet@gmail.com> Signed-off-by: Nguyen Dinh Phi <phind.uet@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-28Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-26mptcp: Do not return EINPROGRESS when subflow creation succeedsMat Martineau
New subflows are created within the kernel using O_NONBLOCK, so EINPROGRESS is the expected return value from kernel_connect(). __mptcp_subflow_connect() has the correct logic to consider EINPROGRESS to be a successful case, but it has also used that error code as its return value. Before v5.19 this was benign: all the callers ignored the return value. Starting in v5.19 there is a MPTCP_PM_CMD_SUBFLOW_CREATE generic netlink command that does use the return value, so the EINPROGRESS gets propagated to userspace. Make __mptcp_subflow_connect() always return 0 on success instead. Fixes: ec3edaa7ca6c ("mptcp: Add handling of outgoing MP_JOIN requests") Fixes: 702c2f646d42 ("mptcp: netlink: allow userspace-driven subflow establishment") Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Link: https://lore.kernel.org/r/20220725205231.87529-1-mathew.j.martineau@linux.intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-06-30Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
drivers/net/ethernet/microchip/sparx5/sparx5_switchdev.c 9c5de246c1db ("net: sparx5: mdb add/del handle non-sparx5 devices") fbb89d02e33a ("net: sparx5: Allow mdb entries to both CPU and ports") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-06-28net: mptcp: fix some spelling mistake in mptcpMenglong Dong
codespell finds some spelling mistake in mptcp: net/mptcp/subflow.c:1624: interaces ==> interfaces net/mptcp/pm_netlink.c:1130: regarless ==> regardless Just fix them. Signed-off-by: Menglong Dong <imagedong@tencent.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Link: https://lore.kernel.org/r/20220627121626.1595732-1-imagedong@tencent.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-06-28mptcp: fix race on unaccepted mptcp socketsPaolo Abeni
When the listener socket owning the relevant request is closed, it frees the unaccepted subflows and that causes later deletion of the paired MPTCP sockets. The mptcp socket's worker can run in the time interval between such delete operations. When that happens, any access to msk->first will cause an UaF access, as the subflow cleanup did not cleared such field in the mptcp socket. Address the issue explicitly traversing the listener socket accept queue at close time and performing the needed cleanup on the pending msk. Note that the locking is a bit tricky, as we need to acquire the msk socket lock, while still owning the subflow socket one. Fixes: 86e39e04482b ("mptcp: keep track of local endpoint still available for each msk") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-06-28mptcp: consistent map handling on failurePaolo Abeni
When the MPTCP receive path reach a non fatal fall-back condition, e.g. when the MPC sockets must fall-back to TCP, the existing code is a little self-inconsistent: it reports that new data is available - return true - but sets the MPC flag to the opposite value. As the consequence read operations in some exceptional scenario may block unexpectedly. Address the issue setting the correct MPC read status. Additionally avoid some code duplication in the fatal fall-back scenario. Fixes: 9c81be0dbc89 ("mptcp: add MP_FAIL response support") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-06-28mptcp: fix shutdown vs fallback racePaolo Abeni
If the MPTCP socket shutdown happens before a fallback to TCP, and all the pending data have been already spooled, we never close the TCP connection. Address the issue explicitly checking for critical condition at fallback time. Fixes: 1e39e5a32ad7 ("mptcp: infinite mapping sending") Fixes: 0348c690ed37 ("mptcp: add the fallback check") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-06-28mptcp: invoke MP_FAIL response when neededGeliang Tang
mptcp_mp_fail_no_response shouldn't be invoked on each worker run, it should be invoked only when MP_FAIL response timeout occurs. This patch refactors the MP_FAIL response logic. It leverages the fact that only the MPC/first subflow can gracefully fail to avoid unneeded subflows traversal: the failing subflow can be only msk->first. A new 'fail_tout' field is added to the subflow context to record the MP_FAIL response timeout and use such field to reliably share the timeout timer between the MP_FAIL event and the MPTCP socket close timeout. Finally, a new ack is generated to send out MP_FAIL notification as soon as we hit the relevant condition, instead of waiting a possibly unbound time for the next data packet. Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/281 Fixes: d9fb797046c5 ("mptcp: Do not traverse the subflow connection list without lock") Co-developed-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-06-28mptcp: introduce MAPPING_BAD_CSUMPaolo Abeni
This allow moving a couple of conditional out of the fast path, making the code more easy to follow and will simplify the next patch. Fixes: ae66fb2ba6c3 ("mptcp: Do TCP fallback on early DSS checksum failure") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-06-28mptcp: fix error mibs accountingPaolo Abeni
The current accounting for MP_FAIL and FASTCLOSE is not very accurate: both can be increased even when the related option is not really sent. Move the accounting into the correct place. Fixes: eb7f33654dc1 ("mptcp: add the mibs for MP_FAIL") Fixes: 1e75629cb964 ("mptcp: add the mibs for MP_FASTCLOSE") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-19mptcp: Check for orphaned subflow before handling MP_FAIL timerMat Martineau
MP_FAIL timeout (waiting for a peer to respond to an MP_FAIL with another MP_FAIL) is implemented using the MPTCP socket's sk_timer. That timer is also used at MPTCP socket close, so it's important to not have the two timer users interfere with each other. At MPTCP socket close, all subflows are orphaned before sk_timer is manipulated. By checking the SOCK_DEAD flag on the subflows, each subflow can determine if the timer is safe to alter without acquiring any MPTCP-level lock. This replaces code that was using the mptcp_data_lock and MPTCP-level socket state checks that did not correctly protect the timer. Fixes: 49fa1919d6bc ("mptcp: reset subflow when MP_FAIL doesn't respond") Reviewed-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-19mptcp: stop using the mptcp_has_another_subflow() helperPaolo Abeni
The mentioned helper requires the msk socket lock, and the current callers don't own it nor can't acquire it, so the access is racy. All the current callers are really checking for infinite mapping fallback, and the latter condition is explicitly tracked by the relevant msk variable: we can safely remove the caller usage - and the caller itself. The issue is present since MP_FAIL implementation, but the fix only applies since the infinite fallback support, ence the somewhat unexpected fixes tag. Fixes: 0530020a7c8f ("mptcp: track and update contiguous data status") Acked-and-tested-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-19Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
drivers/net/ethernet/mellanox/mlx5/core/main.c b33886971dbc ("net/mlx5: Initialize flow steering during driver probe") 40379a0084c2 ("net/mlx5_fpga: Drop INNOVA TLS support") f2b41b32cde8 ("net/mlx5: Remove ipsec_ops function table") https://lore.kernel.org/all/20220519040345.6yrjromcdistu7vh@sx1/ 16d42d313350 ("net/mlx5: Drain fw_reset when removing device") 8324a02c342a ("net/mlx5: Add exit route when waiting for FW") https://lore.kernel.org/all/20220519114119.060ce014@canb.auug.org.au/ tools/testing/selftests/net/mptcp/mptcp_join.sh e274f7154008 ("selftests: mptcp: add subflow limits test-cases") b6e074e171bc ("selftests: mptcp: add infinite map testcase") 5ac1d2d63451 ("selftests: mptcp: Add tests for userspace PM type") https://lore.kernel.org/all/20220516111918.366d747f@canb.auug.org.au/ net/mptcp/options.c ba2c89e0ea74 ("mptcp: fix checksum byte order") 1e39e5a32ad7 ("mptcp: infinite mapping sending") ea66758c1795 ("tcp: allow MPTCP to update the announced window") https://lore.kernel.org/all/20220519115146.751c3a37@canb.auug.org.au/ net/mptcp/pm.c 95d686517884 ("mptcp: fix subflow accounting on close") 4d25247d3ae4 ("mptcp: bypass in-kernel PM restrictions for non-kernel PMs") https://lore.kernel.org/all/20220516111435.72f35dca@canb.auug.org.au/ net/mptcp/subflow.c ae66fb2ba6c3 ("mptcp: Do TCP fallback on early DSS checksum failure") 0348c690ed37 ("mptcp: add the fallback check") f8d4bcacff3b ("mptcp: infinite mapping receiving") https://lore.kernel.org/all/20220519115837.380bb8d4@canb.auug.org.au/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>