summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2020-01-14hv_sock: Remove the accept port restrictionSunil Muthuswamy
Currently, hv_sock restricts the port the guest socket can accept connections on. hv_sock divides the socket port namespace into two parts for server side (listening socket), 0-0x7FFFFFFF & 0x80000000-0xFFFFFFFF (there are no restrictions on client port namespace). The first part (0-0x7FFFFFFF) is reserved for sockets where connections can be accepted. The second part (0x80000000-0xFFFFFFFF) is reserved for allocating ports for the peer (host) socket, once a connection is accepted. This reservation of the port namespace is specific to hv_sock and not known by the generic vsock library (ex: af_vsock). This is problematic because auto-binds/ephemeral ports are handled by the generic vsock library and it has no knowledge of this port reservation and could allocate a port that is not compatible with hv_sock (and legitimately so). The issue hasn't surfaced so far because the auto-bind code of vsock (__vsock_bind_stream) prior to the change 'VSOCK: bind to random port for VMADDR_PORT_ANY' would start walking up from LAST_RESERVED_PORT (1023) and start assigning ports. That will take a large number of iterations to hit 0x7FFFFFFF. But, after the above change to randomize port selection, the issue has started coming up more frequently. There has really been no good reason to have this port reservation logic in hv_sock from the get go. Reserving a local port for peer ports is not how things are handled generally. Peer ports should reflect the peer port. This fixes the issue by lifting the port reservation, and also returns the right peer port. Since the code converts the GUID to the peer port (by using the first 4 bytes), there is a possibility of conflicts, but that seems like a reasonable risk to take, given this is limited to vsock and that only applies to all local sockets. Signed-off-by: Sunil Muthuswamy <sunilmut@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-14net: mac80211: use skb_list_walk_safe helper for gso segmentsJason A. Donenfeld
This is a conversion case for the new function, keeping the flow of the existing code as intact as possible. We also switch over to using skb_mark_not_on_list instead of a null write to skb->next. Finally, this code appeared to have a memory leak in the case where header building fails before the last gso segment. In that case, the remaining segments are not freed. So this commit also adds the proper kfree_skb_list call for the remainder of the skbs. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-14net: netfilter: use skb_list_walk_safe helper for gso segmentsJason A. Donenfeld
This is a straight-forward conversion case for the new function, keeping the flow of the existing code as intact as possible. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-14net: ipv4: use skb_list_walk_safe helper for gso segmentsJason A. Donenfeld
This is a straight-forward conversion case for the new function, keeping the flow of the existing code as intact as possible. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-14net: sched: use skb_list_walk_safe helper for gso segmentsJason A. Donenfeld
This is a straight-forward conversion case for the new function, keeping the flow of the existing code as intact as possible. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-14net: openvswitch: use skb_list_walk_safe helper for gso segmentsJason A. Donenfeld
This is a straight-forward conversion case for the new function, keeping the flow of the existing code as intact as possible. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-14net: xfrm: use skb_list_walk_safe helper for gso segmentsJason A. Donenfeld
This is converts xfrm segment iteration to use the new function, keeping the flow of the existing code as intact as possible. One case is very straight-forward, whereas the other case has some more subtle code that likes to peak at ->next and relink skbs. By keeping the variables the same as before, we can upgrade this code with minimal surgery required. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-14net: udp: use skb_list_walk_safe helper for gso segmentsJason A. Donenfeld
This is a straight-forward conversion case for the new function, iterating over the return value from udp_rcv_segment, which actually is a wrapper around skb_gso_segment. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-14bpf: Return -EBADRQC for invalid map type in __bpf_tx_xdp_mapLi RongQing
A negative value should be returned if map->map_type is invalid although that is impossible now, but if we run into such situation in future, then xdpbuff could be leaked. Daniel Borkmann suggested: -EBADRQC should be returned to stay consistent with generic XDP for the tracepoint output and not to be confused with -EOPNOTSUPP from other locations like dev_map_enqueue() when ndo_xdp_xmit is missing and such. Suggested-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Li RongQing <lirongqing@baidu.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/1578618277-18085-1-git-send-email-lirongqing@baidu.com
2020-01-14netns: don't disable BHs when locking "nsid_lock"Guillaume Nault
When peernet2id() had to lock "nsid_lock" before iterating through the nsid table, we had to disable BHs, because VXLAN can call peernet2id() from the xmit path: vxlan_xmit() -> vxlan_fdb_miss() -> vxlan_fdb_notify() -> __vxlan_fdb_notify() -> vxlan_fdb_info() -> peernet2id(). Now that peernet2id() uses RCU protection, "nsid_lock" isn't used in BH context anymore. Therefore, we can safely use plain spin_lock()/spin_unlock() and let BHs run when holding "nsid_lock". Signed-off-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-14netns: protect netns ID lookups with RCUGuillaume Nault
__peernet2id() can be protected by RCU as it only calls idr_for_each(), which is RCU-safe, and never modifies the nsid table. rtnl_net_dumpid() can also do lockless lookups. It does two nested idr_for_each() calls on nsid tables (one direct call and one indirect call because of rtnl_net_dumpid_one() calling __peernet2id()). The netnsid tables are never updated. Therefore it is safe to not take the nsid_lock and run within an RCU-critical section instead. Signed-off-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-14netns: Remove __peernet2id_alloc()Guillaume Nault
__peernet2id_alloc() was used for both plain lookups and for netns ID allocations (depending the value of '*alloc'). Let's separate lookups from allocations instead. That is, integrate the lookup code into __peernet2id() and make peernet2id_alloc() responsible for allocating new netns IDs when necessary. This makes it clear that __peernet2id() doesn't modify the idr and prepares the code for lockless lookups. Also, mark the 'net' argument of __peernet2id() as 'const', since we're modifying this line. Signed-off-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-14xprtrdma: Fix oops in Receive handler after device removalChuck Lever
Since v5.4, a device removal occasionally triggered this oops: Dec 2 17:13:53 manet kernel: BUG: unable to handle page fault for address: 0000000c00000219 Dec 2 17:13:53 manet kernel: #PF: supervisor read access in kernel mode Dec 2 17:13:53 manet kernel: #PF: error_code(0x0000) - not-present page Dec 2 17:13:53 manet kernel: PGD 0 P4D 0 Dec 2 17:13:53 manet kernel: Oops: 0000 [#1] SMP Dec 2 17:13:53 manet kernel: CPU: 2 PID: 468 Comm: kworker/2:1H Tainted: G W 5.4.0-00050-g53717e43af61 #883 Dec 2 17:13:53 manet kernel: Hardware name: Supermicro SYS-6028R-T/X10DRi, BIOS 1.1a 10/16/2015 Dec 2 17:13:53 manet kernel: Workqueue: ib-comp-wq ib_cq_poll_work [ib_core] Dec 2 17:13:53 manet kernel: RIP: 0010:rpcrdma_wc_receive+0x7c/0xf6 [rpcrdma] Dec 2 17:13:53 manet kernel: Code: 6d 8b 43 14 89 c1 89 45 78 48 89 4d 40 8b 43 2c 89 45 14 8b 43 20 89 45 18 48 8b 45 20 8b 53 14 48 8b 30 48 8b 40 10 48 8b 38 <48> 8b 87 18 02 00 00 48 85 c0 75 18 48 8b 05 1e 24 c4 e1 48 85 c0 Dec 2 17:13:53 manet kernel: RSP: 0018:ffffc900035dfe00 EFLAGS: 00010246 Dec 2 17:13:53 manet kernel: RAX: ffff888467290000 RBX: ffff88846c638400 RCX: 0000000000000048 Dec 2 17:13:53 manet kernel: RDX: 0000000000000048 RSI: 00000000f942e000 RDI: 0000000c00000001 Dec 2 17:13:53 manet kernel: RBP: ffff888467611b00 R08: ffff888464e4a3c4 R09: 0000000000000000 Dec 2 17:13:53 manet kernel: R10: ffffc900035dfc88 R11: fefefefefefefeff R12: ffff888865af4428 Dec 2 17:13:53 manet kernel: R13: ffff888466023000 R14: ffff88846c63f000 R15: 0000000000000010 Dec 2 17:13:53 manet kernel: FS: 0000000000000000(0000) GS:ffff88846fa80000(0000) knlGS:0000000000000000 Dec 2 17:13:53 manet kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Dec 2 17:13:53 manet kernel: CR2: 0000000c00000219 CR3: 0000000002009002 CR4: 00000000001606e0 Dec 2 17:13:53 manet kernel: Call Trace: Dec 2 17:13:53 manet kernel: __ib_process_cq+0x5c/0x14e [ib_core] Dec 2 17:13:53 manet kernel: ib_cq_poll_work+0x26/0x70 [ib_core] Dec 2 17:13:53 manet kernel: process_one_work+0x19d/0x2cd Dec 2 17:13:53 manet kernel: ? cancel_delayed_work_sync+0xf/0xf Dec 2 17:13:53 manet kernel: worker_thread+0x1a6/0x25a Dec 2 17:13:53 manet kernel: ? cancel_delayed_work_sync+0xf/0xf Dec 2 17:13:53 manet kernel: kthread+0xf4/0xf9 Dec 2 17:13:53 manet kernel: ? kthread_queue_delayed_work+0x74/0x74 Dec 2 17:13:53 manet kernel: ret_from_fork+0x24/0x30 The proximal cause is that this rpcrdma_rep has a rr_rdmabuf that is still pointing to the old ib_device, which has been freed. The only way that is possible is if this rpcrdma_rep was not destroyed by rpcrdma_ia_remove. Debugging showed that was indeed the case: this rpcrdma_rep was still in use by a completing RPC at the time of the device removal, and thus wasn't on the rep free list. So, it was not found by rpcrdma_reps_destroy(). The fix is to introduce a list of all rpcrdma_reps so that they all can be found when a device is removed. That list is used to perform only regbuf DMA unmapping, replacing that call to rpcrdma_reps_destroy(). Meanwhile, to prevent corruption of this list, I've moved the destruction of temp rpcrdma_rep objects to rpcrdma_post_recvs(). rpcrdma_xprt_drain() ensures that post_recvs (and thus rep_destroy) is not invoked while rpcrdma_reps_unmap is walking rb_all_reps, thus protecting the rb_all_reps list. Fixes: b0b227f071a0 ("xprtrdma: Use an llist to manage free rpcrdma_reps") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-01-14xprtrdma: Fix completion wait during device removalChuck Lever
I've found that on occasion, "rmmod <dev>" will hang while if an NFS is under load. Ensure that ri_remove_done is initialized only just before the transport is woken up to force a close. This avoids the completion possibly getting initialized again while the CM event handler is waiting for a wake-up. Fixes: bebd031866ca ("xprtrdma: Support unplugging an HCA from under an NFS mount") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-01-14xprtrdma: Fix create_qp crash on device unloadChuck Lever
On device re-insertion, the RDMA device driver crashes trying to set up a new QP: Nov 27 16:32:06 manet kernel: BUG: kernel NULL pointer dereference, address: 00000000000001c0 Nov 27 16:32:06 manet kernel: #PF: supervisor write access in kernel mode Nov 27 16:32:06 manet kernel: #PF: error_code(0x0002) - not-present page Nov 27 16:32:06 manet kernel: PGD 0 P4D 0 Nov 27 16:32:06 manet kernel: Oops: 0002 [#1] SMP Nov 27 16:32:06 manet kernel: CPU: 1 PID: 345 Comm: kworker/u28:0 Tainted: G W 5.4.0 #852 Nov 27 16:32:06 manet kernel: Hardware name: Supermicro SYS-6028R-T/X10DRi, BIOS 1.1a 10/16/2015 Nov 27 16:32:06 manet kernel: Workqueue: xprtiod xprt_rdma_connect_worker [rpcrdma] Nov 27 16:32:06 manet kernel: RIP: 0010:atomic_try_cmpxchg+0x2/0x12 Nov 27 16:32:06 manet kernel: Code: ff ff 48 8b 04 24 5a c3 c6 07 00 0f 1f 40 00 c3 31 c0 48 81 ff 08 09 68 81 72 0c 31 c0 48 81 ff 83 0c 68 81 0f 92 c0 c3 8b 06 <f0> 0f b1 17 0f 94 c2 84 d2 75 02 89 06 88 d0 c3 53 ba 01 00 00 00 Nov 27 16:32:06 manet kernel: RSP: 0018:ffffc900035abbf0 EFLAGS: 00010046 Nov 27 16:32:06 manet kernel: RAX: 0000000000000000 RBX: 00000000000001c0 RCX: 0000000000000000 Nov 27 16:32:06 manet kernel: RDX: 0000000000000001 RSI: ffffc900035abbfc RDI: 00000000000001c0 Nov 27 16:32:06 manet kernel: RBP: ffffc900035abde0 R08: 000000000000000e R09: ffffffffffffc000 Nov 27 16:32:06 manet kernel: R10: 0000000000000000 R11: 000000000002e800 R12: ffff88886169d9f8 Nov 27 16:32:06 manet kernel: R13: ffff88886169d9f4 R14: 0000000000000246 R15: 0000000000000000 Nov 27 16:32:06 manet kernel: FS: 0000000000000000(0000) GS:ffff88846fa40000(0000) knlGS:0000000000000000 Nov 27 16:32:06 manet kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Nov 27 16:32:06 manet kernel: CR2: 00000000000001c0 CR3: 0000000002009006 CR4: 00000000001606e0 Nov 27 16:32:06 manet kernel: Call Trace: Nov 27 16:32:06 manet kernel: do_raw_spin_lock+0x2f/0x5a Nov 27 16:32:06 manet kernel: create_qp_common.isra.47+0x856/0xadf [mlx4_ib] Nov 27 16:32:06 manet kernel: ? slab_post_alloc_hook.isra.60+0xa/0x1a Nov 27 16:32:06 manet kernel: ? __kmalloc+0x125/0x139 Nov 27 16:32:06 manet kernel: mlx4_ib_create_qp+0x57f/0x972 [mlx4_ib] The fix is to copy the qp_init_attr struct that was just created by rpcrdma_ep_create() instead of using the one from the previous connection instance. Fixes: 98ef77d1aaa7 ("xprtrdma: Send Queue size grows after a reconnect") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-01-14xfrm: interface: do not confirm neighbor when do pmtu updateXu Wang
When do IPv6 tunnel PMTU update and calls __ip6_rt_update_pmtu() in the end, we should not call dst_confirm_neigh() as there is no two-way communication. Signed-off-by: Xu Wang <vulab@iscas.ac.cn> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-01-14xfrm interface: fix packet tx through bpf_redirect()Nicolas Dichtel
With an ebpf program that redirects packets through a xfrm interface, packets are dropped because no dst is attached to skb. This could also be reproduced with an AF_PACKET socket, with the following python script (xfrm1 is a xfrm interface): import socket send_s = socket.socket(socket.AF_PACKET, socket.SOCK_RAW, 0) # scapy # p = IP(src='10.100.0.2', dst='10.200.0.1')/ICMP(type='echo-request') # raw(p) req = b'E\x00\x00\x1c\x00\x01\x00\x00@\x01e\xb2\nd\x00\x02\n\xc8\x00\x01\x08\x00\xf7\xff\x00\x00\x00\x00' send_s.sendto(req, ('xfrm1', 0x800, 0, 0)) It was also not possible to send an ip packet through an AF_PACKET socket because a LL header was expected. Let's remove those LL header constraints. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-01-14vti[6]: fix packet tx through bpf_redirect()Nicolas Dichtel
With an ebpf program that redirects packets through a vti[6] interface, the packets are dropped because no dst is attached. This could also be reproduced with an AF_PACKET socket, with the following python script (vti1 is an ip_vti interface): import socket send_s = socket.socket(socket.AF_PACKET, socket.SOCK_RAW, 0) # scapy # p = IP(src='10.100.0.2', dst='10.200.0.1')/ICMP(type='echo-request') # raw(p) req = b'E\x00\x00\x1c\x00\x01\x00\x00@\x01e\xb2\nd\x00\x02\n\xc8\x00\x01\x08\x00\xf7\xff\x00\x00\x00\x00' send_s.sendto(req, ('vti1', 0x800, 0, 0)) Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-01-13netfilter: arp_tables: init netns pointer in xt_tgdtor_param structFlorian Westphal
An earlier commit (1b789577f655060d98d20e, "netfilter: arp_tables: init netns pointer in xt_tgchk_param struct") fixed missing net initialization for arptables, but turns out it was incomplete. We can get a very similar struct net NULL deref during error unwinding: general protection fault: 0000 [#1] PREEMPT SMP KASAN RIP: 0010:xt_rateest_put+0xa1/0x440 net/netfilter/xt_RATEEST.c:77 xt_rateest_tg_destroy+0x72/0xa0 net/netfilter/xt_RATEEST.c:175 cleanup_entry net/ipv4/netfilter/arp_tables.c:509 [inline] translate_table+0x11f4/0x1d80 net/ipv4/netfilter/arp_tables.c:587 do_replace net/ipv4/netfilter/arp_tables.c:981 [inline] do_arpt_set_ctl+0x317/0x650 net/ipv4/netfilter/arp_tables.c:1461 Also init the netns pointer in xt_tgdtor_param struct. Fixes: add67461240c1d ("netfilter: add struct net * to target parameters") Reported-by: syzbot+91bdd8eece0f6629ec8b@syzkaller.appspotmail.com Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-13netfilter: fix a use-after-free in mtype_destroy()Cong Wang
map->members is freed by ip_set_free() right before using it in mtype_ext_cleanup() again. So we just have to move it down. Reported-by: syzbot+4c3cc6dbe7259dbf9054@syzkaller.appspotmail.com Fixes: 40cd63bf33b2 ("netfilter: ipset: Support extensions which need a per data destroy function") Acked-by: Jozsef Kadlecsik <kadlec@netfilter.org> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-11devlink: correct misspelling of snapshotJacob Keller
The function to obtain a unique snapshot id was mistakenly typo'd as devlink_region_shapshot_id_get. Fix this typo by renaming the function and all of its users. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-10devlink: Wait longer before warning about unset port typeIdo Schimmel
The commit cited below causes devlink to emit a warning if a type was not set on a devlink port for longer than 30 seconds to "prevent misbehavior of drivers". This proved to be problematic when unregistering the backing netdev. The flow is always: devlink_port_type_clear() // schedules the warning unregister_netdev() // blocking devlink_port_unregister() // cancels the warning The call to unregister_netdev() can block for long periods of time for various reasons: RTNL lock is contended, large amounts of configuration to unroll following dismantle of the netdev, etc. This results in devlink emitting a warning despite the driver behaving correctly. In emulated environments (of future hardware) which are usually very slow, the warning can also be emitted during port creation as more than 30 seconds can pass between the time the devlink port is registered and when its type is set. In addition, syzbot has hit this warning [1] 1974 times since 07/11/19 without being able to produce a reproducer. Probably because reproduction depends on the load or other bugs (e.g., RTNL not being released). To prevent bogus warnings, increase the timeout to 1 hour. [1] https://syzkaller.appspot.com/bug?id=e99b59e9c024a666c9f7450dc162a4b74d09d9cb Fixes: 136bf27fc0e9 ("devlink: add warning in case driver does not set port type") Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reported-by: syzbot+b0a18ed7b08b735d2f41@syzkaller.appspotmail.com Reported-by: Alex Veber <alexve@mellanox.com> Tested-by: Alex Veber <alexve@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-10ipv4: Detect rollover in specific fib table dumpDavid Ahern
Sven-Haegar reported looping on fib dumps when 255.255.255.255 route has been added to a table. The looping is caused by the key rolling over from FFFFFFFF to 0. When dumping a specific table only, we need a means to detect when the table dump is done. The key and count saved to cb args are both 0 only at the start of the table dump. If key is 0 and count > 0, then we are in the rollover case. Detect and return to avoid looping. This only affects dumps of a specific table; for dumps of all tables (the case prior to the change in the Fixes tag) inet_dump_fib moved the entry counter to the next table and reset the cb args used by fib_table_dump and fn_trie_dump_leaf, so the rollover ffffffff back to 0 did not cause looping with the dumps. Fixes: effe67926624 ("net: Enable kernel side filtering of route dumps") Reported-by: Sven-Haegar Koch <haegar@sdinet.de> Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-10net/tls: fix async operationJakub Kicinski
Mallesham reports the TLS with async accelerator was broken by commit d10523d0b3d7 ("net/tls: free the record on encryption error") because encryption can return -EINPROGRESS in such setups, which should not be treated as an error. The error is also present in the BPF path (likely copied from there). Reported-by: Mallesham Jatharakonda <mallesham.jatharakonda@oneconvergence.com> Fixes: d3b18ad31f93 ("tls: add bpf support to sk_msg handling") Fixes: d10523d0b3d7 ("net/tls: free the record on encryption error") Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-10net/tls: avoid spurious decryption error with HW resyncJakub Kicinski
When device loses sync mid way through a record - kernel has to re-encrypt the part of the record which the device already decrypted to be able to decrypt and authenticate the record in its entirety. The re-encryption piggy backs on the decryption routine, but obviously because the partially decrypted record can't be authenticated crypto API returns an error which is then ignored by tls_device_reencrypt(). Commit 5c5ec6685806 ("net/tls: add TlsDecryptError stat") added a statistic to count decryption errors, this statistic can't be incremented when we see the expected re-encryption error. Move the inc to the caller. Reported-and-tested-by: David Beckett <david.beckett@netronome.com> Fixes: 5c5ec6685806 ("net/tls: add TlsDecryptError stat") Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-10net: bpf: Don't leak time wait and request socketsLorenz Bauer
It's possible to leak time wait and request sockets via the following BPF pseudo code:   sk = bpf_skc_lookup_tcp(...) if (sk) bpf_sk_release(sk) If sk->sk_state is TCP_NEW_SYN_RECV or TCP_TIME_WAIT the refcount taken by bpf_skc_lookup_tcp is not undone by bpf_sk_release. This is because sk_flags is re-used for other data in both kinds of sockets. The check !sock_flag(sk, SOCK_RCU_FREE) therefore returns a bogus result. Check that sk_flags is valid by calling sk_fullsock. Skip checking SOCK_RCU_FREE if we already know that sk is not a full socket. Fixes: edbf8c01de5a ("bpf: add skc_lookup_tcp helper") Fixes: f7355a6c0497 ("bpf: Check sk_fullsock() before returning from bpf_sk_lookup()") Signed-off-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200110132336.26099-1-lmb@cloudflare.com
2020-01-09skb: add helpers to allocate ext independently from sk_buffPaolo Abeni
Currently we can allocate the extension only after the skb, this change allows the user to do the opposite, will simplify allocation failure handling from MPTCP. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-09tcp: Check for filled TCP option space before SACKMat Martineau
Update the SACK check to work with zero option space available, a case that's possible with MPTCP but not MD5+TS. Maintained only one conditional branch for insufficient SACK space. v1 -> v2: - Moves the check inside the SACK branch by taking recent SACK fix: 9424e2e7ad93 (tcp: md5: fix potential overestimation of TCP option space) in to account, but modifies it to work in MPTCP scenarios beyond the MD5+TS corner case. Co-developed-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-09tcp: Export TCP functions and ops structMat Martineau
MPTCP will make use of tcp_send_mss() and tcp_push() when sending data to specific TCP subflows. tcp_request_sock_ipvX_ops and ipvX_specific will be referenced during TCP subflow creation. Co-developed-by: Peter Krystad <peter.krystad@linux.intel.com> Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-09tcp: coalesce/collapse must respect MPTCP extensionsMat Martineau
Coalesce and collapse of packets carrying MPTCP extensions is allowed when the newer packet has no extension or the extensions carried by both packets are equal. This allows merging of TSO packet trains and even cross-TSO packets, and does not require any additional action when moving data into existing SKBs. v3 -> v4: - allow collapsing, under mptcp_skb_can_collapse() constraint v5 -> v6: - clarify MPTCP skb extensions must always be cleared at allocation time Co-developed-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-09mptcp: Add MPTCP to skb extensionsMat Martineau
Add enum value for MPTCP and update config dependencies v5 -> v6: - fixed '__unused' field size Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Co-developed-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-09tcp, ulp: Add clone operation to tcp_ulp_opsMat Martineau
If ULP is used on a listening socket, icsk_ulp_ops and icsk_ulp_data are copied when the listener is cloned. Sometimes the clone is immediately deleted, which will invoke the release op on the clone and likely corrupt the listening socket's icsk_ulp_data. The clone operation is invoked immediately after the clone is copied and gives the ULP type an opportunity to set up the clone socket and its icsk_ulp_data. The MPTCP ULP clone will silently fallback to plain TCP on allocation failure, so 'clone()' does not need to return an error code. v6 -> v7: - move and rename ulp clone helper to make it inline-friendly v5 -> v6: - clarified MPTCP clone usage in commit message Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-09sock: Make sk_protocol a 16-bit valueMat Martineau
Match the 16-bit width of skbuff->protocol. Fills an 8-bit hole so sizeof(struct sock) does not change. Also take care of BPF field access for sk_type/sk_protocol. Both of them are now outside the bitfield, so we can use load instructions without further shifting/masking. v5 -> v6: - update eBPF accessors, too (Intel's kbuild test robot) v2 -> v3: - keep 'sk_type' 2 bytes aligned (Eric) v1 -> v2: - preserve sk_pacing_shift as bit field (Eric) Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: bpf@vger.kernel.org Co-developed-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-09net: Make sock protocol value checks more specificMat Martineau
SK_PROTOCOL_MAX is only used in two places, for DECNet and AX.25. The limits have more to do with the those protocol definitions than they do with the data type of sk_protocol, so remove SK_PROTOCOL_MAX and use U8_MAX directly. Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-09net/x25: fix nonblocking connectMartin Schiller
This patch fixes 2 issues in x25_connect(): 1. It makes absolutely no sense to reset the neighbour and the connection state after a (successful) nonblocking call of x25_connect. This prevents any connection from being established, since the response (call accept) cannot be processed. 2. Any further calls to x25_connect() while a call is pending should simply return, instead of creating new Call Request (on different logical channels). This patch should also fix the "KASAN: null-ptr-deref Write in x25_connect" and "BUG: unable to handle kernel NULL pointer dereference in x25_connect" bugs reported by syzbot. Signed-off-by: Martin Schiller <ms@dev.tdt.de> Reported-by: syzbot+429c200ffc8772bfe070@syzkaller.appspotmail.com Reported-by: syzbot+eec0c87f31a7c3b66f7b@syzkaller.appspotmail.com Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-09flow_dissector: fix document for skb_flow_get_icmp_tciLi RongQing
using correct input parameter name to fix the below warning: net/core/flow_dissector.c:242: warning: Function parameter or member 'thoff' not described in 'skb_flow_get_icmp_tci' net/core/flow_dissector.c:242: warning: Excess function parameter 'toff' description in 'skb_flow_get_icmp_tci' Signed-off-by: Li RongQing <lirongqing@baidu.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-09net/ncsi: Support for multi host mellanox cardVijay Khemka
Multi host Mellanox cards require MAC affinity to be set before receiving any config commands. All config commands should also have unicast address for source address in command header. Adding GMA and SMAF(Set Mac Affinity) for Mellanox card and call these in channel probe state machine if it is defined in device tree. Signed-off-by: Vijay Khemka <vijaykhemka@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-09bpf/sockmap: Read psock ingress_msg before sk_receive_queueLingpeng Chen
Right now in tcp_bpf_recvmsg, sock read data first from sk_receive_queue if not empty than psock->ingress_msg otherwise. If a FIN packet arrives and there's also some data in psock->ingress_msg, the data in psock->ingress_msg will be purged. It is always happen when request to a HTTP1.0 server like python SimpleHTTPServer since the server send FIN packet after data is sent out. Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface") Reported-by: Arika Chen <eaglesora@gmail.com> Suggested-by: Arika Chen <eaglesora@gmail.com> Signed-off-by: Lingpeng Chen <forrest0579@gmail.com> Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20200109014833.18951-1-forrest0579@gmail.com
2020-01-09Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller
The ungrafting from PRIO bug fixes in net, when merged into net-next, merge cleanly but create a build failure. The resolution used here is from Petr Machata. Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-09bpf: Add BPF_FUNC_tcp_send_ack helperMartin KaFai Lau
Add a helper to send out a tcp-ack. It will be used in the later bpf_dctcp implementation that requires to send out an ack when the CE state changed. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20200109004551.3900448-1-kafai@fb.com
2020-01-09bpf: tcp: Support tcp_congestion_ops in bpfMartin KaFai Lau
This patch makes "struct tcp_congestion_ops" to be the first user of BPF STRUCT_OPS. It allows implementing a tcp_congestion_ops in bpf. The BPF implemented tcp_congestion_ops can be used like regular kernel tcp-cc through sysctl and setsockopt. e.g. [root@arch-fb-vm1 bpf]# sysctl -a | egrep congestion net.ipv4.tcp_allowed_congestion_control = reno cubic bpf_cubic net.ipv4.tcp_available_congestion_control = reno bic cubic bpf_cubic net.ipv4.tcp_congestion_control = bpf_cubic There has been attempt to move the TCP CC to the user space (e.g. CCP in TCP). The common arguments are faster turn around, get away from long-tail kernel versions in production...etc, which are legit points. BPF has been the continuous effort to join both kernel and userspace upsides together (e.g. XDP to gain the performance advantage without bypassing the kernel). The recent BPF advancements (in particular BTF-aware verifier, BPF trampoline, BPF CO-RE...) made implementing kernel struct ops (e.g. tcp cc) possible in BPF. It allows a faster turnaround for testing algorithm in the production while leveraging the existing (and continue growing) BPF feature/framework instead of building one specifically for userspace TCP CC. This patch allows write access to a few fields in tcp-sock (in bpf_tcp_ca_btf_struct_access()). The optional "get_info" is unsupported now. It can be added later. One possible way is to output the info with a btf-id to describe the content. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20200109003508.3856115-1-kafai@fb.com
2020-01-08ethtool: potential NULL dereference in strset_prepare_data()Dan Carpenter
Smatch complains that the NULL checking isn't done consistently: net/ethtool/strset.c:253 strset_prepare_data() error: we previously assumed 'dev' could be null (see line 233) It looks like there is a missing return on this path. Fixes: 71921690f974 ("ethtool: provide string sets with STRSET_GET request") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-08ethtool: fix ->reply_size() error handlingDan Carpenter
The "ret < 0" comparison is never true because "ret" is still zero. Fixes: 728480f12442 ("ethtool: default handlers for GET requests") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-08ethtool: fix a memory leak in ethnl_default_start()Dan Carpenter
If ethnl_default_parse() fails then we need to free a couple memory allocations before returning. Fixes: 728480f12442 ("ethtool: default handlers for GET requests") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-08net: dsa: Get information about stacked DSA protocolFlorian Fainelli
It is possible to stack multiple DSA switches in a way that they are not part of the tree (disjoint) but the DSA master of a switch is a DSA slave of another. When that happens switch drivers may have to know this is the case so as to determine whether their tagging protocol has a remove chance of working. This is useful for specific switch drivers such as b53 where devices have been known to be stacked in the wild without the Broadcom tag protocol supporting that feature. This allows b53 to continue supporting those devices by forcing the disabling of Broadcom tags on the outermost switches if necessary. The get_tag_protocol() function is therefore updated to gain an additional enum dsa_tag_protocol argument which denotes the current tagging protocol used by the DSA master we are attached to, else DSA_TAG_PROTO_NONE for the top of the dsa_switch_tree. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-08tipc: fix wrong connect() return codeTuong Lien
The current 'tipc_wait_for_connect()' function does a wait-loop for the condition 'sk->sk_state != TIPC_CONNECTING' to conclude if the socket connecting has done. However, when the condition is met, it returns '0' even in the case the connecting is actually failed, the socket state is set to 'TIPC_DISCONNECTING' (e.g. when the server socket has closed..). This results in a wrong return code for the 'connect()' call from user, making it believe that the connection is established and go ahead with building, sending a message, etc. but finally failed e.g. '-EPIPE'. This commit fixes the issue by changing the wait condition to the 'tipc_sk_connected(sk)', so the function will return '0' only when the connection is really established. Otherwise, either the socket 'sk_err' if any or '-ETIMEDOUT'/'-EINTR' will be returned correspondingly. Acked-by: Ying Xue <ying.xue@windriver.com> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-08tipc: fix link overflow issue at socket shutdownTuong Lien
When a socket is suddenly shutdown or released, it will reject all the unreceived messages in its receive queue. This applies to a connected socket too, whereas there is only one 'FIN' message required to be sent back to its peer in this case. In case there are many messages in the queue and/or some connections with such messages are shutdown at the same time, the link layer will easily get overflowed at the 'TIPC_SYSTEM_IMPORTANCE' backlog level because of the message rejections. As a result, the link will be taken down. Moreover, immediately when the link is re-established, the socket layer can continue to reject the messages and the same issue happens... The commit refactors the '__tipc_shutdown()' function to only send one 'FIN' in the situation mentioned above. For the connectionless case, it is unavoidable but usually there is no rejections for such socket messages because they are 'dest-droppable' by default. In addition, the new code makes the other socket states clear (e.g.'TIPC_LISTEN') and treats as a separate case to avoid misbehaving. Acked-by: Ying Xue <ying.xue@windriver.com> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-08devlink: add devink notification when reporter update health stateVikas Gupta
add a devlink notification when reporter update the health state. Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-08devlink: add support for reporter recovery completionVikas Gupta
It is possible that a reporter recovery completion do not finish successfully when recovery is triggered via devlink_health_reporter_recover as recovery could be processed in different context. In such scenario an error is returned by driver when recover hook is invoked and successful recovery completion is intimated later. Expose devlink recover done API to update recovery stats. Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-08Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for net: 1) Missing netns context in arp_tables, from Florian Westphal. 2) Underflow in flowtable reference counter, from wenxu. 3) Fix incorrect ethernet destination address in flowtable offload, from wenxu. 4) Check for status of neighbour entry, from wenxu. 5) Fix NAT port mangling, from wenxu. 6) Unbind callbacks from destroy path to cleanup hardware properly on flowtable removal. 7) Fix missing casting statistics timestamp, add nf_flowtable_time_stamp and use it. 8) NULL pointer exception when timeout argument is null in conntrack dccp and sctp protocol helpers, from Florian Westphal. 9) Possible nul-dereference in ipset with IPSET_ATTR_LINENO, also from Florian. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>