summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2018-05-02sctp: use the old asoc when making the cookie-ack chunk in dupcook_dXin Long
When processing a duplicate cookie-echo chunk, for case 'D', sctp will not process the param from this chunk. It means old asoc has nothing to be updated, and the new temp asoc doesn't have the complete info. So there's no reason to use the new asoc when creating the cookie-ack chunk. Otherwise, like when auth is enabled for cookie-ack, the chunk can not be set with auth, and it will definitely be dropped by peer. This issue is there since very beginning, and we fix it by using the old asoc instead. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-02sctp: init active key for the new asoc in dupcook_a and dupcook_bXin Long
When processing a duplicate cookie-echo chunk, for case 'A' and 'B', after sctp_process_init for the new asoc, if auth is enabled for the cookie-ack chunk, the active key should also be initialized. Otherwise, the cookie-ack chunk made later can not be set with auth shkey properly, and a crash can even be caused by this, as after Commit 1b1e0bc99474 ("sctp: add refcnt support for sh_key"), sctp needs to hold the shkey when making control chunks. Fixes: 1b1e0bc99474 ("sctp: add refcnt support for sh_key") Reported-by: Jianwen Ji <jiji@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-02tcp_bbr: fix to zero idle_restart only upon S/ACKed dataNeal Cardwell
Previously the bbr->idle_restart tracking was zeroing out the bbr->idle_restart bit upon ACKs that did not SACK or ACK anything, e.g. receiving incoming data or receiver window updates. In such situations BBR would forget that this was a restart-from-idle situation, and if the min_rtt had expired it would unnecessarily enter PROBE_RTT (even though we were actually restarting from idle but had merely forgotten that fact). The fix is simple: we need to remember we are restarting from idle until we receive a S/ACK for some data (a S/ACK for the first flight of data we send as we are restarting). This commit is a stable candidate for kernels back as far as 4.9. Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control") Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Priyaranjan Jha <priyarjha@google.com> Signed-off-by: Yousuk Seung <ysseung@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-02udp: Complement partial checksum for GSO packetSean Tranchetti
Using the udp_v4_check() function to calculate the pseudo header for the newly segmented UDP packets results in assigning the complement of the value to the UDP header checksum field. Always undo the complement the partial checksum value in order to match the case where GSO is not used on the UDP transmit path. Fixes: ee80d1ebe5ba ("udp: add udp gso") Signed-off-by: Sean Tranchetti <stranche@codeaurora.org> Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-01net/tls: Don't recursively call push_record during tls_write_space callbacksDave Watson
It is reported that in some cases, write_space may be called in do_tcp_sendpages, such that we recursively invoke do_tcp_sendpages again: [ 660.468802] ? do_tcp_sendpages+0x8d/0x580 [ 660.468826] ? tls_push_sg+0x74/0x130 [tls] [ 660.468852] ? tls_push_record+0x24a/0x390 [tls] [ 660.468880] ? tls_write_space+0x6a/0x80 [tls] ... tls_push_sg already does a loop over all sending sg's, so ignore any tls_write_space notifications until we are done sending. We then have to call the previous write_space to wake up poll() waiters after we are done with the send loop. Reported-by: Andre Tomt <andre@tomt.net> Signed-off-by: Dave Watson <davejwatson@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-01tcp: send in-queue bytes in cmsg upon readSoheil Hassas Yeganeh
Applications with many concurrent connections, high variance in receive queue length and tight memory bounds cannot allocate worst-case buffer size to drain sockets. Knowing the size of receive queue length, applications can optimize how they allocate buffers to read from the socket. The number of bytes pending on the socket is directly available through ioctl(FIONREAD/SIOCINQ) and can be approximated using getsockopt(MEMINFO) (rmem_alloc includes skb overheads in addition to application data). But, both of these options add an extra syscall per recvmsg. Moreover, ioctl(FIONREAD/SIOCINQ) takes the socket lock. Add the TCP_INQ socket option to TCP. When this socket option is set, recvmsg() relays the number of bytes available on the socket for reading to the application via the TCP_CM_INQ control message. Calculate the number of bytes after releasing the socket lock to include the processed backlog, if any. To avoid an extra branch in the hot path of recvmsg() for this new control message, move all cmsg processing inside an existing branch for processing receive timestamps. Since the socket lock is not held when calculating the size of receive queue, TCP_INQ is a hint. For example, it can overestimate the queue size by one byte, if FIN is received. With this method, applications can start reading from the socket using a small buffer, and then use larger buffers based on the remaining data when needed. V3 change-log: As suggested by David Miller, added loads with barrier to check whether we have multiple threads calling recvmsg in parallel. When that happens we lock the socket to calculate inq. V4 change-log: Removed inline from a static function. Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Suggested-by: David Miller <davem@davemloft.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-01net: core: Inline netdev_features_size_check()Florian Fainelli
We do not require this inline function to be used in multiple different locations, just inline it where it gets used in register_netdevice(). Suggested-by: David Miller <davem@davemloft.net> Suggested-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-01ipv6: Allow non-gateway ECMP for IPv6Thomas Winter
It is valid to have static routes where the nexthop is an interface not an address such as tunnels. For IPv4 it was possible to use ECMP on these routes but not for IPv6. Signed-off-by: Thomas Winter <Thomas.Winter@alliedtelesis.co.nz> Cc: David Ahern <dsahern@gmail.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-01udp: disable gso with no_check_txWillem de Bruijn
Syzbot managed to send a udp gso packet without checksum offload into the gso stack by disabling tx checksum (UDP_NO_CHECK6_TX). This triggered the skb_warn_bad_offload. RIP: 0010:skb_warn_bad_offload+0x2bc/0x600 net/core/dev.c:2658 skb_gso_segment include/linux/netdevice.h:4038 [inline] validate_xmit_skb+0x54d/0xd90 net/core/dev.c:3120 __dev_queue_xmit+0xbf8/0x34c0 net/core/dev.c:3577 dev_queue_xmit+0x17/0x20 net/core/dev.c:3618 UDP_NO_CHECK6_TX sets skb->ip_summed to CHECKSUM_NONE just after the udp gso integrity checks in udp_(v6_)send_skb. Extend those checks to catch and fail in this case. After the integrity checks jump directly to the CHECKSUM_PARTIAL case to avoid reading the no_check_tx flags again (a TOCTTOU race). Fixes: bec1f6f69736 ("udp: generate gso with UDP_SEGMENT") Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-01ethtool: fix a potential missing-check bugWenwen Wang
In ethtool_get_rxnfc(), the object "info" is firstly copied from user-space. If the FLOW_RSS flag is set in the member field flow_type of "info" (and cmd is ETHTOOL_GRXFH), info needs to be copied again from user-space because FLOW_RSS is newer and has new definition, as mentioned in the comment. However, given that the user data resides in user-space, a malicious user can race to change the data after the first copy. By doing so, the user can inject inconsistent data. For example, in the second copy, the FLOW_RSS flag could be cleared in the field flow_type of "info". In the following execution, "info" will be used in the function ops->get_rxnfc(). Such inconsistent data can potentially lead to unexpected information leakage since ops->get_rxnfc() will prepare various types of data according to flow_type, and the prepared data will be eventually copied to user-space. This inconsistent data may also cause undefined behaviors based on how ops->get_rxnfc() is implemented. This patch simply re-verifies the flow_type field of "info" after the second copy. If the value is not as expected, an error code will be returned. Signed-off-by: Wenwen Wang <wang6495@umn.edu> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-01cls_flower: Support multiple masks per priorityPaul Blakey
Currently flower doesn't support inserting filters with different masks on a single priority, even if the actual flows (key + mask) inserted aren't overlapping, as with the use case of offloading openvswitch datapath flows. Instead one must go up one level, and assign different priorities for each mask, which will create a different flower instances. This patch opens flower to support more than one mask per priority, and a single flower instance. It does so by adding another hash table on top of the existing one which will store the different masks, and the filters that share it. The user is left with the responsibility of ensuring non overlapping flows, otherwise precedence is not guaranteed. Signed-off-by: Paul Blakey <paulb@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-01xprtrdma: Fix list corruption / DMAR errors during MR recoveryChuck Lever
The ro_release_mr methods check whether mr->mr_list is empty. Therefore, be sure to always use list_del_init when removing an MR linked into a list using that field. Otherwise, when recovering from transport failures or device removal, list corruption can result, or MRs can get mapped or unmapped an odd number of times, resulting in IOMMU-related failures. In general this fix is appropriate back to v4.8. However, code changes since then make it impossible to apply this patch directly to stable kernels. The fix would have to be applied by hand or reworked for kernels earlier than v4.16. Backport guidance -- there are several cases: - When creating an MR, initialize mr_list so that using list_empty on an as-yet-unused MR is safe. - When an MR is being handled by the remote invalidation path, ensure that mr_list is reinitialized when it is removed from rl_registered. - When an MR is being handled by rpcrdma_destroy_mrs, it is removed from mr_all, but it may still be on an rl_registered list. In that case, the MR needs to be removed from that list before being released. - Other cases are covered by using list_del_init in rpcrdma_mr_pop. Fixes: 9d6b04097882 ('xprtrdma: Place registered MWs on a ... ') Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-05-01tcp: fix TCP_REPAIR_QUEUE bound checkingEric Dumazet
syzbot is able to produce a nasty WARN_ON() in tcp_verify_left_out() with following C-repro : socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 3 setsockopt(3, SOL_TCP, TCP_REPAIR, [1], 4) = 0 setsockopt(3, SOL_TCP, TCP_REPAIR_QUEUE, [-1], 4) = 0 bind(3, {sa_family=AF_INET, sin_port=htons(20002), sin_addr=inet_addr("0.0.0.0")}, 16) = 0 sendto(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1242, MSG_FASTOPEN, {sa_family=AF_INET, sin_port=htons(20002), sin_addr=inet_addr("127.0.0.1")}, 16) = 1242 setsockopt(3, SOL_TCP, TCP_REPAIR_WINDOW, "\4\0\0@+\205\0\0\377\377\0\0\377\377\377\177\0\0\0\0", 20) = 0 writev(3, [{"\270", 1}], 1) = 1 setsockopt(3, SOL_TCP, TCP_REPAIR_OPTIONS, "\10\0\0\0\0\0\0\0\0\0\0\0|\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 386) = 0 writev(3, [{"\210v\r[\226\320t\231qwQ\204\264l\254\t\1\20\245\214p\350H\223\254;\\\37\345\307p$"..., 3144}], 1) = 3144 The 3rd system call looks odd : setsockopt(3, SOL_TCP, TCP_REPAIR_QUEUE, [-1], 4) = 0 This patch makes sure bound checking is using an unsigned compare. Fixes: ee9952831cfd ("tcp: Initial repair mode") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-01ipv6: fix uninit-value in ip6_multipath_l3_keys()Eric Dumazet
syzbot/KMSAN reported an uninit-value in ip6_multipath_l3_keys(), root caused to a bad assumption of ICMP header being already pulled in skb->head ip_multipath_l3_keys() does the correct thing, so it is an IPv6 only bug. BUG: KMSAN: uninit-value in ip6_multipath_l3_keys net/ipv6/route.c:1830 [inline] BUG: KMSAN: uninit-value in rt6_multipath_hash+0x5c4/0x640 net/ipv6/route.c:1858 CPU: 0 PID: 4507 Comm: syz-executor661 Not tainted 4.16.0+ #87 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:17 [inline] dump_stack+0x185/0x1d0 lib/dump_stack.c:53 kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067 __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:683 ip6_multipath_l3_keys net/ipv6/route.c:1830 [inline] rt6_multipath_hash+0x5c4/0x640 net/ipv6/route.c:1858 ip6_route_input+0x65a/0x920 net/ipv6/route.c:1884 ip6_rcv_finish+0x413/0x6e0 net/ipv6/ip6_input.c:69 NF_HOOK include/linux/netfilter.h:288 [inline] ipv6_rcv+0x1e16/0x2340 net/ipv6/ip6_input.c:208 __netif_receive_skb_core+0x47df/0x4a90 net/core/dev.c:4562 __netif_receive_skb net/core/dev.c:4627 [inline] netif_receive_skb_internal+0x49d/0x630 net/core/dev.c:4701 netif_receive_skb+0x230/0x240 net/core/dev.c:4725 tun_rx_batched drivers/net/tun.c:1555 [inline] tun_get_user+0x740f/0x7c60 drivers/net/tun.c:1962 tun_chr_write_iter+0x1d4/0x330 drivers/net/tun.c:1990 call_write_iter include/linux/fs.h:1782 [inline] new_sync_write fs/read_write.c:469 [inline] __vfs_write+0x7fb/0x9f0 fs/read_write.c:482 vfs_write+0x463/0x8d0 fs/read_write.c:544 SYSC_write+0x172/0x360 fs/read_write.c:589 SyS_write+0x55/0x80 fs/read_write.c:581 do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x3d/0xa2 Fixes: 23aebdacb05d ("ipv6: Compute multipath hash for ICMP errors from offending packet") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Cc: Jakub Sitnicki <jkbs@redhat.com> Acked-by: Jakub Sitnicki <jkbs@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-01sctp: add sctp_make_op_error_limited and reuse inner functionsMarcelo Ricardo Leitner
The idea is quite similar to the old functions, but note that the _fixed function wasn't "fixed" as in that it would generate a packet with a fixed size, but rather limited/bounded to PMTU. Also, now with sctp_mtu_payload(), we have a more accurate limit. Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-01sctp: allow sctp_init_cause to return errorsMarcelo Ricardo Leitner
And do so if the skb doesn't have enough space for the payload. This is a preparation for the next patch. Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-01net/tls: Add generic NIC offload infrastructureIlya Lesokhin
This patch adds a generic infrastructure to offload TLS crypto to a network device. It enables the kernel TLS socket to skip encryption and authentication operations on the transmit side of the data path. Leaving those computationally expensive operations to the NIC. The NIC offload infrastructure builds TLS records and pushes them to the TCP layer just like the SW KTLS implementation and using the same API. TCP segmentation is mostly unaffected. Currently the only exception is that we prevent mixed SKBs where only part of the payload requires offload. In the future we are likely to add a similar restriction following a change cipher spec record. The notable differences between SW KTLS and NIC offloaded TLS implementations are as follows: 1. The offloaded implementation builds "plaintext TLS record", those records contain plaintext instead of ciphertext and place holder bytes instead of authentication tags. 2. The offloaded implementation maintains a mapping from TCP sequence number to TLS records. Thus given a TCP SKB sent from a NIC offloaded TLS socket, we can use the tls NIC offload infrastructure to obtain enough context to encrypt the payload of the SKB. A TLS record is released when the last byte of the record is ack'ed, this is done through the new icsk_clean_acked callback. The infrastructure should be extendable to support various NIC offload implementations. However it is currently written with the implementation below in mind: The NIC assumes that packets from each offloaded stream are sent as plaintext and in-order. It keeps track of the TLS records in the TCP stream. When a packet marked for offload is transmitted, the NIC encrypts the payload in-place and puts authentication tags in the relevant place holders. The responsibility for handling out-of-order packets (i.e. TCP retransmission, qdisc drops) falls on the netdev driver. The netdev driver keeps track of the expected TCP SN from the NIC's perspective. If the next packet to transmit matches the expected TCP SN, the driver advances the expected TCP SN, and transmits the packet with TLS offload indication. If the next packet to transmit does not match the expected TCP SN. The driver calls the TLS layer to obtain the TLS record that includes the TCP of the packet for transmission. Using this TLS record, the driver posts a work entry on the transmit queue to reconstruct the NIC TLS state required for the offload of the out-of-order packet. It updates the expected TCP SN accordingly and transmits the now in-order packet. The same queue is used for packet transmission and TLS context reconstruction to avoid the need for flushing the transmit queue before issuing the context reconstruction request. Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com> Signed-off-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-01net/tls: Split conf to rx + txBoris Pismenny
In TLS inline crypto, we can have one direction in software and another in hardware. Thus, we split the TLS configuration to separate structures for receive and transmit. Signed-off-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-01net: Add TLS TX offload featuresIlya Lesokhin
This patch adds a netdev feature to configure TLS TX offloads. Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com> Signed-off-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-01net: Add Software fallback infrastructure for socket dependent offloadsIlya Lesokhin
With socket dependent offloads we rely on the netdev to transform the transmitted packets before sending them to the wire. When a packet from an offloaded socket is rerouted to a different device we need to detect it and do the transformation in software. Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com> Signed-off-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-01net: Rename and export copy_skb_headerIlya Lesokhin
copy_skb_header is renamed to skb_copy_header and exported. Exposing this function give more flexibility in copying SKBs. skb_copy and skb_copy_expand do not give enough control over which parts are copied. Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com> Signed-off-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-01tcp: Add clean acked data hookIlya Lesokhin
Called when a TCP segment is acknowledged. Could be used by application protocols who hold additional metadata associated with the stream data. This is required by TLS device offload to release metadata associated with acknowledged TLS records. Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com> Signed-off-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-01change the comment of vti6_ioctlSun Lianwen
The comment of vti6_ioctl() is wrong. which use vti6_tnl_ioctl instead of vti6_ioctl. Signed-off-by: Sun Lianwen <sunlw.fnst@cn.fujitsu.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2018-04-30net: bridge: Publish bridge accessor functionsPetr Machata
Add a couple new functions to allow querying FDB and vlan settings of a bridge. Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-30ipv6: sr: extract the right key values for "seg6_make_flowlabel"Ahmed Abdelsalam
The seg6_make_flowlabel() is used by seg6_do_srh_encap() to compute the flowlabel from a given skb. It relies on skb_get_hash() which eventually calls __skb_flow_dissect() to extract the flow_keys struct values from the skb. In case of IPv4 traffic, calling seg6_make_flowlabel() after skb_push(), skb_reset_network_header(), and skb_mac_header_rebuild() will results in flow_keys struct of all key values set to zero. This patch calls seg6_make_flowlabel() before resetting the headers of skb to get the right key values. Extracted Key values are based on the type inner packet as follows: 1) IPv6 traffic: src_IP, dst_IP, L4 proto, and flowlabel of inner packet. 2) IPv4 traffic: src_IP, dst_IP, L4 proto, src_port, and dst_port 3) L2 traffic: depends on what kind of traffic carried into the L2 frame. IPv6 and IPv4 traffic works as discussed 1) and 2) Here a hex_dump of struct flow_keys for IPv4 and IPv6 traffic 10.100.1.100: 47302 > 30.0.0.2: 5001 00000000: 14 00 02 00 00 00 00 00 08 00 11 00 00 00 00 00 00000010: 00 00 00 00 00 00 00 00 13 89 b8 c6 1e 00 00 02 00000020: 0a 64 01 64 fc00:a1:a > b2::2 00000000: 28 00 03 00 00 00 00 00 86 dd 11 00 99 f9 02 00 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 b2 00 00 00000020: 00 00 00 00 00 00 00 00 00 00 00 02 fc 00 00 a1 00000030: 00 00 00 00 00 00 00 00 00 00 00 0a Signed-off-by: Ahmed Abdelsalam <amsalam20@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-30erspan: auto detect truncated packets.William Tu
Currently the truncated bit is set only when the mirrored packet is larger than mtu. For certain cases, the packet might already been truncated before sending to the erspan tunnel. In this case, the patch detect whether the IP header's total length is larger than the actual skb->len. If true, this indicated that the mirrored packet is truncated and set the erspan truncate bit. I tested the patch using bpf_skb_change_tail helper function to shrink the packet size and send to erspan tunnel. Reported-by: Xiaoyan Jin <xiaoyanj@vmware.com> Signed-off-by: William Tu <u9012063@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-29net: core: Assert the size of netdev_featres_tFlorian Fainelli
We have about 53 netdev_features_t bits defined and counting, add a build time check to catch when an u64 type will not be enough and we will have to convert that to a bitmap. This is done in register_netdevice() for convenience. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-29net: Revoke export for __skb_tx_hash, update it to just be static skb_tx_hashAlexander Duyck
I am dropping the export of __skb_tx_hash as after my patches nobody is using it outside of the net/core/dev.c file. In addition I am renaming and repurposing it to just be a static declaration of skb_tx_hash since that was the only user for it at this point. By doing this the compiler can inline it into __netdev_pick_tx as that will improve performance. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-29tcp: add TCP_ZEROCOPY_RECEIVE support for zerocopy receiveEric Dumazet
When adding tcp mmap() implementation, I forgot that socket lock had to be taken before current->mm->mmap_sem. syzbot eventually caught the bug. Since we can not lock the socket in tcp mmap() handler we have to split the operation in two phases. 1) mmap() on a tcp socket simply reserves VMA space, and nothing else. This operation does not involve any TCP locking. 2) getsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, ...) implements the transfert of pages from skbs to one VMA. This operation only uses down_read(&current->mm->mmap_sem) after holding TCP lock, thus solving the lockdep issue. This new implementation was suggested by Andy Lutomirski with great details. Benefits are : - Better scalability, in case multiple threads reuse VMAS (without mmap()/munmap() calls) since mmap_sem wont be write locked. - Better error recovery. The previous mmap() model had to provide the expected size of the mapping. If for some reason one part could not be mapped (partial MSS), the whole operation had to be aborted. With the tcp_zerocopy_receive struct, kernel can report how many bytes were successfuly mapped, and how many bytes should be read to skip the problematic sequence. - No more memory allocation to hold an array of page pointers. 16 MB mappings needed 32 KB for this array, potentially using vmalloc() :/ - skbs are freed while mmap_sem has been released Following patch makes the change in tcp_mmap tool to demonstrate one possible use of mmap() and setsockopt(... TCP_ZEROCOPY_RECEIVE ...) Note that memcg might require additional changes. Fixes: 93ab6cc69162 ("tcp: implement mmap() for zero copy receive") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Suggested-by: Andy Lutomirski <luto@kernel.org> Cc: linux-mm@kvack.org Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-29bridge: check iface upper dev when setting master via ioctlHangbin Liu
When we set a bond slave's master to bridge via ioctl, we only check the IFF_BRIDGE_PORT flag. Although we will find the slave's real master at netdev_master_upper_dev_link() later, it already does some settings and allocates some resources. It would be better to return as early as possible. v1 -> v2: use netdev_master_upper_dev_get() instead of netdev_has_any_upper_dev() to check if we have a master, because not all upper devs are masters, e.g. vlan device. Reported-by: syzbot+de73361ee4971b6e6f75@syzkaller.appspotmail.com Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-27udp: remove stray export symbolWillem de Bruijn
UDP GSO needs to export __udp_gso_segment to call it from ipv6. I accidentally exported static ipv4 function __udp4_gso_segment. Remove that EXPORT_SYMBOL_GPL. Fixes: ee80d1ebe5ba ("udp: add udp gso") Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-27net: support compat 64-bit time in {s,g}etsockoptLance Richardson
For the x32 ABI, struct timeval has two 64-bit fields. However the kernel currently interprets the user-space values used for the SO_RCVTIMEO and SO_SNDTIMEO socket options as having a pair of 32-bit fields. When the seconds portion of the requested timeout is less than 2**32, the seconds portion of the effective timeout is correct but the microseconds portion is zero. When the seconds portion of the requested timeout is zero and the microseconds portion is non-zero, the kernel interprets the timeout as zero (never timeout). Fix by using 64-bit time for SO_RCVTIMEO/SO_SNDTIMEO as required for the ABI. The code included below demonstrates the problem. Results before patch: $ gcc -m64 -Wall -O2 -o socktmo socktmo.c && ./socktmo recv time: 2.008181 seconds send time: 2.015985 seconds $ gcc -m32 -Wall -O2 -o socktmo socktmo.c && ./socktmo recv time: 2.016763 seconds send time: 2.016062 seconds $ gcc -mx32 -Wall -O2 -o socktmo socktmo.c && ./socktmo recv time: 1.007239 seconds send time: 1.023890 seconds Results after patch: $ gcc -m64 -O2 -Wall -o socktmo socktmo.c && ./socktmo recv time: 2.010062 seconds send time: 2.015836 seconds $ gcc -m32 -O2 -Wall -o socktmo socktmo.c && ./socktmo recv time: 2.013974 seconds send time: 2.015981 seconds $ gcc -mx32 -O2 -Wall -o socktmo socktmo.c && ./socktmo recv time: 2.030257 seconds send time: 2.013383 seconds #include <stdio.h> #include <stdlib.h> #include <sys/socket.h> #include <sys/types.h> #include <sys/time.h> void checkrc(char *str, int rc) { if (rc >= 0) return; perror(str); exit(1); } static char buf[1024]; int main(int argc, char **argv) { int rc; int socks[2]; struct timeval tv; struct timeval start, end, delta; rc = socketpair(AF_UNIX, SOCK_STREAM, 0, socks); checkrc("socketpair", rc); /* set timeout to 1.999999 seconds */ tv.tv_sec = 1; tv.tv_usec = 999999; rc = setsockopt(socks[0], SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof tv); rc = setsockopt(socks[0], SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof tv); checkrc("setsockopt", rc); /* measure actual receive timeout */ gettimeofday(&start, NULL); rc = recv(socks[0], buf, sizeof buf, 0); gettimeofday(&end, NULL); timersub(&end, &start, &delta); printf("recv time: %ld.%06ld seconds\n", (long)delta.tv_sec, (long)delta.tv_usec); /* fill send buffer */ do { rc = send(socks[0], buf, sizeof buf, 0); } while (rc > 0); /* measure actual send timeout */ gettimeofday(&start, NULL); rc = send(socks[0], buf, sizeof buf, 0); gettimeofday(&end, NULL); timersub(&end, &start, &delta); printf("send time: %ld.%06ld seconds\n", (long)delta.tv_sec, (long)delta.tv_usec); exit(0); } Fixes: 515c7af85ed9 ("x32: Use compat shims for {g,s}etsockopt") Reported-by: Gopal RajagopalSai <gopalsr83@gmail.com> Signed-off-by: Lance Richardson <lance.richardson.net@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-27net: qrtr: Expose tunneling endpoint to user spaceBjorn Andersson
This implements a misc character device named "qrtr-tun" for the purpose of allowing user space applications to implement endpoints in the qrtr network. This allows more advanced (and dynamic) testing of the qrtr code as well as opens up the ability of tunneling qrtr over a network or USB link. Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-27sctp: allow unsetting sockopt MAXSEGMarcelo Ricardo Leitner
RFC 6458 Section 8.1.16 says that setting MAXSEG as 0 means that the user is not limiting it, and not that it should set to the *current* maximum, as we are doing. This patch thus allow setting it as 0, effectively removing the user limit. Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-27sctp: consider idata chunks when setting SCTP_MAXSEGMarcelo Ricardo Leitner
When setting SCTP_MAXSEG sock option, it should consider which kind of data chunk is being used if the asoc is already available, so that the limit better reflect reality. Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-27sctp: honor PMTU_DISABLED when handling icmpMarcelo Ricardo Leitner
sctp_sendmsg() could trigger PMTU updates even when PMTU_DISABLED was set, as pmtu_pending could be set unconditionally during icmp handling if the socket was in use by the application. This patch fixes it by checking for PMTU_DISABLED when handling such deferred updates. Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-27sctp: re-use sctp_transport_pmtu in sctp_transport_routeMarcelo Ricardo Leitner
sctp_transport_route currently is very similar to sctp_transport_pmtu plus a few other bits. This patch reuses sctp_transport_pmtu in sctp_transport_route and removes the duplicated code. Also, as all calls to sctp_transport_route were forcing the dst release before calling it, let's just include such release too. Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-27sctp: remove sctp_transport_pmtu_checkMarcelo Ricardo Leitner
We are now keeping the MTU information synced between asoc, transport and dst, which makes the check at sctp_packet_config() not needed anymore. As it was the sole caller to this function, lets remove it. Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-27sctp: introduce sctp_dst_mtuMarcelo Ricardo Leitner
Which makes sure that the MTU respects the minimum value of SCTP_DEFAULT_MINSEGMENT and that it is correctly aligned. Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-27sctp: remove sctp_assoc_pending_pmtuMarcelo Ricardo Leitner
No need for this helper. Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-27sctp: introduce sctp_assoc_update_frag_pointMarcelo Ricardo Leitner
and avoid the open-coded versions of it. Now sctp_datamsg_from_user can just re-use asoc->frag_point as it will always be updated. Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-27sctp: introduce sctp_mtu_payloadMarcelo Ricardo Leitner
When given a MTU, this function calculates how much payload we can carry on it. Without a MTU, it calculates the amount of header overhead we have. So that when we have extra overhead, like the one added for IP options on SELinux patches, it is easier to handle it. Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-27sctp: introduce sctp_assoc_set_pmtuMarcelo Ricardo Leitner
All changes to asoc PMTU should now go through this wrapper, making it easier to track them and to do other actions upon it. Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-27sctp: remove an if() that is always trueMarcelo Ricardo Leitner
As noticed by Xin Long, the if() here is always true as PMTU can never be 0. Reported-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-27sctp: move transport pathmtu calc away of sctp_assoc_add_peerMarcelo Ricardo Leitner
There was only one case that sctp_assoc_add_peer couldn't handle, which is when SPP_PMTUD_DISABLE is set and pathmtu not initialized. So add this situation to sctp_transport_route and reuse what was already in there. Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-27tcp: remove mss check in tcp_select_initial_window()Wei Wang
In tcp_select_initial_window(), we only set rcv_wnd to tcp_default_init_rwnd() if current mss > (1 << wscale). Otherwise, rcv_wnd is kept at the full receive space of the socket which is a value way larger than tcp_default_init_rwnd(). With larger initial rcv_wnd value, receive buffer autotuning logic takes longer to kick in and increase the receive buffer. In a TCP throughput test where receiver has rmem[2] set to 125MB (wscale is 11), we see the connection gets recvbuf limited at the beginning of the connection and gets less throughput overall. Signed-off-by: Wei Wang <weiwan@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-27net/smc: handle sockopt TCP_DEFER_ACCEPTUrsula Braun
If sockopt TCP_DEFER_ACCEPT is set, the accept is delayed till data is available. Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-27net/smc: sockopts TCP_NODELAY and TCP_CORKUrsula Braun
Setting sockopt TCP_NODELAY or resetting sockopt TCP_CORK triggers data transfer. For a corked SMC socket RDMA writes are deferred, if there is still sufficient send buffer space available. Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-27net/smc: handle sockopts forcing fallbackUrsula Braun
Several TCP sockopts do not work for SMC. One example are the TCP_FASTOPEN sockopts, since SMC-connection setup is based on the TCP three-way-handshake. If the SMC socket is still in state SMC_INIT, such sockopts trigger fallback to TCP. Otherwise an error is returned. Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-27net/smc: fix structure sizeKarsten Graul
The struct smc_cdc_msg must be defined as packed so the size is 44 bytes. And change the structure size check so sizeof is checked. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>