summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2016-02-10batman-adv: Convert batadv_orig_node_vlan to krefSven Eckelmann
batman-adv uses a self-written reference implementation which is just based on atomic_t. This is less obvious when reading the code than kref and therefore increases the change that the reference counting will be missed. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-02-10batman-adv: Convert batadv_hard_iface to krefSven Eckelmann
batman-adv uses a self-written reference implementation which is just based on atomic_t. This is less obvious when reading the code than kref and therefore increases the change that the reference counting will be missed. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-02-10batman-adv: Convert batadv_neigh_node to krefSven Eckelmann
batman-adv uses a self-written reference implementation which is just based on atomic_t. This is less obvious when reading the code than kref and therefore increases the change that the reference counting will be missed. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-02-10batman-adv: Convert batadv_orig_ifinfo to krefSven Eckelmann
batman-adv uses a self-written reference implementation which is just based on atomic_t. This is less obvious when reading the code than kref and therefore increases the change that the reference counting will be missed. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-02-10batman-adv: Convert batadv_neigh_ifinfo to krefSven Eckelmann
batman-adv uses a self-written reference implementation which is just based on atomic_t. This is less obvious when reading the code than kref and therefore increases the change that the reference counting will be missed. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-02-10batman-adv: Convert batadv_tt_orig_list_entry to krefSven Eckelmann
batman-adv uses a self-written reference implementation which is just based on atomic_t. This is less obvious when reading the code than kref and therefore increases the change that the reference counting will be missed. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-02-10batman-adv: Convert batadv_tvlv_handler to krefSven Eckelmann
batman-adv uses a self-written reference implementation which is just based on atomic_t. This is less obvious when reading the code than kref and therefore increases the change that the reference counting will be missed. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-02-10batman-adv: Convert batadv_tvlv_container to krefSven Eckelmann
batman-adv uses a self-written reference implementation which is just based on atomic_t. This is less obvious when reading the code than kref and therefore increases the change that the reference counting will be missed. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-02-10batman-adv: Convert batadv_dat_entry to krefSven Eckelmann
batman-adv uses a self-written reference implementation which is just based on atomic_t. This is less obvious when reading the code than kref and therefore increases the change that the reference counting will be missed. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-02-10batman-adv: Convert batadv_nc_path to krefSven Eckelmann
batman-adv uses a self-written reference implementation which is just based on atomic_t. This is less obvious when reading the code than kref and therefore increases the change that the reference counting will be missed. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-02-10batman-adv: Convert batadv_nc_node to krefSven Eckelmann
batman-adv uses a self-written reference implementation which is just based on atomic_t. This is less obvious when reading the code than kref and therefore increases the change that the reference counting will be missed. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-02-10batman-adv: Convert batadv_bla_claim to krefSven Eckelmann
batman-adv uses a self-written reference implementation which is just based on atomic_t. This is less obvious when reading the code than kref and therefore increases the change that the reference counting will be missed. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-02-10batman-adv: Convert batadv_bla_backbone_gw to krefSven Eckelmann
batman-adv uses a self-written reference implementation which is just based on atomic_t. This is less obvious when reading the code than kref and therefore increases the change that the reference counting will be missed. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-02-10batman-adv: Convert batadv_softif_vlan to krefSven Eckelmann
batman-adv uses a self-written reference implementation which is just based on atomic_t. This is less obvious when reading the code than kref and therefore increases the change that the reference counting will be missed. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-02-10batman-adv: Convert batadv_gw_node to krefSven Eckelmann
batman-adv uses a self-written reference implementation which is just based on atomic_t. This is less obvious when reading the code than kref and therefore increases the change that the reference counting will be missed. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-02-10batman-adv: Convert batadv_hardif_neigh_node to krefSven Eckelmann
batman-adv uses a self-written reference implementation which is just based on atomic_t. This is less obvious when reading the code than kref and therefore increases the change that the reference counting will be missed. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-02-10batman-adv: Add lockdep assert for container_list_lockSven Eckelmann
The batadv_tvlv_container* functions state in their kernel-doc that they require tvlv.container_list_lock. Add an assert to automatically detect when this might have been ignored by the caller. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-02-10batman-adv: add seqno maximum age and protection start flag parametersSimon Wunderlich
To allow future use of the window protected function with different maximum sequence numbers, add a parameter to set this value which was previously hardcoded. Another parameter added for future use is a flag to return whether the protection window has started. While at it, also fix the kerneldoc. Signed-off-by: Simon Wunderlich <simon@open-mesh.com> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-02-10batman-adv: Drop reference to netdevice on last referenceSven Eckelmann
The references to the network device should be dropped inside the release function for batadv_hard_iface similar to what is done with the batman-adv internal datastructures. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-02-10vxlan, gre, geneve: Set a large MTU on ovs-created tunnel devicesDavid Wragg
Prior to 4.3, openvswitch tunnel vports (vxlan, gre and geneve) could transmit vxlan packets of any size, constrained only by the ability to send out the resulting packets. 4.3 introduced netdevs corresponding to tunnel vports. These netdevs have an MTU, which limits the size of a packet that can be successfully encapsulated. The default MTU values are low (1500 or less), which is awkwardly small in the context of physical networks supporting jumbo frames, and leads to a conspicuous change in behaviour for userspace. Instead, set the MTU on openvswitch-created netdevs to be the relevant maximum (i.e. the maximum IP packet size minus any relevant overhead), effectively restoring the behaviour prior to 4.3. Signed-off-by: David Wragg <david@weave.works> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-09flow_dissector: Fix unaligned access in __skb_flow_dissector when used by ↵Alexander Duyck
eth_get_headlen This patch fixes an issue with unaligned accesses when using eth_get_headlen on a page that was DMA aligned instead of being IP aligned. The fact is when trying to check the length we don't need to be looking at the flow label so we can reorder the checks to first check if we are supposed to gather the flow label and then make the call to actually get it. v2: Updated path so that either STOP_AT_FLOW_LABEL or KEY_FLOW_LABEL can cause us to check for the flow label. Reported-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-09packet: tpacket_snd gso and checksum offloadWillem de Bruijn
Support socket option PACKET_VNET_HDR together with PACKET_TX_RING. When enabled, a struct virtio_net_hdr is expected to precede the data in the ring. The vnet option must be set before the ring is created. The implementation reuses the existing skb_copy_bits code that is used when dev->hard_header_len is non-zero. Move this ll_header check to before the skb alloc and combine it with a test for vnet_hdr->hdr_len. Allocate and copy the max of the two. Verified with test program at github.com/wdebruij/kerneltools/blob/master/tests/psock_txring_vnet.c Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-09packet: parse tpacket header before skb allocWillem de Bruijn
GSO packet headers must be stored in the linear skb segment. Move tpacket header parsing before sock_alloc_send_skb. The GSO follow-on patch will later increase the skb linear argument to sock_alloc_send_skb if needed for large packets. The header parsing code does not require an allocated skb, so is safe to move. Later pass to tpacket_fill_skb the computed data start and length. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-09packet: vnet_hdr support for tpacket_rcvWillem de Bruijn
Support socket option PACKET_VNET_HDR together with PACKET_RX_RING. When enabled, a struct virtio_net_hdr will precede the data in the packet ring slots. Verified with test program at github.com/wdebruij/kerneltools/blob/master/tests/psock_rxring_vnet.c pkt: 1454269209.798420 len=5066 vnet: gso_type=tcpv4 gso_size=1448 hlen=66 ecn=off csum: start=34 off=16 eth: proto=0x800 ip: src=<masked> dst=<masked> proto=6 len=5052 Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-09packet: move vnet_hdr code to helper functionsWillem de Bruijn
packet_snd and packet_rcv support virtio net headers for GSO. Move this logic into helper functions to be able to reuse it in tpacket_snd and tpacket_rcv. This is a straighforward code move with one exception. Instead of creating and passing a separate gso_type variable, reuse vnet_hdr.gso_type after conversion from virtio to kernel gso type. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-09sctp: translate network order to host order when users get a hmacidXin Long
Commit ed5a377d87dc ("sctp: translate host order to network order when setting a hmacid") corrected the hmacid byte-order when setting a hmacid. but the same issue also exists on getting a hmacid. We fix it by changing hmacids to host order when users get them with getsockopt. Fixes: Commit ed5a377d87dc ("sctp: translate host order to network order when setting a hmacid") Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-09bridge: mdb: Passing the port-group pointer to br_mdb moduleElad Raz
Passing the port-group to br_mdb in order to allow direct access to the structure. br_mdb will later use the structure to reflect HW reflection status via "state" variable. Signed-off-by: Elad Raz <eladr@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-09bridge: mdb: Separate br_mdb_entry->state from net_bridge_port_group->stateElad Raz
Change net_bridge_port_group 'state' member to 'flags' and define new set of flags internal to the kernel. Signed-off-by: Elad Raz <eladr@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-09net:Add sysctl_max_skb_fragsHans Westgaard Ry
Devices may have limits on the number of fragments in an skb they support. Current codebase uses a constant as maximum for number of fragments one skb can hold and use. When enabling scatter/gather and running traffic with many small messages the codebase uses the maximum number of fragments and may thereby violate the max for certain devices. The patch introduces a global variable as max number of fragments. Signed-off-by: Hans Westgaard Ry <hans.westgaard.ry@oracle.com> Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-09tcp: do not drop syn_recv on all icmp reportsEric Dumazet
Petr Novopashenniy reported that ICMP redirects on SYN_RECV sockets were leading to RST. This is of course incorrect. A specific list of ICMP messages should be able to drop a SYN_RECV. For instance, a REDIRECT on SYN_RECV shall be ignored, as we do not hold a dst per SYN_RECV pseudo request. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=111751 Fixes: 079096f103fa ("tcp/dccp: install syn_recv requests into ehash table") Reported-by: Petr Novopashenniy <pety@rusnet.ru> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-08ipv6: fix a lockdep splatEric Dumazet
Silence lockdep false positive about rcu_dereference() being used in the wrong context. First one should use rcu_dereference_protected() as we own the spinlock. Second one should be a normal assignation, as no barrier is needed. Fixes: 18367681a10bd ("ipv6 flowlabel: Convert np->ipv6_fl_list to RCU.") Reported-by: Dave Jones <davej@codemonkey.org.uk> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-08unix: correctly track in-flight fds in sending process user_structHannes Frederic Sowa
The commit referenced in the Fixes tag incorrectly accounted the number of in-flight fds over a unix domain socket to the original opener of the file-descriptor. This allows another process to arbitrary deplete the original file-openers resource limit for the maximum of open files. Instead the sending processes and its struct cred should be credited. To do so, we add a reference counted struct user_struct pointer to the scm_fp_list and use it to account for the number of inflight unix fds. Fixes: 712f4aad406bb1 ("unix: properly account for FDs passed over unix sockets") Reported-by: David Herrmann <dh.herrmann@gmail.com> Cc: David Herrmann <dh.herrmann@gmail.com> Cc: Willy Tarreau <w@1wt.eu> Cc: Linus Torvalds <torvalds@linux-foundation.org> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-08netfilter: nft_counter: fix erroneous return valuesAnton Protopopov
The nft_counter_init() and nft_counter_clone() functions should return negative error value -ENOMEM instead of positive ENOMEM. Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-02-08netfilter: tee: select NF_DUP_IPV6 unconditionallyArnd Bergmann
The NETFILTER_XT_TARGET_TEE option selects NF_DUP_IPV6 whenever IP6_NF_IPTABLES is enabled, and it ensures that it cannot be builtin itself if NF_CONNTRACK is a loadable module, as that is a dependency for NF_DUP_IPV6. However, NF_DUP_IPV6 can be enabled even if IP6_NF_IPTABLES is turned off, and it only really depends on IPV6. With the current check in tee_tg6, we call nf_dup_ipv6() whenever NF_DUP_IPV6 is enabled. This can however be a loadable module which is unreachable from a built-in xt_TEE: net/built-in.o: In function `tee_tg6': :(.text+0x67728): undefined reference to `nf_dup_ipv6' The bug was originally introduced in the split of the xt_TEE module into separate modules for ipv4 and ipv6, and two patches tried to fix it unsuccessfully afterwards. This is a revert of the the first incorrect attempt to fix it, going back to depending on IPV6 as the dependency, and we adapt the 'select' condition accordingly. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Fixes: bbde9fc1824a ("netfilter: factor out packet duplication for IPv4/IPv6") Fixes: 116984a316c3 ("netfilter: xt_TEE: use IS_ENABLED(CONFIG_NF_DUP_IPV6)") Fixes: 74ec4d55c4d2 ("netfilter: fix xt_TEE and xt_TPROXY dependencies") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-02-08netfilter: nfnetlink: correctly validate length of batch messagesPhil Turnbull
If nlh->nlmsg_len is zero then an infinite loop is triggered because 'skb_pull(skb, msglen);' pulls zero bytes. The calculation in nlmsg_len() underflows if 'nlh->nlmsg_len < NLMSG_HDRLEN' which bypasses the length validation and will later trigger an out-of-bound read. If the length validation does fail then the malformed batch message is copied back to userspace. However, we cannot do this because the nlh->nlmsg_len can be invalid. This leads to an out-of-bounds read in netlink_ack: [ 41.455421] ================================================================== [ 41.456431] BUG: KASAN: slab-out-of-bounds in memcpy+0x1d/0x40 at addr ffff880119e79340 [ 41.456431] Read of size 4294967280 by task a.out/987 [ 41.456431] ============================================================================= [ 41.456431] BUG kmalloc-512 (Not tainted): kasan: bad access detected [ 41.456431] ----------------------------------------------------------------------------- ... [ 41.456431] Bytes b4 ffff880119e79310: 00 00 00 00 d5 03 00 00 b0 fb fe ff 00 00 00 00 ................ [ 41.456431] Object ffff880119e79320: 20 00 00 00 10 00 05 00 00 00 00 00 00 00 00 00 ............... [ 41.456431] Object ffff880119e79330: 14 00 0a 00 01 03 fc 40 45 56 11 22 33 10 00 05 .......@EV."3... [ 41.456431] Object ffff880119e79340: f0 ff ff ff 88 99 aa bb 00 14 00 0a 00 06 fe fb ................ ^^ start of batch nlmsg with nlmsg_len=4294967280 ... [ 41.456431] Memory state around the buggy address: [ 41.456431] ffff880119e79400: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 41.456431] ffff880119e79480: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 41.456431] >ffff880119e79500: 00 00 00 00 fc fc fc fc fc fc fc fc fc fc fc fc [ 41.456431] ^ [ 41.456431] ffff880119e79580: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 41.456431] ffff880119e79600: fc fc fc fc fc fc fc fc fc fc fb fb fb fb fb fb [ 41.456431] ================================================================== Fix this with better validation of nlh->nlmsg_len and by setting NFNL_BATCH_FAILURE if any batch message fails length validation. CAP_NET_ADMIN is required to trigger the bugs. Fixes: 9ea2aa8b7dba ("netfilter: nfnetlink: validate nfnetlink header from batch") Signed-off-by: Phil Turnbull <phil.turnbull@oracle.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-02-08tcp: Fix syncookies sysctl default.David S. Miller
Unintentionally the default was changed to zero, fix that. Fixes: 12ed8244ed ("ipv4: Namespaceify tcp syncookies sysctl knob") Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-07ipv4: Namespaceify tcp_notsent_lowat sysctl knobNikolay Borisov
Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-07ipv4: Namespaceify tcp_fin_timeout sysctl knobNikolay Borisov
Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-07ipv4: Namespaceify tcp_orphan_retries sysctl knobNikolay Borisov
Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-07ipv4: Namespaceify tcp_retries2 sysctl knobNikolay Borisov
Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-07ipv4: Namespaceify tcp_retries1 sysctl knobNikolay Borisov
Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-07ipv4: Namespaceify tcp reordering sysctl knobNikolay Borisov
Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-07ipv4: Namespaceify tcp syncookies sysctl knobNikolay Borisov
Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-07ipv4: Namespaceify tcp synack retries sysctl knobNikolay Borisov
Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-07ipv4: Namespaceify tcp syn retries sysctl knobNikolay Borisov
Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-07Merge tag 'batman-adv-for-davem' of git://git.open-mesh.org/linux-mergeDavid S. Miller
Antonio Quartulli says: ==================== This batch of patches includes a number of corrections and improvements for our kernel-doc. These changes also make sure that our doc is now properly processed by the kernel-doc parsing tool. Other than that you have a patch updating all the copyright lines to 2016 and another patch switching the URLs in our readme, Kconfig and MAINTAINERS file from "http" to "https". Both by Sven Eckelmann. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-07tcp: tcp_cong_control helperYuchung Cheng
Refactor and consolidate cwnd and rate updates into a new function tcp_cong_control(). Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-07tcp: make congestion control more robust against reorderingYuchung Cheng
This change enables congestion control to update cwnd based on not only packet cumulatively acked but also packets delivered out-of-order. This makes congestion control robust against packet reordering because it may raise cwnd as long as packets are being delivered once reordering has been detected (i.e., it only cares the amount of packets delivered, not the ordering among them). Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-07tcp: refactor pkts acked accountingYuchung Cheng
A small refactoring that gets number of packets cumulatively acked from tcp_clean_rtx_queue() directly. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-07tcp: new delivery accountingYuchung Cheng
This patch changes the accounting of how many packets are newly acked or sacked when the sender receives an ACK. The current approach basically computes newly_acked_sacked = (prior_packets - prior_sacked) - (tp->packets_out - tp->sacked_out) where prior_packets and prior_sacked out are snapshot at the beginning of the ACK processing. The new approach tracks the delivery information via a new TCP state variable "delivered" which monotically increases as new packets are delivered in order or out-of-order. The reason for this change is that the current approach is brittle that produces negative or inaccurate estimate. 1) For non-SACK connections, an ACK that advances the SND.UNA could reset the DUPACK counters (tp->sacked_out) in tcp_process_loss() or tcp_fastretrans_alert(). This inflates the inflight suddenly and causes under-estimate or even negative estimate. Here is a real example: before after (processing ACK) packets_out 75 73 sacked_out 23 0 ca state Loss Open The old approach computes (75-23) - (73 - 0) = -21 delivered while the new approach computes 1 delivered since it considers the 2nd-24th packets are delivered OOO. 2) MSS change would re-count packets_out and sacked_out so the estimate is in-accurate and can even become negative. E.g., the inflight is doubled when MSS is halved. 3) Spurious retransmission signaled by DSACK is not accounted The new approach is simpler and more robust. For SACK connections, tp->delivered increments as packets are being acked or sacked in SACK and ACK processing. For non-sack connections, it's done in tcp_remove_reno_sacks() and tcp_add_reno_sack(). When an ACK advances the SND.UNA, tp->delivered is incremented by the number of packets ACKed (less the current number of DUPACKs received plus one packet hole). Upon receiving a DUPACK, tp->delivered is incremented assuming one out-of-order packet is delivered. Upon receiving a DSACK, tp->delivered is incremtened assuming one retransmission is delivered in tcp_sacktag_write_queue(). Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>