summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2014-09-29net: tcp: add DCTCP congestion control algorithmDaniel Borkmann
This work adds the DataCenter TCP (DCTCP) congestion control algorithm [1], which has been first published at SIGCOMM 2010 [2], resp. follow-up analysis at SIGMETRICS 2011 [3] (and also, more recently as an informational IETF draft available at [4]). DCTCP is an enhancement to the TCP congestion control algorithm for data center networks. Typical data center workloads are i.e. i) partition/aggregate (queries; bursty, delay sensitive), ii) short messages e.g. 50KB-1MB (for coordination and control state; delay sensitive), and iii) large flows e.g. 1MB-100MB (data update; throughput sensitive). DCTCP has therefore been designed for such environments to provide/achieve the following three requirements: * High burst tolerance (incast due to partition/aggregate) * Low latency (short flows, queries) * High throughput (continuous data updates, large file transfers) with commodity, shallow buffered switches The basic idea of its design consists of two fundamentals: i) on the switch side, packets are being marked when its internal queue length > threshold K (K is chosen so that a large enough headroom for marked traffic is still available in the switch queue); ii) the sender/host side maintains a moving average of the fraction of marked packets, so each RTT, F is being updated as follows: F := X / Y, where X is # of marked ACKs, Y is total # of ACKs alpha := (1 - g) * alpha + g * F, where g is a smoothing constant The resulting alpha (iow: probability that switch queue is congested) is then being used in order to adaptively decrease the congestion window W: W := (1 - (alpha / 2)) * W The means for receiving marked packets resp. marking them on switch side in DCTCP is the use of ECN. RFC3168 describes a mechanism for using Explicit Congestion Notification from the switch for early detection of congestion, rather than waiting for segment loss to occur. However, this method only detects the presence of congestion, not the *extent*. In the presence of mild congestion, it reduces the TCP congestion window too aggressively and unnecessarily affects the throughput of long flows [4]. DCTCP, as mentioned, enhances Explicit Congestion Notification (ECN) processing to estimate the fraction of bytes that encounter congestion, rather than simply detecting that some congestion has occurred. DCTCP then scales the TCP congestion window based on this estimate [4], thus it can derive multibit feedback from the information present in the single-bit sequence of marks in its control law. And thus act in *proportion* to the extent of congestion, not its *presence*. Switches therefore set the Congestion Experienced (CE) codepoint in packets when internal queue lengths exceed threshold K. Resulting, DCTCP delivers the same or better throughput than normal TCP, while using 90% less buffer space. It was found in [2] that DCTCP enables the applications to handle 10x the current background traffic, without impacting foreground traffic. Moreover, a 10x increase in foreground traffic did not cause any timeouts, and thus largely eliminates TCP incast collapse problems. The algorithm itself has already seen deployments in large production data centers since then. We did a long-term stress-test and analysis in a data center, short summary of our TCP incast tests with iperf compared to cubic: This test measured DCTCP throughput and latency and compared it with CUBIC throughput and latency for an incast scenario. In this test, 19 senders sent at maximum rate to a single receiver. The receiver simply ran iperf -s. The senders ran iperf -c <receiver> -t 30. All senders started simultaneously (using local clocks synchronized by ntp). This test was repeated multiple times. Below shows the results from a single test. Other tests are similar. (DCTCP results were extremely consistent, CUBIC results show some variance induced by the TCP timeouts that CUBIC encountered.) For this test, we report statistics on the number of TCP timeouts, flow throughput, and traffic latency. 1) Timeouts (total over all flows, and per flow summaries): CUBIC DCTCP Total 3227 25 Mean 169.842 1.316 Median 183 1 Max 207 5 Min 123 0 Stddev 28.991 1.600 Timeout data is taken by measuring the net change in netstat -s "other TCP timeouts" reported. As a result, the timeout measurements above are not restricted to the test traffic, and we believe that it is likely that all of the "DCTCP timeouts" are actually timeouts for non-test traffic. We report them nevertheless. CUBIC will also include some non-test timeouts, but they are drawfed by bona fide test traffic timeouts for CUBIC. Clearly DCTCP does an excellent job of preventing TCP timeouts. DCTCP reduces timeouts by at least two orders of magnitude and may well have eliminated them in this scenario. 2) Throughput (per flow in Mbps): CUBIC DCTCP Mean 521.684 521.895 Median 464 523 Max 776 527 Min 403 519 Stddev 105.891 2.601 Fairness 0.962 0.999 Throughput data was simply the average throughput for each flow reported by iperf. By avoiding TCP timeouts, DCTCP is able to achieve much better per-flow results. In CUBIC, many flows experience TCP timeouts which makes flow throughput unpredictable and unfair. DCTCP, on the other hand, provides very clean predictable throughput without incurring TCP timeouts. Thus, the standard deviation of CUBIC throughput is dramatically higher than the standard deviation of DCTCP throughput. Mean throughput is nearly identical because even though cubic flows suffer TCP timeouts, other flows will step in and fill the unused bandwidth. Note that this test is something of a best case scenario for incast under CUBIC: it allows other flows to fill in for flows experiencing a timeout. Under situations where the receiver is issuing requests and then waiting for all flows to complete, flows cannot fill in for timed out flows and throughput will drop dramatically. 3) Latency (in ms): CUBIC DCTCP Mean 4.0088 0.04219 Median 4.055 0.0395 Max 4.2 0.085 Min 3.32 0.028 Stddev 0.1666 0.01064 Latency for each protocol was computed by running "ping -i 0.2 <receiver>" from a single sender to the receiver during the incast test. For DCTCP, "ping -Q 0x6 -i 0.2 <receiver>" was used to ensure that traffic traversed the DCTCP queue and was not dropped when the queue size was greater than the marking threshold. The summary statistics above are over all ping metrics measured between the single sender, receiver pair. The latency results for this test show a dramatic difference between CUBIC and DCTCP. CUBIC intentionally overflows the switch buffer which incurs the maximum queue latency (more buffer memory will lead to high latency.) DCTCP, on the other hand, deliberately attempts to keep queue occupancy low. The result is a two orders of magnitude reduction of latency with DCTCP - even with a switch with relatively little RAM. Switches with larger amounts of RAM will incur increasing amounts of latency for CUBIC, but not for DCTCP. 4) Convergence and stability test: This test measured the time that DCTCP took to fairly redistribute bandwidth when a new flow commences. It also measured DCTCP's ability to remain stable at a fair bandwidth distribution. DCTCP is compared with CUBIC for this test. At the commencement of this test, a single flow is sending at maximum rate (near 10 Gbps) to a single receiver. One second after that first flow commences, a new flow from a distinct server begins sending to the same receiver as the first flow. After the second flow has sent data for 10 seconds, the second flow is terminated. The first flow sends for an additional second. Ideally, the bandwidth would be evenly shared as soon as the second flow starts, and recover as soon as it stops. The results of this test are shown below. Note that the flow bandwidth for the two flows was measured near the same time, but not simultaneously. DCTCP performs nearly perfectly within the measurement limitations of this test: bandwidth is quickly distributed fairly between the two flows, remains stable throughout the duration of the test, and recovers quickly. CUBIC, in contrast, is slow to divide the bandwidth fairly, and has trouble remaining stable. CUBIC DCTCP Seconds Flow 1 Flow 2 Seconds Flow 1 Flow 2 0 9.93 0 0 9.92 0 0.5 9.87 0 0.5 9.86 0 1 8.73 2.25 1 6.46 4.88 1.5 7.29 2.8 1.5 4.9 4.99 2 6.96 3.1 2 4.92 4.94 2.5 6.67 3.34 2.5 4.93 5 3 6.39 3.57 3 4.92 4.99 3.5 6.24 3.75 3.5 4.94 4.74 4 6 3.94 4 5.34 4.71 4.5 5.88 4.09 4.5 4.99 4.97 5 5.27 4.98 5 4.83 5.01 5.5 4.93 5.04 5.5 4.89 4.99 6 4.9 4.99 6 4.92 5.04 6.5 4.93 5.1 6.5 4.91 4.97 7 4.28 5.8 7 4.97 4.97 7.5 4.62 4.91 7.5 4.99 4.82 8 5.05 4.45 8 5.16 4.76 8.5 5.93 4.09 8.5 4.94 4.98 9 5.73 4.2 9 4.92 5.02 9.5 5.62 4.32 9.5 4.87 5.03 10 6.12 3.2 10 4.91 5.01 10.5 6.91 3.11 10.5 4.87 5.04 11 8.48 0 11 8.49 4.94 11.5 9.87 0 11.5 9.9 0 SYN/ACK ECT test: This test demonstrates the importance of ECT on SYN and SYN-ACK packets by measuring the connection probability in the presence of competing flows for a DCTCP connection attempt *without* ECT in the SYN packet. The test was repeated five times for each number of competing flows. Competing Flows 1 | 2 | 4 | 8 | 16 ------------------------------ Mean Connection Probability 1 | 0.67 | 0.45 | 0.28 | 0 Median Connection Probability 1 | 0.65 | 0.45 | 0.25 | 0 As the number of competing flows moves beyond 1, the connection probability drops rapidly. Enabling DCTCP with this patch requires the following steps: DCTCP must be running both on the sender and receiver side in your data center, i.e.: sysctl -w net.ipv4.tcp_congestion_control=dctcp Also, ECN functionality must be enabled on all switches in your data center for DCTCP to work. The default ECN marking threshold (K) heuristic on the switch for DCTCP is e.g., 20 packets (30KB) at 1Gbps, and 65 packets (~100KB) at 10Gbps (K > 1/7 * C * RTT, [4]). In above tests, for each switch port, traffic was segregated into two queues. For any packet with a DSCP of 0x01 - or equivalently a TOS of 0x04 - the packet was placed into the DCTCP queue. All other packets were placed into the default drop-tail queue. For the DCTCP queue, RED/ECN marking was enabled, here, with a marking threshold of 75 KB. More details however, we refer you to the paper [2] under section 3). There are no code changes required to applications running in user space. DCTCP has been implemented in full *isolation* of the rest of the TCP code as its own congestion control module, so that it can run without a need to expose code to the core of the TCP stack, and thus nothing changes for non-DCTCP users. Changes in the CA framework code are minimal, and DCTCP algorithm operates on mechanisms that are already available in most Silicon. The gain (dctcp_shift_g) is currently a fixed constant (1/16) from the paper, but we leave the option that it can be chosen carefully to a different value by the user. In case DCTCP is being used and ECN support on peer site is off, DCTCP falls back after 3WHS to operate in normal TCP Reno mode. ss {-4,-6} -t -i diag interface: ... dctcp wscale:7,7 rto:203 rtt:2.349/0.026 mss:1448 cwnd:2054 ssthresh:1102 ce_state 0 alpha 15 ab_ecn 0 ab_tot 735584 send 10129.2Mbps pacing_rate 20254.1Mbps unacked:1822 retrans:0/15 reordering:101 rcv_space:29200 ... dctcp-reno wscale:7,7 rto:201 rtt:0.711/1.327 ato:40 mss:1448 cwnd:10 ssthresh:1102 fallback_mode send 162.9Mbps pacing_rate 325.5Mbps rcv_rtt:1.5 rcv_space:29200 More information about DCTCP can be found in [1-4]. [1] http://simula.stanford.edu/~alizade/Site/DCTCP.html [2] http://simula.stanford.edu/~alizade/Site/DCTCP_files/dctcp-final.pdf [3] http://simula.stanford.edu/~alizade/Site/DCTCP_files/dctcp_analysis-full.pdf [4] http://tools.ietf.org/html/draft-bensley-tcpm-dctcp-00 Joint work with Florian Westphal and Glenn Judd. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Glenn Judd <glenn.judd@morganstanley.com> Acked-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-29net: tcp: more detailed ACK events and events for CE marked packetsFlorian Westphal
DataCenter TCP (DCTCP) determines cwnd growth based on ECN information and ACK properties, e.g. ACK that updates window is treated differently than DUPACK. Also DCTCP needs information whether ACK was delayed ACK. Furthermore, DCTCP also implements a CE state machine that keeps track of CE markings of incoming packets. Therefore, extend the congestion control framework to provide these event types, so that DCTCP can be properly implemented as a normal congestion algorithm module outside of the core stack. Joint work with Daniel Borkmann and Glenn Judd. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Glenn Judd <glenn.judd@morganstanley.com> Acked-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-29net: tcp: split ack slow/fast events from cwnd_eventFlorian Westphal
The congestion control ops "cwnd_event" currently supports CA_EVENT_FAST_ACK and CA_EVENT_SLOW_ACK events (among others). Both FAST and SLOW_ACK are only used by Westwood congestion control algorithm. This removes both flags from cwnd_event and adds a new in_ack_event callback for this. The goal is to be able to provide more detailed information about ACKs, such as whether ECE flag was set, or whether the ACK resulted in a window update. It is required for DataCenter TCP (DCTCP) congestion control algorithm as it makes a different choice depending on ECE being set or not. Joint work with Daniel Borkmann and Glenn Judd. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Glenn Judd <glenn.judd@morganstanley.com> Acked-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-29net: tcp: add flag for ca to indicate that ECN is requiredDaniel Borkmann
This patch adds a flag to TCP congestion algorithms that allows for requesting to mark IPv4/IPv6 sockets with transport as ECN capable, that is, ECT(0), when required by a congestion algorithm. It is currently used and needed in DataCenter TCP (DCTCP), as it requires both peers to assert ECT on all IP packets sent - it uses ECN feedback (i.e. CE, Congestion Encountered information) from switches inside the data center to derive feedback to the end hosts. Therefore, simply add a new flag to icsk_ca_ops. Note that DCTCP's algorithm/behaviour slightly diverges from RFC3168, therefore this is only (!) enabled iff the assigned congestion control ops module has requested this. By that, we can tightly couple this logic really only to the provided congestion control ops. Joint work with Florian Westphal and Glenn Judd. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Glenn Judd <glenn.judd@morganstanley.com> Acked-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-29net: tcp: assign tcp cong_ops when tcp sk is createdFlorian Westphal
Split assignment and initialization from one into two functions. This is required by followup patches that add Datacenter TCP (DCTCP) congestion control algorithm - we need to be able to determine if the connection is moderated by DCTCP before the 3WHS has finished. As we walk the available congestion control list during the assignment, we are always guaranteed to have Reno present as it's fixed compiled-in. Therefore, since we're doing the early assignment, we don't have a real use for the Reno alias tcp_init_congestion_ops anymore and can thus remove it. Actual usage of the congestion control operations are being made after the 3WHS has finished, in some cases however we can access get_info() via diag if implemented, therefore we need to zero out the private area for those modules. Joint work with Daniel Borkmann and Glenn Judd. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Glenn Judd <glenn.judd@morganstanley.com> Acked-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-29net: sched: cls_rcvp, complete rcu conversionJohn Fastabend
This completes the cls_rsvp conversion to RCU safe copy, update semantics. As a result all cases of tcf_exts_change occur on empty lists now. Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-28net_sched: fix another regression in cls_tcindexWANG Cong
Clearly the following change is not expected: - if (!cp.perfect && !cp.h) - cp.alloc_hash = cp.hash; + if (!cp->perfect && cp->h) + cp->alloc_hash = cp->hash; Fixes: commit 331b72922c5f58d48fd ("net: sched: RCU cls_tcindex") Cc: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-28net_sched: fix errno in tcindex_set_parms()WANG Cong
When kmemdup() fails, we should return -ENOMEM. Cc: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-28arp: Do not perturb drop profiles with ignored ARP packetsRick Jones
We do not wish to disturb dropwatch or perf drop profiles with an ARP we will ignore. Signed-off-by: Rick Jones <rick.jones2@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-28net_sched: remove the first parameter from tcf_exts_destroy()WANG Cong
Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Jamal Hadi Salim <hadi@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-28Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Steffen Klassert says: ==================== pull request (net-next): ipsec-next 2014-09-25 1) Remove useless hash_resize_mutex in xfrm_hash_resize(). This mutex is used only there, but xfrm_hash_resize() can't be called concurrently at all. From Ying Xue. 2) Extend policy hashing to prefixed policies based on prefix lenght thresholds. From Christophe Gouault. 3) Make the policy hash table thresholds configurable via netlink. From Christophe Gouault. 4) Remove the maximum authentication length for AH. This was needed to limit stack usage. We switched already to allocate space, so no need to keep the limit. From Herbert Xu. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-28neigh: check error pointer instead of NULL for ipv4_neigh_lookup()WANG Cong
Fixes: commit f187bc6efb7250afee0e2009b6106 ("ipv4: No need to set generic neighbour pointer") Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-28net: dsa: allow switches driver to implement get/set EEEFlorian Fainelli
Allow switches driver to query and enable/disable EEE on a per-port basis by implementing the ethtool_{get,set}_eee settings and delegating these operations to the switch driver. set_eee() will need to coordinate with the PHY driver to make sure that EEE is enabled, the link-partner supports it and the auto-negotiation result is satisfactory. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-28net: dsa: allow enabling and disable switch portsFlorian Fainelli
Whenever a per-port network device is used/unused, invoke the switch driver port_enable/port_disable callbacks to allow saving as much power as possible by disabling unused parts of the switch (RX/TX logic, memory arrays, PHYs...). We supply a PHY device argument to make sure the switch driver can act on the PHY device if needed (like putting/taking the PHY out of deep low power mode). Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-28net: dsa: start and stop the PHY state machineFlorian Fainelli
dsa_slave_open() should start the PHY library state machine for its PHY interface, and dsa_slave_close() should stop the PHY library state machine accordingly. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-28tcp: use tcp_flags in tcp_data_queue()Peter Pan(潘卫平)
This patch is a cleanup which follows the idea in commit e11ecddf5128 (tcp: use TCP_SKB_CB(skb)->tcp_flags in input path), and it may reduce register pressure since skb->cb[] access is fast, bacause skb is probably in a register. v2: remove variable th v3: reword the changelog Signed-off-by: Weiping Pan <panweiping3@gmail.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-28tcp: change tcp_skb_pcount() locationEric Dumazet
Our goal is to access no more than one cache line access per skb in a write or receive queue when doing the various walks. After recent TCP_SKB_CB() reorganizations, it is almost done. Last part is tcp_skb_pcount() which currently uses skb_shinfo(skb)->gso_segs, which is a terrible choice, because it needs 3 cache lines in current kernel (skb->head, skb->end, and shinfo->gso_segs are all in 3 different cache lines, far from skb->cb) This very simple patch reuses space currently taken by tcp_tw_isn only in input path, as tcp_skb_pcount is only needed for skb stored in write queue. This considerably speeds up tcp_ack(), granted we avoid shinfo->tx_flags to get SKBTX_ACK_TSTAMP, which seems possible. This also speeds up all sack processing in general. This speeds up tcp_sendmsg() because it no longer has to access/dirty shinfo. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-28tcp: better TCP_SKB_CB layout to reduce cache line missesEric Dumazet
TCP maintains lists of skb in write queue, and in receive queues (in order and out of order queues) Scanning these lists both in input and output path usually requires access to skb->next, TCP_SKB_CB(skb)->seq, and TCP_SKB_CB(skb)->end_seq These fields are currently in two different cache lines, meaning we waste lot of memory bandwidth when these queues are big and flows have either packet drops or packet reorders. We can move TCP_SKB_CB(skb)->header at the end of TCP_SKB_CB, because this header is not used in fast path. This allows TCP to search much faster in the skb lists. Even with regular flows, we save one cache line miss in fast path. Thanks to Christoph Paasch for noticing we need to cleanup skb->cb[] (IPCB/IP6CB) before entering IP stack in tx path, and that I forgot IPCB use in tcp_v4_hnd_req() and tcp_v4_save_options(). Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-28ipv6: add a struct inet6_skb_parm param to ipv6_opt_accepted()Eric Dumazet
ipv6_opt_accepted() assumes IP6CB(skb) holds the struct inet6_skb_parm that it needs. Lets not assume this, as TCP stack might use a different place. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-28ipv4: rename ip_options_echo to __ip_options_echo()Eric Dumazet
ip_options_echo() assumes struct ip_options is provided in &IPCB(skb)->opt Lets break this assumption, but provide a helper to not change all call points. ip_send_unicast_reply() gets a new struct ip_options pointer. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-28ip6_gre: Return an error when adding an existing tunnel.Steffen Klassert
ip6gre_tunnel_locate() should not return an existing tunnel if create is true. Otherwise it is possible to add the same tunnel multiple times without getting an error. So return NULL if the tunnel that should be created already exists. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-28ip6_vti: Return an error when adding an existing tunnel.Steffen Klassert
vti6_locate() should not return an existing tunnel if create is true. Otherwise it is possible to add the same tunnel multiple times without getting an error. So return NULL if the tunnel that should be created already exists. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-28ip6_tunnel: Return an error when adding an existing tunnel.Steffen Klassert
ip6_tnl_locate() should not return an existing tunnel if create is true. Otherwise it is possible to add the same tunnel multiple times without getting an error. So return NULL if the tunnel that should be created already exists. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-28net: make tcp_cleanup_rbuf privateDan Williams
net_dma was the only external user so this can become local to tcp.c again. Cc: James Morris <jmorris@namei.org> Cc: Patrick McHardy <kaber@trash.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2014-09-28net_dma: revert 'copied_early'Dan Williams
Now that tcp_dma_try_early_copy() is gone nothing ever sets copied_early. Also reverts "53240c208776 tcp: Fix possible double-ack w/ user dma" since it is no longer necessary. Cc: Ali Saidi <saidi@engin.umich.edu> Cc: James Morris <jmorris@namei.org> Cc: Patrick McHardy <kaber@trash.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Neal Cardwell <ncardwell@google.com> Reported-by: Dave Jones <davej@redhat.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2014-09-28net_dma: simple removalDan Williams
Per commit "77873803363c net_dma: mark broken" net_dma is no longer used and there is no plan to fix it. This is the mechanical removal of bits in CONFIG_NET_DMA ifdef guards. Reverting the remainder of the net_dma induced changes is deferred to subsequent patches. Marked for stable due to Roman's report of a memory leak in dma_pin_iovec_pages(): https://lkml.org/lkml/2014/9/3/177 Cc: Dave Jiang <dave.jiang@intel.com> Cc: Vinod Koul <vinod.koul@intel.com> Cc: David Whipple <whipple@securedatainnovations.ch> Cc: Alexander Duyck <alexander.h.duyck@intel.com> Cc: <stable@vger.kernel.org> Reported-by: Roman Gushchin <klamm@yandex-team.ru> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2014-09-26ip6gre: add a rtnl link alias for ip6gretapNicolas Dichtel
With this alias, we don't need to load manually the module before adding an ip6gretap interface with iproute2. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26net : optimize skb_release_data()Eric Dumazet
Cache skb_shinfo(skb) in a variable to avoid computing it multiple times. Reorganize the tests to remove one indentation level. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26net/openvswitch: remove dup comment in vport.hWang Sheng-Hui
Remove the duplicated comment "/* The following definitions are for users of the vport subsytem: */" in vport.h Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com> Acked-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller
Pablo Neira Ayuso says: ==================== nf pull request for net This series contains netfilter fixes for net, they are: 1) Fix lockdep splat in nft_hash when releasing sets from the rcu_callback context. We don't the mutex there anymore. 2) Remove unnecessary spinlock_bh in the destroy path of the nf_tables rbtree set type from rcu_callback context. 3) Fix another lockdep splat in rhashtable. None of the callers hold a mutex when calling rhashtable_destroy. 4) Fix duplicated error reporting from nfnetlink when aborting and replaying a batch. 5) Fix a Kconfig issue reported by kbuild robot. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26net: optimise inet_proto_csum_replace4()LEROY Christophe
csum_partial() is a generic function which is not optimised for small fixed length calculations, and its use requires to store "from" and "to" values in memory while we already have them available in registers. This also has impact, especially on RISC processors. In the same spirit as the change done by Eric Dumazet on csum_replace2(), this patch rewrites inet_proto_csum_replace4() taking into account RFC1624. I spotted during a NATted tcp transfert that csum_partial() is one of top 5 consuming functions (around 8%), and the second user of csum_partial() is inet_proto_csum_replace4(). Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26net: introduce __skb_header_release()Eric Dumazet
While profiling TCP stack, I noticed one useless atomic operation in tcp_sendmsg(), caused by skb_header_release(). It turns out all current skb_header_release() users have a fresh skb, that no other user can see, so we can avoid one atomic operation. Introduce __skb_header_release() to clearly document this. This gave me a 1.5 % improvement on TCP_RR workload. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26Merge tag 'master-2014-09-16' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next John W. Linville says: ==================== pull request: wireless-next 2014-09-22 Please pull this batch of updates intended for the 3.18 stream... For the mac80211 bits, Johannes says: "This time, I have some rate minstrel improvements, support for a very small feature from CCX that Steinar reverse-engineered, dynamic ACK timeout support, a number of changes for TDLS, early support for radio resource measurement and many fixes. Also, I'm changing a number of places to clear key memory when it's freed and Intel claims copyright for code they developed." For the bluetooth bits, Johan says: "Here are some more patches intended for 3.18. Most of them are cleanups or fixes for SMP. The only exception is a fix for BR/EDR L2CAP fixed channels which should now work better together with the L2CAP information request procedure." For the iwlwifi bits, Emmanuel says: "I fix here dvm which was broken by my last pull request. Arik continues to work on TDLS and Luca solved a few issues in CT-Kill. Eyal keeps digging into rate scaling code, more to come soon. Besides this, nothing really special here." Beyond that, there are the usual big batches of updates to ath9k, b43, mwifiex, and wil6210 as well as a handful of other bits here and there. Also, rtlwifi gets some btcoexist attention from Larry. Please let me know if there are problems! ==================== Had to adjust the wil6210 code to comply with Joe Perches's recent change in net-next to make the netdev_*() routines return void instead of 'int'. Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26net: Change netdev_<level> logging functions to return voidJoe Perches
No caller or macro uses the return value so make all the functions return void. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26Merge branch 'for-upstream' of ↵John W. Linville
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
2014-09-26Merge tag 'nfc-next-3.18-1' of ↵John W. Linville
git://git.kernel.org/pub/scm/linux/kernel/git/sameo/nfc-next Samuel Ortiz <sameo@linux.intel.com> says: "NFC: 3.18 pull request This is the NFC pull request for 3.18. We've had major updates for TI and ST Microelectronics drivers: For TI's trf7970a driver: - Target mode support for trf7970a - Suspend/resume support for trf7970a - DT properties additions to handle different quirks - A bunch of fixes for smartphone IOP related issues For ST Microelectronics' ST21NFCA and ST21NFCB drivers: - ISO15693 support for st21nfcb - checkpatch and sparse related warning fixes - Code cleanups and a few minor fixes Finally, Marvell add ISO15693 support to the NCI stack, together with a couple of NCI fixes." Signed-off-by: John W. Linville <linville@tuxdriver.com>
2014-09-26netfilter: bridge: move br_netfilter out of the corePablo Neira Ayuso
Jesper reported that br_netfilter always registers the hooks since this is part of the bridge core. This harms performance for people that don't need this. This patch modularizes br_netfilter so it can be rmmod'ed, thus, the hooks can be unregistered. I think the bridge netfilter should have been a separated module since the beginning, Patrick agreed on that. Note that this is breaking compatibility for users that expect that bridge netfilter is going to be available after explicitly 'modprobe bridge' or via automatic load through brctl. However, the damage can be easily undone by modprobing br_netfilter. The bridge core also spots a message to provide a clue to people that didn't notice that this has been deprecated. On top of that, the plan is that nftables will not rely on this software layer, but integrate the connection tracking into the bridge layer to enable stateful filtering and NAT, which is was bridge netfilter users seem to require. This patch still keeps the fake_dst_ops in the bridge core, since this is required by when the bridge port is initialized. So we can safely modprobe/rmmod br_netfilter anytime. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Acked-by: Florian Westphal <fw@strlen.de>
2014-09-26netfilter: bridge: nf_bridge_copy_header as static inline in headerPablo Neira Ayuso
Move nf_bridge_copy_header() as static inline in netfilter_bridge.h header file. This patch prepares the modularization of the br_netfilter code. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-09-26net/netfilter/x_tables.c: use __seq_open_private()Rob Jones
Reduce boilerplate code by using __seq_open_private() instead of seq_open() in xt_match_open() and xt_target_open(). Signed-off-by: Rob Jones <rob.jones@codethink.co.uk> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-09-26ip_tunnel: Don't allow to add the same tunnel multiple times.Steffen Klassert
When we try to add an already existing tunnel, we don't return an error. Instead we continue and call ip_tunnel_update(). This means that we can change existing tunnels by adding the same tunnel multiple times. It is even possible to change the tunnel endpoints of the fallback device. We fix this by returning an error if we try to add an existing tunnel. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26net: sched: use pinned timersEric Dumazet
While using a MQ + NETEM setup, I had confirmation that the default timer migration ( /proc/sys/kernel/timer_migration ) is killing us. Installing this on a receiver side of a TCP_STREAM test, (NIC has 8 TX queues) : EST="est 1sec 4sec" for ETH in eth1 do tc qd del dev $ETH root 2>/dev/null tc qd add dev $ETH root handle 1: mq tc qd add dev $ETH parent 1:1 $EST netem limit 70000 delay 6ms tc qd add dev $ETH parent 1:2 $EST netem limit 70000 delay 8ms tc qd add dev $ETH parent 1:3 $EST netem limit 70000 delay 10ms tc qd add dev $ETH parent 1:4 $EST netem limit 70000 delay 12ms tc qd add dev $ETH parent 1:5 $EST netem limit 70000 delay 14ms tc qd add dev $ETH parent 1:6 $EST netem limit 70000 delay 16ms tc qd add dev $ETH parent 1:7 $EST netem limit 80000 delay 18ms tc qd add dev $ETH parent 1:8 $EST netem limit 90000 delay 20ms done We can see that timers get migrated into a single cpu, presumably idle at the time timers are set up. Then all qdisc dequeues run from this cpu and huge lock contention happens. This single cpu is stuck in softirq mode and cannot dequeue fast enough. 39.24% [kernel] [k] _raw_spin_lock 2.65% [kernel] [k] netem_enqueue 1.80% [kernel] [k] netem_dequeue 1.63% [kernel] [k] copy_user_enhanced_fast_string 1.45% [kernel] [k] _raw_spin_lock_bh By pinning qdisc timers on the cpu running the qdisc, we respect proper XPS setting and remove this lock contention. 5.84% [kernel] [k] netem_enqueue 4.83% [kernel] [k] _raw_spin_lock 2.92% [kernel] [k] copy_user_enhanced_fast_string Current Qdiscs that benefit from this change are : netem, cbq, fq, hfsc, tbf, htb. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26net: Remove gso_send_check as an offload callbackTom Herbert
The send_check logic was only interesting in cases of TCP offload and UDP UFO where the checksum needed to be initialized to the pseudo header checksum. Now we've moved that logic into the related gso_segment functions so gso_send_check is no longer needed. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26udp: move logic out of udp[46]_ufo_send_checkTom Herbert
In udp[46]_ufo_send_check the UDP checksum initialized to the pseudo header checksum. We can move this logic into udp[46]_ufo_fragment. After this change udp[64]_ufo_send_check is a no-op. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26tcp: move logic out of tcp_v[64]_gso_send_checkTom Herbert
In tcp_v[46]_gso_send_check the TCP checksum is initialized to the pseudo header checksum using __tcp_v[46]_send_check. We can move this logic into new tcp[46]_gso_segment functions to be done when ip_summed != CHECKSUM_PARTIAL (ip_summed == CHECKSUM_PARTIAL should be the common case, possibly always true when taking GSO path). After this change tcp_v[46]_gso_send_check is no-op. Signed-off-by: Tom Herbert <therbert@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-25SUNRPC: Add missing support for RPC_CLNT_CREATE_NO_RETRANS_TIMEOUTTrond Myklebust
The flag RPC_CLNT_CREATE_NO_RETRANS_TIMEOUT was intended introduced in order to allow NFSv4 clients to disable resend timeouts. Since those cause the RPC layer to break the connection, they mess up the duplicate reply caches that remain indexed on the port number in NFSv4.. This patch includes the code that was missing in the original to set the appropriate flag in struct rpc_clnt, when the caller of rpc_create() sets RPC_CLNT_CREATE_NO_RETRANS_TIMEOUT. Fixes: 8a19a0b6cb2e (SUNRPC: Add RPC task and client level options to...) Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-25NFS/SUNRPC: Remove other deadlock-avoidance mechanisms in nfs_release_page()NeilBrown
Now that nfs_release_page() doesn't block indefinitely, other deadlock avoidance mechanisms aren't needed. - it doesn't hurt for kswapd to block occasionally. If it doesn't want to block it would clear __GFP_WAIT. The current_is_kswapd() was only added to avoid deadlocks and we have a new approach for that. - memory allocation in the SUNRPC layer can very rarely try to ->releasepage() a page it is trying to handle. The deadlock is removed as nfs_release_page() doesn't block indefinitely. So we don't need to set PF_FSTRANS for sunrpc network operations any more. Signed-off-by: NeilBrown <neilb@suse.de> Acked-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-25Bluetooth: Rename sco_param_wideband table to esco_param_msbcJohan Hedberg
The sco_param_wideband table represents the eSCO parameters for specifically mSBC encoding. This patch renames the table to the more descriptive esco_param_msbc name. Signed-off-by: Johan Hedberg <johan.hedberg@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2014-09-24rpc: Add -EPERM processing for xs_udp_send_request()Jason Baron
If an iptables drop rule is added for an nfs server, the client can end up in a softlockup. Because of the way that xs_sendpages() is structured, the -EPERM is ignored since the prior bits of the packet may have been successfully queued and thus xs_sendpages() returns a non-zero value. Then, xs_udp_send_request() thinks that because some bits were queued it should return -EAGAIN. We then try the request again and again, resulting in cpu spinning. Reproducer: 1) open a file on the nfs server '/nfs/foo' (mounted using udp) 2) iptables -A OUTPUT -d <nfs server ip> -j DROP 3) write to /nfs/foo 4) close /nfs/foo 5) iptables -D OUTPUT -d <nfs server ip> -j DROP The softlockup occurs in step 4 above. The previous patch, allows xs_sendpages() to return both a sent count and any error values that may have occurred. Thus, if we get an -EPERM, return that to the higher level code. With this patch in place we can successfully abort the above sequence and avoid the softlockup. I also tried the above test case on an nfs mount on tcp and although the system does not softlockup, I still ended up with the 'hung_task' firing after 120 seconds, due to the i/o being stuck. The tcp case appears a bit harder to fix, since -EPERM appears to get ignored much lower down in the stack and does not propogate up to xs_sendpages(). This case is not quite as insidious as the softlockup and it is not addressed here. Reported-by: Yigong Lou <ylou@akamai.com> Signed-off-by: Jason Baron <jbaron@akamai.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-24rpc: return sent and err from xs_sendpages()Jason Baron
If an error is returned after the first bits of a packet have already been successfully queued, xs_sendpages() will return a positive 'int' value indicating success. Callers seem to treat this as -EAGAIN. However, there are cases where its not a question of waiting for the write queue to drain. For example, when there is an iptables rule dropping packets to the destination, the lower level code can return -EPERM only after parts of the packet have been successfully queued. In this case, we can end up continuously retrying resulting in a kernel softlockup. This patch is intended to make no changes in behavior but is in preparation for subsequent patches that can make decisions based on both on the number of bytes sent by xs_sendpages() and any errors that may have be returned. Signed-off-by: Jason Baron <jbaron@akamai.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-24SUNRPC: Don't wake tasks during connection abortBenjamin Coddington
When aborting a connection to preserve source ports, don't wake the task in xs_error_report. This allows tasks with RPC_TASK_SOFTCONN to succeed if the connection needs to be re-established since it preserves the task's status instead of setting it to the status of the aborting kernel_connect(). This may also avoid a potential conflict on the socket's lock. Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Cc: stable@vger.kernel.org # 3.14+ Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>