summaryrefslogtreecommitdiff
path: root/net/ipv4
AgeCommit message (Collapse)Author
2014-11-11net: Convert LIMIT_NETDEBUG to net_dbg_ratelimitedJoe Perches
Use the more common dynamic_debug capable net_dbg_ratelimited and remove the LIMIT_NETDEBUG macro. All messages are still ratelimited. Some KERN_<LEVEL> uses are changed to KERN_DEBUG. This may have some negative impact on messages that were emitted at KERN_INFO that are not not enabled at all unless DEBUG is defined or dynamic_debug is enabled. Even so, these messages are now _not_ emitted by default. This also eliminates the use of the net_msg_warn sysctl "/proc/sys/net/core/warnings". For backward compatibility, the sysctl is not removed, but it has no function. The extern declaration of net_msg_warn is removed from sock.h and made static in net/core/sysctl_net_core.c Miscellanea: o Update the sysctl documentation o Remove the embedded uses of pr_fmt o Coalesce format fragments o Realign arguments Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-11net: introduce SO_INCOMING_CPUEric Dumazet
Alternative to RPS/RFS is to use hardware support for multiple queues. Then split a set of million of sockets into worker threads, each one using epoll() to manage events on its own socket pool. Ideally, we want one thread per RX/TX queue/cpu, but we have no way to know after accept() or connect() on which queue/cpu a socket is managed. We normally use one cpu per RX queue (IRQ smp_affinity being properly set), so remembering on socket structure which cpu delivered last packet is enough to solve the problem. After accept(), connect(), or even file descriptor passing around processes, applications can use : int cpu; socklen_t len = sizeof(cpu); getsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, &len); And use this information to put the socket into the right silo for optimal performance, as all networking stack should run on the appropriate cpu, without need to send IPI (RPS/RFS). Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-11tcp: move sk_mark_napi_id() at the right placeEric Dumazet
sk_mark_napi_id() is used to record for a flow napi id of incoming packets for busypoll sake. We should do this only on established flows, not on listeners. This was 'working' by virtue of the socket cloning, but doing this on SYN packets in unecessary cache line dirtying. Even if we move sk_napi_id in the same cache line than sk_lock, we are working to make SYN processing lockless, so it is desirable to set sk_napi_id only for established flows. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-10ipv4: Avoid reading user iov twice after raw_probe_proto_optHerbert Xu
Ever since raw_probe_proto_opt was added it had the problem of causing the user iov to be read twice, once during the probe for the protocol header and once again in ip_append_data. This is a potential security problem since it means that whatever we're probing may be invalid. This patch plugs the hole by firstly advancing the iov so we don't read the same spot again, and secondly saving what we read the first time around for use by ip_append_data. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-10ipv4: Use standard iovec primitive in raw_probe_proto_optHerbert Xu
The function raw_probe_proto_opt tries to extract the first two bytes from the user input in order to seed the IPsec lookup for ICMP packets. In doing so it's processing iovec by hand and overcomplicating things. This patch replaces the manual iovec processing with a call to memcpy_fromiovecend. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-07udp: Increment UDP_MIB_IGNOREDMULTI for arriving unmatched multicastsRick Jones
As NIC multicast filtering isn't perfect, and some platforms are quite content to spew broadcasts, we should not trigger an event for skb:kfree_skb when we do not have a match for such an incoming datagram. We do though want to avoid sweeping the matter under the rug entirely, so increment a suitable statistic. This incorporates feedback from David L. Stevens, Karl Neiss and Eric Dumazet. V3 - use bool per David Miller Signed-off-by: Rick Jones <rick.jones2@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-06Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
2014-11-06Merge branch 'net_next_ovs' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/pshelar/openvswitch Pravin B Shelar says: ==================== Open vSwitch First two patches are related to OVS MPLS support. Rest of patches are mostly refactoring and minor improvements to openvswitch. v1-v2: - Fix conflicts due to "gue: Remote checksum offload" ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-06net: esp: Convert NETDEBUG to pr_infoJoe Perches
Commit 64ce207306de ("[NET]: Make NETDEBUG pure printk wrappers") originally had these NETDEBUG printks as always emitting. Commit a2a316fd068c ("[NET]: Replace CONFIG_NET_DEBUG with sysctl") added a net_msg_warn sysctl to these NETDEBUG uses. Convert these NETDEBUG uses to normal pr_info calls. This changes the output prefix from "ESP: " to include "IPSec: " for the ipv4 case and "IPv6: " for the ipv6 case. These output lines are now like the other messages in the files. Other miscellanea: Neaten the arithmetic spacing to be consistent with other arithmetic spacing in the files. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-06net; ipv[46] - Remove 2 unnecessary NETDEBUG OOM messagesJoe Perches
These messages aren't useful as there's a generic dump_stack() on OOM. Neaten the comment and if test above the OOM by separating the assign in if into an allocation then if test. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-05net: Remove MPLS GSO feature.Pravin B Shelar
Device can export MPLS GSO support in dev->mpls_features same way it export vlan features in dev->vlan_features. So it is safe to remove NETIF_F_GSO_MPLS redundant flag. Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
2014-11-05fou: Fix typo in returning flags in netlinkTom Herbert
When filling netlink info, dport is being returned as flags. Fix instances to return correct value. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-05ipv6: mld: fix add_grhead skb_over_panic for devs with large MTUsDaniel Borkmann
It has been reported that generating an MLD listener report on devices with large MTUs (e.g. 9000) and a high number of IPv6 addresses can trigger a skb_over_panic(): skbuff: skb_over_panic: text:ffffffff80612a5d len:3776 put:20 head:ffff88046d751000 data:ffff88046d751010 tail:0xed0 end:0xec0 dev:port1 ------------[ cut here ]------------ kernel BUG at net/core/skbuff.c:100! invalid opcode: 0000 [#1] SMP Modules linked in: ixgbe(O) CPU: 3 PID: 0 Comm: swapper/3 Tainted: G O 3.14.23+ #4 [...] Call Trace: <IRQ> [<ffffffff80578226>] ? skb_put+0x3a/0x3b [<ffffffff80612a5d>] ? add_grhead+0x45/0x8e [<ffffffff80612e3a>] ? add_grec+0x394/0x3d4 [<ffffffff80613222>] ? mld_ifc_timer_expire+0x195/0x20d [<ffffffff8061308d>] ? mld_dad_timer_expire+0x45/0x45 [<ffffffff80255b5d>] ? call_timer_fn.isra.29+0x12/0x68 [<ffffffff80255d16>] ? run_timer_softirq+0x163/0x182 [<ffffffff80250e6f>] ? __do_softirq+0xe0/0x21d [<ffffffff8025112b>] ? irq_exit+0x4e/0xd3 [<ffffffff802214bb>] ? smp_apic_timer_interrupt+0x3b/0x46 [<ffffffff8063f10a>] ? apic_timer_interrupt+0x6a/0x70 mld_newpack() skb allocations are usually requested with dev->mtu in size, since commit 72e09ad107e7 ("ipv6: avoid high order allocations") we have changed the limit in order to be less likely to fail. However, in MLD/IGMP code, we have some rather ugly AVAILABLE(skb) macros, which determine if we may end up doing an skb_put() for adding another record. To avoid possible fragmentation, we check the skb's tailroom as skb->dev->mtu - skb->len, which is a wrong assumption as the actual max allocation size can be much smaller. The IGMP case doesn't have this issue as commit 57e1ab6eaddc ("igmp: refine skb allocations") stores the allocation size in the cb[]. Set a reserved_tailroom to make it fit into the MTU and use skb_availroom() helper instead. This also allows to get rid of igmp_skb_size(). Reported-by: Wei Liu <lw1a2.jing@gmail.com> Fixes: 72e09ad107e7 ("ipv6: avoid high order allocations") Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Cc: David L Stevens <david.stevens@oracle.com> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-05net: Convert SEQ_START_TOKEN/seq_printf to seq_putsJoe Perches
Using a single fixed string is smaller code size than using a format and many string arguments. Reduces overall code size a little. $ size net/ipv4/igmp.o* net/ipv6/mcast.o* net/ipv6/ip6_flowlabel.o* text data bss dec hex filename 34269 7012 14824 56105 db29 net/ipv4/igmp.o.new 34315 7012 14824 56151 db57 net/ipv4/igmp.o.old 30078 7869 13200 51147 c7cb net/ipv6/mcast.o.new 30105 7869 13200 51174 c7e6 net/ipv6/mcast.o.old 11434 3748 8580 23762 5cd2 net/ipv6/ip6_flowlabel.o.new 11491 3748 8580 23819 5d0b net/ipv6/ip6_flowlabel.o.old Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-05tcp: zero retrans_stamp if all retrans were ackedMarcelo Leitner
Ueki Kohei reported that when we are using NewReno with connections that have a very low traffic, we may timeout the connection too early if a second loss occurs after the first one was successfully acked but no data was transfered later. Below is his description of it: When SACK is disabled, and a socket suffers multiple separate TCP retransmissions, that socket's ETIMEDOUT value is calculated from the time of the *first* retransmission instead of the *latest* retransmission. This happens because the tcp_sock's retrans_stamp is set once then never cleared. Take the following connection: Linux remote-machine | | send#1---->(*1)|--------> data#1 --------->| | | | RTO : : | | | ---(*2)|----> data#1(retrans) ---->| | (*3)|<---------- ACK <----------| | | | | : : | : : | : : 16 minutes (or more) : | : : | : : | : : | | | send#2---->(*4)|--------> data#2 --------->| | | | RTO : : | | | ---(*5)|----> data#2(retrans) ---->| | | | | | | RTO*2 : : | | | | | | ETIMEDOUT<----(*6)| | (*1) One data packet sent. (*2) Because no ACK packet is received, the packet is retransmitted. (*3) The ACK packet is received. The transmitted packet is acknowledged. At this point the first "retransmission event" has passed and been recovered from. Any future retransmission is a completely new "event". (*4) After 16 minutes (to correspond with retries2=15), a new data packet is sent. Note: No data is transmitted between (*3) and (*4). The socket's timeout SHOULD be calculated from this point in time, but instead it's calculated from the prior "event" 16 minutes ago. (*5) Because no ACK packet is received, the packet is retransmitted. (*6) At the time of the 2nd retransmission, the socket returns ETIMEDOUT. Therefore, now we clear retrans_stamp as soon as all data during the loss window is fully acked. Reported-by: Ueki Kohei Cc: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com> Acked-by: Neal Cardwell <ncardwell@google.com> Tested-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-05net: Add and use skb_copy_datagram_msg() helper.David S. Miller
This encapsulates all of the skb_copy_datagram_iovec() callers with call argument signature "skb, offset, msghdr->msg_iov, length". When we move to iov_iters in the networking, the iov_iter object will sit in the msghdr. Having a helper like this means there will be less places to touch during that transformation. Based upon descriptions and patch from Al Viro. Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-05gue: Receive side of remote checksum offloadTom Herbert
Add processing of the remote checksum offload option in both the normal path as well as the GRO path. The implements patching the affected checksum to derive the offloaded checksum. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-05gue: TX support for using remote checksum offload optionTom Herbert
Add if_tunnel flag TUNNEL_ENCAP_FLAG_REMCSUM to configure remote checksum offload on an IP tunnel. Add logic in gue_build_header to insert remote checksum offload option. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-05udp: Changes to udp_offload to support remote checksum offloadTom Herbert
Add a new GSO type, SKB_GSO_TUNNEL_REMCSUM, which indicates remote checksum offload being done (in this case inner checksum must not be offloaded to the NIC). Added logic in __skb_udp_tunnel_segment to handle remote checksum offload case. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-05gue: Add infrastructure for flags and optionsTom Herbert
Add functions and basic definitions for processing standard flags, private flags, and control messages. This includes definitions to compute length of optional fields corresponding to a set of flags. Flag validation is in validate_gue_flags function. This checks for unknown flags, and that length of optional fields is <= length in guehdr hlen. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-05udp: Offload outer UDP tunnel csum if availableTom Herbert
In __skb_udp_tunnel_segment if outer UDP checksums are enabled and ip_summed is not already CHECKSUM_PARTIAL, set up checksum offload if device features allow it. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-05net: Move fou_build_header into fou.c and refactorTom Herbert
Move fou_build_header out of ip_tunnel.c and into fou.c splitting it up into fou_build_header, gue_build_header, and fou_build_udp. This allows for other users for TX of FOU or GUE. Change ip_tunnel_encap to call fou_build_header or gue_build_header based on the tunnel encapsulation type. Similarly, added fou_encap_hlen and gue_encap_hlen functions which are called by ip_encap_hlen. New net/fou.h has prototypes and defines for this. Added NET_FOU_IP_TUNNELS configuration. When this is set, IP tunnels can use FOU/GUE and fou module is also selected. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-05geneve: Unregister pernet subsys on module unload.Jesse Gross
The pernet ops aren't ever unregistered, which causes a memory leak and an OOPs if the module is ever reinserted. Fixes: 0b5e8b8eeae4 ("net: Add Geneve tunneling protocol driver") CC: Andy Zhou <azhou@nicira.com> Signed-off-by: Jesse Gross <jesse@nicira.com> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-05geneve: Set GSO type on transmit.Jesse Gross
Geneve does not currently set the inner protocol type when transmitting packets. This causes GSO segmentation to fail on NICs that do not support Geneve offloading. CC: Andy Zhou <azhou@nicira.com> Signed-off-by: Jesse Gross <jesse@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-04udp: remove blank line between set and testFabian Frederick
Suggested-by: Joe Perches <joe@perches.com> Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-04esp4: remove assignment in if conditionFabian Frederick
Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-04net: allow setting ecn via routing tableFlorian Westphal
This patch allows to set ECN on a per-route basis in case the sysctl tcp_ecn is not set to 1. In other words, when ECN is set for specific routes, it provides a tcp_ecn=1 behaviour for that route while the rest of the stack acts according to the global settings. One can use 'ip route change dev $dev $net features ecn' to toggle this. Having a more fine-grained per-route setting can be beneficial for various reasons, for example, 1) within data centers, or 2) local ISPs may deploy ECN support for their own video/streaming services [1], etc. There was a recent measurement study/paper [2] which scanned the Alexa's publicly available top million websites list from a vantage point in US, Europe and Asia: Half of the Alexa list will now happily use ECN (tcp_ecn=2, most likely blamed to commit 255cac91c3 ("tcp: extend ECN sysctl to allow server-side only ECN") ;)); the break in connectivity on-path was found is about 1 in 10,000 cases. Timeouts rather than receiving back RSTs were much more common in the negotiation phase (and mostly seen in the Alexa middle band, ranks around 50k-150k): from 12-thousand hosts on which there _may_ be ECN-linked connection failures, only 79 failed with RST when _not_ failing with RST when ECN is not requested. It's unclear though, how much equipment in the wild actually marks CE when buffers start to fill up. We thought about a fallback to non-ECN for retransmitted SYNs as another global option (which could perhaps one day be made default), but as Eric points out, there's much more work needed to detect broken middleboxes. Two examples Eric mentioned are buggy firewalls that accept only a single SYN per flow, and middleboxes that successfully let an ECN flow establish, but later mark CE for all packets (so cwnd converges to 1). [1] http://www.ietf.org/proceedings/89/slides/slides-89-tsvarea-1.pdf, p.15 [2] http://ecn.ethz.ch/ Joint work with Daniel Borkmann. Reference: http://thread.gmane.org/gmane.linux.network/335797 Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-04syncookies: split cookie_check_timestamp() into two functionsFlorian Westphal
The function cookie_check_timestamp(), both called from IPv4/6 context, is being used to decode the echoed timestamp from the SYN/ACK into TCP options used for follow-up communication with the peer. We can remove ECN handling from that function, split it into a separate one, and simply rename the original function into cookie_decode_options(). cookie_decode_options() just fills in tcp_option struct based on the echoed timestamp received from the peer. Anything that fails in this function will actually discard the request socket. While this is the natural place for decoding options such as ECN which commit 172d69e63c7f ("syncookies: add support for ECN") added, we argue that in particular for ECN handling, it can be checked at a later point in time as the request sock would actually not need to be dropped from this, but just ECN support turned off. Therefore, we split this functionality into cookie_ecn_ok(), which tells us if the timestamp indicates ECN support AND the tcp_ecn sysctl is enabled. This prepares for per-route ECN support: just looking at the tcp_ecn sysctl won't be enough anymore at that point; if the timestamp indicates ECN and sysctl tcp_ecn == 0, we will also need to check the ECN dst metric. This would mean adding a route lookup to cookie_check_timestamp(), which we definitely want to avoid. As we already do a route lookup at a later point in cookie_{v4,v6}_check(), we can simply make use of that as well for the new cookie_ecn_ok() function w/o any additional cost. Joint work with Daniel Borkmann. Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-04syncookies: avoid magic values and document which-bit-is-what-optionFlorian Westphal
Was a bit more difficult to read than needed due to magic shifts; add defines and document the used encoding scheme. Joint work with Daniel Borkmann. Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-04igmp: remove camel case definitionsFabian Frederick
use standard uppercase for definitions Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-04udp: remove else after returnFabian Frederick
else is unnecessary after return 0 in __udp4_lib_rcv() Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-04inet: frags: remove inline on static in c fileFabian Frederick
remove __inline__ / inline and let compiler decide what to do with static functions Inspired-by: "David S. Miller" <davem@davemloft.net> Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-04ipv4: remove 0/NULL assignment on staticFabian Frederick
static values are automatically initialized to 0 Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-04ipv4: use seq_puts instead of seq_printf where possibleFabian Frederick
Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-04tcp: spelling s/plugable/pluggableFabian Frederick
Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-04cipso: remove NULL assignment on staticFabian Frederick
Also add blank line after structure declarations Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-04ipv4: include linux/bug.h instead of asm/bug.hFabian Frederick
Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-04cipso: kerneldoc warning fixFabian Frederick
Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-01Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/phy/marvell.c Simple overlapping changes in drivers/net/phy/marvell.c Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-31Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller
Pablo Neira Ayuso says: ==================== netfilter/ipvs fixes for net The following patchset contains fixes for netfilter/ipvs. This round of fixes is larger than usual at this stage, specifically because of the nf_tables bridge reject fixes that I would like to see in 3.18. The patches are: 1) Fix a null-pointer dereference that may occur when logging errors. This problem was introduced by 4a4739d56b0 ("ipvs: Pull out crosses_local_route_boundary logic") in v3.17-rc5. 2) Update hook mask in nft_reject_bridge so we can also filter out packets from there. This fixes 36d2af5 ("netfilter: nf_tables: allow to filter from prerouting and postrouting"), which needs this chunk to work. 3) Two patches to refactor common code to forge the IPv4 and IPv6 reject packets from the bridge. These are required by the nf_tables reject bridge fix. 4) Fix nft_reject_bridge by avoiding the use of the IP stack to reject packets from the bridge. The idea is to forge the reject packets and inject them to the original port via br_deliver() which is now exported for that purpose. 5) Restrict nft_reject_bridge to bridge prerouting and input hooks. the original skbuff may cloned after prerouting when the bridge stack needs to flood it to several bridge ports, it is too late to reject the traffic. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-31netfilter: nf_reject_ipv4: split nf_send_reset() in smaller functionsPablo Neira Ayuso
That can be reused by the reject bridge expression to build the reject packet. The new functions are: * nf_reject_ip_tcphdr_get(): to sanitize and to obtain the TCP header. * nf_reject_iphdr_put(): to build the IPv4 header. * nf_reject_ip_tcphdr_put(): to build the TCP header. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-10-30net: skb_fclone_busy() needs to detect orphaned skbEric Dumazet
Some drivers are unable to perform TX completions in a bound time. They instead call skb_orphan() Problem is skb_fclone_busy() has to detect this case, otherwise we block TCP retransmits and can freeze unlucky tcp sessions on mostly idle hosts. Signed-off-by: Eric Dumazet <edumazet@google.com> Fixes: 1f3279ae0c13 ("tcp: avoid retransmits of TCP packets hanging in host queues") Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-30tcp: Correction to RFC number in commentSowmini Varadhan
Challenge ACK is described in RFC 5961, fix typo. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-30gre: Use inner mac length when computing tunnel lengthTom Herbert
Currently, skb_inner_network_header is used but this does not account for Ethernet header for ETH_P_TEB. Use skb_inner_mac_header which handles TEB and also should work with IP encapsulation in which case inner mac and inner network headers are the same. Tested: Ran TCP_STREAM over GRE, worked as expected. Signed-off-by: Tom Herbert <therbert@google.com> Acked-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-30ipv4: Do not cache routing failures due to disabled forwarding.Nicolas Cavallari
If we cache them, the kernel will reuse them, independently of whether forwarding is enabled or not. Which means that if forwarding is disabled on the input interface where the first routing request comes from, then that unreachable result will be cached and reused for other interfaces, even if forwarding is enabled on them. The opposite is also true. This can be verified with two interfaces A and B and an output interface C, where B has forwarding enabled, but not A and trying ip route get $dst iif A from $src && ip route get $dst iif B from $src Signed-off-by: Nicolas Cavallari <nicolas.cavallari@green-communications.fr> Reviewed-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-30syncookies: only increment SYNCOOKIESFAILED on validation errorFlorian Westphal
Only count packets that failed cookie-authentication. We can get SYNCOOKIESFAILED > 0 while we never even sent a single cookie. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-30ipv4: minor spelling fixesstephen hemminger
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-29inet: frags: remove the WARN_ON from inet_evict_bucketNikolay Aleksandrov
The WARN_ON in inet_evict_bucket can be triggered by a valid case: inet_frag_kill and inet_evict_bucket can be running in parallel on the same queue which means that there has been at least one more ref added by a previous inet_frag_find call, but inet_frag_kill can delete the timer before inet_evict_bucket which will cause the WARN_ON() there to trigger since we'll have refcnt!=1. Now, this case is valid because the queue is being "killed" for some reason (removed from the chain list and its timer deleted) so it will get destroyed in the end by one of the inet_frag_put() calls which reaches 0 i.e. refcnt is still valid. CC: Florian Westphal <fw@strlen.de> CC: Eric Dumazet <eric.dumazet@gmail.com> CC: Patrick McLean <chutzpah@gentoo.org> Fixes: b13d3cbfb8e8 ("inet: frag: move eviction of queues to work queue") Reported-by: Patrick McLean <chutzpah@gentoo.org> Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-29inet: frags: fix a race between inet_evict_bucket and inet_frag_killNikolay Aleksandrov
When the evictor is running it adds some chosen frags to a local list to be evicted once the chain lock has been released but at the same time the *frag_queue can be running for some of the same queues and it may call inet_frag_kill which will wait on the chain lock and will then delete the queue from the wrong list since it was added in the eviction one. The fix is simple - check if the queue has the evict flag set under the chain lock before deleting it, this is safe because the evict flag is set only under that lock and having the flag set also means that the queue has been detached from the chain list, so no need to delete it again. An important note to make is that we're safe w.r.t refcnt because inet_frag_kill and inet_evict_bucket will sync on the del_timer operation where only one of the two can succeed (or if the timer is executing - none of them), the cases are: 1. inet_frag_kill succeeds in del_timer - then the timer ref is removed, but inet_evict_bucket will not add this queue to its expire list but will restart eviction in that chain 2. inet_evict_bucket succeeds in del_timer - then the timer ref is kept until the evictor "expires" the queue, but inet_frag_kill will remove the initial ref and will set INET_FRAG_COMPLETE which will make the frag_expire fn just to remove its ref. In the end all of the queue users will do an inet_frag_put and the one that reaches 0 will free it. The refcount balance should be okay. CC: Florian Westphal <fw@strlen.de> CC: Eric Dumazet <eric.dumazet@gmail.com> CC: Patrick McLean <chutzpah@gentoo.org> Fixes: b13d3cbfb8e8 ("inet: frag: move eviction of queues to work queue") Suggested-by: Eric Dumazet <eric.dumazet@gmail.com> Reported-by: Patrick McLean <chutzpah@gentoo.org> Tested-by: Patrick McLean <chutzpah@gentoo.org> Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com> Reviewed-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-29tcp: allow for bigger reordering levelEric Dumazet
While testing upcoming Yaogong patch (converting out of order queue into an RB tree), I hit the max reordering level of linux TCP stack. Reordering level was limited to 127 for no good reason, and some network setups [1] can easily reach this limit and get limited throughput. Allow a new max limit of 300, and add a sysctl to allow admins to even allow bigger (or lower) values if needed. [1] Aggregation of links, per packet load balancing, fabrics not doing deep packet inspections, alternative TCP congestion modules... Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Yaogong Wang <wygivan@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>