summaryrefslogtreecommitdiff
path: root/net/ipv6
AgeCommit message (Collapse)Author
2017-09-06Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-nextLinus Torvalds
Pull networking updates from David Miller: 1) Support ipv6 checksum offload in sunvnet driver, from Shannon Nelson. 2) Move to RB-tree instead of custom AVL code in inetpeer, from Eric Dumazet. 3) Allow generic XDP to work on virtual devices, from John Fastabend. 4) Add bpf device maps and XDP_REDIRECT, which can be used to build arbitrary switching frameworks using XDP. From John Fastabend. 5) Remove UFO offloads from the tree, gave us little other than bugs. 6) Remove the IPSEC flow cache, from Florian Westphal. 7) Support ipv6 route offload in mlxsw driver. 8) Support VF representors in bnxt_en, from Sathya Perla. 9) Add support for forward error correction modes to ethtool, from Vidya Sagar Ravipati. 10) Add time filter for packet scheduler action dumping, from Jamal Hadi Salim. 11) Extend the zerocopy sendmsg() used by virtio and tap to regular sockets via MSG_ZEROCOPY. From Willem de Bruijn. 12) Significantly rework value tracking in the BPF verifier, from Edward Cree. 13) Add new jump instructions to eBPF, from Daniel Borkmann. 14) Rework rtnetlink plumbing so that operations can be run without taking the RTNL semaphore. From Florian Westphal. 15) Support XDP in tap driver, from Jason Wang. 16) Add 32-bit eBPF JIT for ARM, from Shubham Bansal. 17) Add Huawei hinic ethernet driver. 18) Allow to report MD5 keys in TCP inet_diag dumps, from Ivan Delalande. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1780 commits) i40e: point wb_desc at the nvm_wb_desc during i40e_read_nvm_aq i40e: avoid NVM acquire deadlock during NVM update drivers: net: xgene: Remove return statement from void function drivers: net: xgene: Configure tx/rx delay for ACPI drivers: net: xgene: Read tx/rx delay for ACPI rocker: fix kcalloc parameter order rds: Fix non-atomic operation on shared flag variable net: sched: don't use GFP_KERNEL under spin lock vhost_net: correctly check tx avail during rx busy polling net: mdio-mux: add mdio_mux parameter to mdio_mux_init() rxrpc: Make service connection lookup always check for retry net: stmmac: Delete dead code for MDIO registration gianfar: Fix Tx flow control deactivation cxgb4: Ignore MPS_TX_INT_CAUSE[Bubble] for T6 cxgb4: Fix pause frame count in t4_get_port_stats cxgb4: fix memory leak tun: rename generic_xdp to skb_xdp tun: reserve extra headroom only when XDP is set net: dsa: bcm_sf2: Configure IMP port TC2QOS mapping net: dsa: bcm_sf2: Advertise number of egress queues ...
2017-09-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
2017-09-04net: Remove CONFIG_NETFILTER_DEBUG and _ASSERT() macros.Varsha Rao
This patch removes CONFIG_NETFILTER_DEBUG and _ASSERT() macros as they are no longer required. Replace _ASSERT() macros with WARN_ON(). Signed-off-by: Varsha Rao <rvarsha016@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-09-04net: Replace NF_CT_ASSERT() with WARN_ON().Varsha Rao
This patch removes NF_CT_ASSERT() and instead uses WARN_ON(). Signed-off-by: Varsha Rao <rvarsha016@gmail.com>
2017-09-04netfilter: remove unused hooknum arg from packet functionsFlorian Westphal
tested with allmodconfig build. Signed-off-by: Florian Westphal <fw@strlen.de>
2017-09-04Merge branch 'linus' into locking/core, to fix up conflictsIngo Molnar
Conflicts: mm/page_alloc.c Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-09-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for your net-next tree. Basically, updates to the conntrack core, enhancements for nf_tables, conversion of netfilter hooks from linked list to array to improve memory locality and asorted improvements for the Netfilter codebase. More specifically, they are: 1) Add expection to hashes after timer initialization to prevent access from another CPU that walks on the hashes and calls del_timer(), from Florian Westphal. 2) Don't update nf_tables chain counters from hot path, this is only used by the x_tables compatibility layer. 3) Get rid of nested rcu_read_lock() calls from netfilter hook path. Hooks are always guaranteed to run from rcu read side, so remove nested rcu_read_lock() where possible. Patch from Taehee Yoo. 4) nf_tables new ruleset generation notifications include PID and name of the process that has updated the ruleset, from Phil Sutter. 5) Use skb_header_pointer() from nft_fib, so we can reuse this code from the nf_family netdev family. Patch from Pablo M. Bermudo. 6) Add support for nft_fib in nf_tables netdev family, also from Pablo. 7) Use deferrable workqueue for conntrack garbage collection, to reduce power consumption, from Patch from Subash Abhinov Kasiviswanathan. 8) Add nf_ct_expect_iterate_net() helper and use it. From Florian Westphal. 9) Call nf_ct_unconfirmed_destroy only from cttimeout, from Florian. 10) Drop references on conntrack removal path when skbuffs has escaped via nfqueue, from Florian. 11) Don't queue packets to nfqueue with dying conntrack, from Florian. 12) Constify nf_hook_ops structure, from Florian. 13) Remove neededlessly branch in nf_tables trace code, from Phil Sutter. 14) Add nla_strdup(), from Phil Sutter. 15) Rise nf_tables objects name size up to 255 chars, people want to use DNS names, so increase this according to what RFC 1035 specifies. Patch series from Phil Sutter. 16) Kill nf_conntrack_default_on, it's broken. Default on conntrack hook registration on demand, suggested by Eric Dumazet, patch from Florian. 17) Remove unused variables in compat_copy_entry_from_user both in ip_tables and arp_tables code. Patch from Taehee Yoo. 18) Constify struct nf_conntrack_l4proto, from Julia Lawall. 19) Constify nf_loginfo structure, also from Julia. 20) Use a single rb root in connlimit, from Taehee Yoo. 21) Remove unused netfilter_queue_init() prototype, from Taehee Yoo. 22) Use audit_log() instead of open-coding it, from Geliang Tang. 23) Allow to mangle tcp options via nft_exthdr, from Florian. 24) Allow to fetch TCP MSS from nft_rt, from Florian. This includes a fix for a miscalculation of the minimal length. 25) Simplify branch logic in h323 helper, from Nick Desaulniers. 26) Calculate netlink attribute size for conntrack tuple at compile time, from Florian. 27) Remove protocol name field from nf_conntrack_{l3,l4}proto structure. From Florian. 28) Remove holes in nf_conntrack_l4proto structure, so it becomes smaller. From Florian. 29) Get rid of print_tuple() indirection for /proc conntrack listing. Place all the code in net/netfilter/nf_conntrack_standalone.c. Patch from Florian. 30) Do not built in print_conntrack() if CONFIG_NF_CONNTRACK_PROCFS is off. From Florian. 31) Constify most nf_conntrack_{l3,l4}proto helper functions, from Florian. 32) Fix broken indentation in ebtables extensions, from Colin Ian King. 33) Fix several harmless sparse warning, from Florian. 34) Convert netfilter hook infrastructure to use array for better memory locality, joint work done by Florian and Aaron Conole. Moreover, add some instrumentation to debug this. 35) Batch nf_unregister_net_hooks() calls, to call synchronize_net once per batch, from Florian. 36) Get rid of noisy logging in ICMPv6 conntrack helper, from Florian. 37) Get rid of obsolete NFDEBUG() instrumentation, from Varsha Rao. 38) Remove unused code in the generic protocol tracker, from Davide Caratti. I think I will have material for a second Netfilter batch in my queue if time allow to make it fit in this merge window. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-03Revert "net: fix percpu memory leaks"Jesper Dangaard Brouer
This reverts commit 1d6119baf0610f813eb9d9580eb4fd16de5b4ceb. After reverting commit 6d7b857d541e ("net: use lib/percpu_counter API for fragmentation mem accounting") then here is no need for this fix-up patch. As percpu_counter is no longer used, it cannot memory leak it any-longer. Fixes: 6d7b857d541e ("net: use lib/percpu_counter API for fragmentation mem accounting") Fixes: 1d6119baf061 ("net: fix percpu memory leaks") Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-01net: Add module reference to FIB notifiersIdo Schimmel
When a listener registers to the FIB notification chain it receives a dump of the FIB entries and rules from existing address families by invoking their dump operations. While we call into these modules we need to make sure they aren't removed. Do that by increasing their reference count before invoking their dump operations and decrease it afterwards. Fixes: 04b1d4e50e82 ("net: core: Make the FIB notification chain generic") Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-01ipv6: sr: Use ARRAY_SIZE macroThomas Meyer
Grepping for "sizeof\(.+\) / sizeof\(" found this as one of the first candidates. Maybe a coccinelle can catch all of those. Signed-off-by: Thomas Meyer <thomas@m3y3r.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-01Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Three cases of simple overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-31xfrm: Add support for network devices capable of removing the ESP trailerYossi Kuperman
In conjunction with crypto offload [1], removing the ESP trailer by hardware can potentially improve the performance by avoiding (1) a cache miss incurred by reading the nexthdr field and (2) the necessity to calculate the csum value of the trailer in order to keep skb->csum valid. This patch introduces the changes to the xfrm stack and merely serves as an infrastructure. Subsequent patch to mlx5 driver will put this to a good use. [1] https://www.mail-archive.com/netdev@vger.kernel.org/msg175733.html Signed-off-by: Yossi Kuperman <yossiku@mellanox.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2017-08-30ipv6: sr: fix get_srh() to comply with IPv6 standard "RFC 8200"Ahmed Abdelsalam
IPv6 packet may carry more than one extension header, and IPv6 nodes must accept and attempt to process extension headers in any order and occurring any number of times in the same packet. Hence, there should be no assumption that Segment Routing extension header is to appear immediately after the IPv6 header. Moreover, section 4.1 of RFC 8200 gives a recommendation on the order of appearance of those extension headers within an IPv6 packet. According to this recommendation, Segment Routing extension header should appear after Hop-by-Hop and Destination Options headers (if they present). This patch fixes the get_srh(), so it gets the segment routing header regardless of its position in the chain of the extension headers in IPv6 packet, and makes sure that the IPv6 routing extension header is of Type 4. Signed-off-by: Ahmed Abdelsalam <amsalam20@gmail.com> Acked-by: David Lebrun <david.lebrun@uclouvain.be> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-29neigh: increase queue_len_bytes to match wmem_defaultEric Dumazet
Florian reported UDP xmit drops that could be root caused to the too small neigh limit. Current limit is 64 KB, meaning that even a single UDP socket would hit it, since its default sk_sndbuf comes from net.core.wmem_default (~212992 bytes on 64bit arches). Once ARP/ND resolution is in progress, we should allow a little more packets to be queued, at least for one producer. Once neigh arp_queue is filled, a rogue socket should hit its sk_sndbuf limit and either block in sendmsg() or return -EAGAIN. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-29ipv6: Use rt6i_idev index for echo replies to a local addressDavid Ahern
Tariq repored local pings to linklocal address is failing: $ ifconfig ens8 ens8: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500 inet 11.141.16.6 netmask 255.255.0.0 broadcast 11.141.255.255 inet6 fe80::7efe:90ff:fecb:7502 prefixlen 64 scopeid 0x20<link> ether 7c:fe:90:cb:75:02 txqueuelen 1000 (Ethernet) RX packets 12 bytes 1164 (1.1 KiB) RX errors 0 dropped 0 overruns 0 frame 0 TX packets 30 bytes 2484 (2.4 KiB) TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 $ /bin/ping6 -c 3 fe80::7efe:90ff:fecb:7502%ens8 PING fe80::7efe:90ff:fecb:7502%ens8(fe80::7efe:90ff:fecb:7502) 56 data bytes Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-29ipv6: do not set sk_destruct in IPV6_ADDRFORM sockoptXin Long
ChunYu found a kernel warn_on during syzkaller fuzzing: [40226.038539] WARNING: CPU: 5 PID: 23720 at net/ipv4/af_inet.c:152 inet_sock_destruct+0x78d/0x9a0 [40226.144849] Call Trace: [40226.147590] <IRQ> [40226.149859] dump_stack+0xe2/0x186 [40226.176546] __warn+0x1a4/0x1e0 [40226.180066] warn_slowpath_null+0x31/0x40 [40226.184555] inet_sock_destruct+0x78d/0x9a0 [40226.246355] __sk_destruct+0xfa/0x8c0 [40226.290612] rcu_process_callbacks+0xaa0/0x18a0 [40226.336816] __do_softirq+0x241/0x75e [40226.367758] irq_exit+0x1f6/0x220 [40226.371458] smp_apic_timer_interrupt+0x7b/0xa0 [40226.376507] apic_timer_interrupt+0x93/0xa0 The warn_on happned when sk->sk_rmem_alloc wasn't 0 in inet_sock_destruct. As after commit f970bd9e3a06 ("udp: implement memory accounting helpers"), udp has changed to use udp_destruct_sock as sk_destruct where it would udp_rmem_release all rmem. But IPV6_ADDRFORM sockopt sets sk_destruct with inet_sock_destruct after changing family to PF_INET. If rmem is not 0 at that time, and there is no place to release rmem before calling inet_sock_destruct, the warn_on will be triggered. This patch is to fix it by not setting sk_destruct in IPV6_ADDRFORM sockopt any more. As IPV6_ADDRFORM sockopt only works for tcp and udp. TCP sock has already set it's sk_destruct with inet_sock_destruct and UDP has set with udp_destruct_sock since they're created. Fixes: f970bd9e3a06 ("udp: implement memory accounting helpers") Reported-by: ChunYu Wang <chunwang@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-29addrlabel: add/delete/get can run without rtnlFlorian Westphal
There appears to be no need to use rtnl, addrlabel entries are refcounted and add/delete is serialized by the addrlabel table spinlock. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-29Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2017-08-29 1) Fix dst_entry refcount imbalance when using socket policies. From Lorenzo Colitti. 2) Fix locking when adding the ESP trailers. 3) Fix tailroom calculation for the ESP trailer by using skb_tailroom instead of skb_availroom. 4) Fix some info leaks in xfrm_user. From Mathias Krause. Please pull or let me know if there are problems. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-28ipv6: set dst.obsolete when a cached route has expiredXin Long
Now it doesn't check for the cached route expiration in ipv6's dst_ops->check(), because it trusts dst_gc that would clean the cached route up when it's expired. The problem is in dst_gc, it would clean the cached route only when it's refcount is 1. If some other module (like xfrm) keeps holding it and the module only release it when dst_ops->check() fails. But without checking for the cached route expiration, .check() may always return true. Meanwhile, without releasing the cached route, dst_gc couldn't del it. It will cause this cached route never to expire. This patch is to set dst.obsolete with DST_OBSOLETE_KILL in .gc when it's expired, and check obsolete != DST_OBSOLETE_FORCE_CHK in .check. Note that this is even needed when ipv6 dst_gc timer is removed one day. It would set dst.obsolete in .redirect and .update_pmtu instead, and check for cached route expiration when getting it, just like what ipv4 route does. Reported-by: Jianlin Shi <jishi@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-28ipv6: fix sparse warning on rt6i_nodeWei Wang
Commit c5cff8561d2d adds rcu grace period before freeing fib6_node. This generates a new sparse warning on rt->rt6i_node related code: net/ipv6/route.c:1394:30: error: incompatible types in comparison expression (different address spaces) ./include/net/ip6_fib.h:187:14: error: incompatible types in comparison expression (different address spaces) This commit adds "__rcu" tag for rt6i_node and makes sure corresponding rcu API is used for it. After this fix, sparse no longer generates the above warning. Fixes: c5cff8561d2d ("ipv6: add rcu grace period before freeing fib6_node") Signed-off-by: Wei Wang <weiwan@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-28net: Add comment that early_demux can change via sysctlDavid Ahern
Twice patches trying to constify inet{6}_protocol have been reverted: 39294c3df2a8 ("Revert "ipv6: constify inet6_protocol structures"") to revert 3a3a4e3054137 and then 03157937fe0b5 ("Revert "ipv4: make net_protocol const"") to revert aa8db499ea67. Add a comment that the structures can not be const because the early_demux field can change based on a sysctl. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-28netfilter: conntrack: don't log "invalid" icmpv6 connectionsFlorian Westphal
When enabling logging for invalid connections we currently also log most icmpv6 types, which we don't track intentionally (e.g. neigh discovery). "invalid" should really mean "invalid", i.e. short header or bad checksum. We don't do any logging for icmp(v4) either, its just useless noise. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-08-25udp6: set rx_dst_cookie on rx_dst updatesPaolo Abeni
Currently, in the udp6 code, the dst cookie is not initialized/updated concurrently with the RX dst used by early demux. As a result, the dst_check() in the early_demux path always fails, the rx dst cache is always invalidated, and we can't really leverage significant gain from the demux lookup. Fix it adding udp6 specific variant of sk_rx_dst_set() and use it to set the dst cookie when the dst entry is really changed. The issue is there since the introduction of early demux for ipv6. Fixes: 5425077d73e0 ("net: ipv6: Add early demux handler for UDP unicast") Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-25ipv6: sr: implement additional seg6local actionsDavid Lebrun
This patch implements the following seg6local actions. - SEG6_LOCAL_ACTION_END_T: regular SRH processing and forward to the next-hop looked up in the specified routing table. - SEG6_LOCAL_ACTION_END_DX2: decapsulate an L2 frame and forward it to the specified network interface. - SEG6_LOCAL_ACTION_END_DX4: decapsulate an IPv4 packet and forward it, possibly to the specified next-hop. - SEG6_LOCAL_ACTION_END_DT6: decapsulate an IPv6 packet and forward it to the next-hop looked up in the specified routing table. Signed-off-by: David Lebrun <david.lebrun@uclouvain.be> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-25ipv6: sr: add helper functions for seg6localDavid Lebrun
This patch adds three helper functions to be used with the seg6local packet processing actions. The decap_and_validate() function will be used by the End.D* actions, that decapsulate an SR-enabled packet. The advance_nextseg() function applies the fundamental operations to update an SRH for the next segment. The lookup_nexthop() function helps select the next-hop for the processed SR packets. It supports an optional next-hop address to route the packet specifically through it, and an optional routing table to use. Signed-off-by: David Lebrun <david.lebrun@uclouvain.be> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-25ipv6: sr: enforce IPv6 packets for seg6local lwtDavid Lebrun
This patch ensures that the seg6local lightweight tunnel is used solely with IPv6 routes and processes only IPv6 packets. Signed-off-by: David Lebrun <david.lebrun@uclouvain.be> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-25ipv6: sr: add support for encapsulation of L2 framesDavid Lebrun
This patch implements the L2 frame encapsulation mechanism, referred to as T.Encaps.L2 in the SRv6 specifications [1]. A new type of SRv6 tunnel mode is added (SEG6_IPTUN_MODE_L2ENCAP). It only accepts packets with an existing MAC header (i.e., it will not work for locally generated packets). The resulting packet looks like IPv6 -> SRH -> Ethernet -> original L3 payload. The next header field of the SRH is set to NEXTHDR_NONE. [1] https://tools.ietf.org/html/draft-filsfils-spring-srv6-network-programming-01 Signed-off-by: David Lebrun <david.lebrun@uclouvain.be> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-25ipv6: sr: add support for ip4ip6 encapsulationDavid Lebrun
This patch enables the SRv6 encapsulation mode to carry an IPv4 payload. All the infrastructure was already present, I just had to add a parameter to seg6_do_srh_encap() to specify the inner packet protocol, and perform some additional checks. Usage example: ip route add 1.2.3.4 encap seg6 mode encap segs fc00::1,fc00::2 dev eth0 Signed-off-by: David Lebrun <david.lebrun@uclouvain.be> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-25ipv6: Fix may be used uninitialized warning in rt6_checkSteffen Klassert
rt_cookie might be used uninitialized, fix this by initializing it. Fixes: c5cff8561d2d ("ipv6: add rcu grace period before freeing fib6_node") Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-25Merge branch 'linus' into locking/core, to pick up fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-25esp: Fix skb tailroom calculationSteffen Klassert
We use skb_availroom to calculate the skb tailroom for the ESP trailer. skb_availroom calculates the tailroom and subtracts this value by reserved_tailroom. However reserved_tailroom is a union with the skb mark. This means that we subtract the tailroom by the skb mark if set. Fix this by using skb_tailroom instead. Fixes: cac2661c53f3 ("esp4: Avoid skb_cow_data whenever possible") Fixes: 03e2a30f6a27 ("esp6: Avoid skb_cow_data whenever possible") Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2017-08-25esp: Fix locking on page fragment allocationSteffen Klassert
We allocate the page fragment for the ESP trailer inside a spinlock, but consume it outside of the lock. This is racy as some other cou could get the same page fragment then. Fix this by consuming the page fragment inside the lock too. Fixes: cac2661c53f3 ("esp4: Avoid skb_cow_data whenever possible") Fixes: 03e2a30f6a27 ("esp6: Avoid skb_cow_data whenever possible") Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2017-08-24ipv6: Use multipath hash from flow info if availableJakub Sitnicki
Allow our callers to influence the choice of ECMP link by honoring the hash passed together with the flow info. This allows for special treatment of ICMP errors which we would like to route over the same path as the IPv6 datagram that triggered the error. Also go through rt6_multipath_hash(), in the usual case when we aren't dealing with an ICMP error, so that there is one central place where multipath hash is computed. Signed-off-by: Jakub Sitnicki <jkbs@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-24ipv6: Fold rt6_info_hash_nhsfn() into its only callerJakub Sitnicki
Commit 644d0e656958 ("ipv6 Use get_hash_from_flowi6 for rt6 hash") has turned rt6_info_hash_nhsfn() into a one-liner, so it no longer makes sense to keep it around. Also remove the accompanying comment that has become outdated. Signed-off-by: Jakub Sitnicki <jkbs@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-24ipv6: Compute multipath hash for ICMP errors from offending packetJakub Sitnicki
When forwarding or sending out an ICMPv6 error, look at the embedded packet that triggered the error and compute a flow hash over its headers. This let's us route the ICMP error together with the flow it belongs to when multipath (ECMP) routing is in use, which in turn makes Path MTU Discovery work in ECMP load-balanced or anycast setups (RFC 7690). Granted, end-hosts behind the ECMP router (aka servers) need to reflect the IPv6 Flow Label for PMTUD to work. The code is organized to be in parallel with ipv4 stack: ip_multipath_l3_keys -> ip6_multipath_l3_keys fib_multipath_hash -> rt6_multipath_hash Signed-off-by: Jakub Sitnicki <jkbs@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-24ipv6: Add sysctl for per namespace flow label reflectionJakub Sitnicki
Reflecting IPv6 Flow Label at server nodes is useful in environments that employ multipath routing to load balance the requests. As "IPv6 Flow Label Reflection" standard draft [1] points out - ICMPv6 PTB error messages generated in response to a downstream packets from the server can be routed by a load balancer back to the original server without looking at transport headers, if the server applies the flow label reflection. This enables the Path MTU Discovery past the ECMP router in load-balance or anycast environments where each server node is reachable by only one path. Introduce a sysctl to enable flow label reflection per net namespace for all newly created sockets. Same could be earlier achieved only per socket by setting the IPV6_FL_F_REFLECT flag for the IPV6_FLOWLABEL_MGR socket option. [1] https://tools.ietf.org/html/draft-wang-6man-flow-label-reflection-01 Signed-off-by: Jakub Sitnicki <jkbs@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-24netfilter: conntrack: place print_tuple in procfs partFlorian Westphal
CONFIG_NF_CONNTRACK_PROCFS is deprecated, no need to use a function pointer in the trackers for this. Place the printf formatting in the one place that uses it. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-08-24netfilter: conntrack: remove protocol name from l4proto structFlorian Westphal
no need to waste storage for something that is only needed in one place and can be deduced from protocol number. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-08-24netfilter: conntrack: remove protocol name from l3proto structFlorian Westphal
no need to waste storage for something that is only needed in one place and can be deduced from protocol number. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-08-24netfilter: conntrack: compute l3proto nla size at compile timeFlorian Westphal
avoids a pointer and allows struct to be const later on. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-08-23tcp: Extend SOF_TIMESTAMPING_RX_SOFTWARE to TCP recvmsgMike Maloney
When SOF_TIMESTAMPING_RX_SOFTWARE is enabled for tcp sockets, return the timestamp corresponding to the highest sequence number data returned. Previously the skb->tstamp is overwritten when a TCP packet is placed in the out of order queue. While the packet is in the ooo queue, save the timestamp in the TCB_SKB_CB. This space is shared with the gso_* options which are only used on the tx path, and a previously unused 4 byte hole. When skbs are coalesced either in the sk_receive_queue or the out_of_order_queue always choose the timestamp of the appended skb to maintain the invariant of returning the timestamp of the last byte in the recvmsg buffer. Signed-off-by: Mike Maloney <maloney@google.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-22ipv6: add rcu grace period before freeing fib6_nodeWei Wang
We currently keep rt->rt6i_node pointing to the fib6_node for the route. And some functions make use of this pointer to dereference the fib6_node from rt structure, e.g. rt6_check(). However, as there is neither refcount nor rcu taken when dereferencing rt->rt6i_node, it could potentially cause crashes as rt->rt6i_node could be set to NULL by other CPUs when doing a route deletion. This patch introduces an rcu grace period before freeing fib6_node and makes sure the functions that dereference it takes rcu_read_lock(). Note: there is no "Fixes" tag because this bug was there in a very early stage. Signed-off-by: Wei Wang <weiwan@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-22Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2017-08-21 1) Fix memleaks when ESP takes an error path. 2) Fix null pointer dereference when creating a sub policy that matches the same outer flow as main policy does. From Koichiro Den. 3) Fix possible out-of-bound access in xfrm_migrate. This patch should go to the stable trees too. From Vladis Dronov. 4) ESP can return positive and negative error values, so treat both cases as an error. Please pull or let me know if there are problems. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-22ipv6: accept 64k - 1 packet length in ip6_find_1stfragopt()Stefano Brivio
A packet length of exactly IPV6_MAXPLEN is allowed, we should refuse parsing options only if the size is 64KiB or more. While at it, remove one extra variable and one assignment which were also introduced by the commit that introduced the size check. Checking the sum 'offset + len' and only later adding 'len' to 'offset' doesn't provide any advantage over directly summing to 'offset' and checking it. Fixes: 6399f1fae4ec ("ipv6: avoid overflow of offset in ip6_find_1stfragopt") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
2017-08-21net: ipv6: put host and anycast routes on device with addressDavid Ahern
One nagging difference between ipv4 and ipv6 is host routes for ipv6 addresses are installed using the loopback device or VRF / L3 Master device. e.g., 2001:db8:1::/120 dev veth0 proto kernel metric 256 pref medium local 2001:db8:1::1 dev lo table local proto kernel metric 0 pref medium Using the loopback device is convenient -- necessary for local tx, but has some nasty side effects, most notably setting the 'lo' device down causes all host routes for all local IPv6 address to be removed from the FIB and completely breaks IPv6 networking across all interfaces. This patch puts FIB entries for IPv6 routes against the device. This simplifies the routes in the FIB, for example by making dst->dev and rt6i_idev->dev the same (a future patch can look at removing the device reference taken for rt6i_idev for FIB entries). When copies are made on FIB lookups, the cloned route has dst->dev set to loopback (or the L3 master device). This is needed for the local Tx of packets to local addresses. With fib entries allocated against the real network device, the addrconf code that reinserts host routes on admin up of 'lo' is no longer needed. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-21Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Steffen Klassert says: ==================== pull request (net-next): ipsec-next 2017-08-21 1) Support RX checksum with IPsec crypto offload for esp4/esp6. From Ilan Tayari. 2) Fixup IPv6 checksums when doing IPsec crypto offload. From Yossi Kuperman. 3) Auto load the xfrom offload modules if a user installs a SA that requests IPsec offload. From Ilan Tayari. 4) Clear RX offload informations in xfrm_input to not confuse the TX path with stale offload informations. From Ilan Tayari. 5) Allow IPsec GSO for local sockets if the crypto operation will be offloaded. 6) Support setting of an output mark to the xfrm_state. This mark can be used to to do the tunnel route lookup. From Lorenzo Colitti. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-20ipv6: repair fib6 tree in failure caseWei Wang
In fib6_add(), it is possible that fib6_add_1() picks an intermediate node and sets the node's fn->leaf to NULL in order to add this new route. However, if fib6_add_rt2node() fails to add the new route for some reason, fn->leaf will be left as NULL and could potentially cause crash when fn->leaf is accessed in fib6_locate(). This patch makes sure fib6_repair_tree() is called to properly repair fn->leaf in the above failure case. Here is the syzkaller reported general protection fault in fib6_locate: kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] SMP KASAN Modules linked in: CPU: 0 PID: 40937 Comm: syz-executor3 Not tainted Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 task: ffff8801d7d64100 ti: ffff8801d01a0000 task.ti: ffff8801d01a0000 RIP: 0010:[<ffffffff82a3e0e1>] [<ffffffff82a3e0e1>] __ipv6_prefix_equal64_half include/net/ipv6.h:475 [inline] RIP: 0010:[<ffffffff82a3e0e1>] [<ffffffff82a3e0e1>] ipv6_prefix_equal include/net/ipv6.h:492 [inline] RIP: 0010:[<ffffffff82a3e0e1>] [<ffffffff82a3e0e1>] fib6_locate_1 net/ipv6/ip6_fib.c:1210 [inline] RIP: 0010:[<ffffffff82a3e0e1>] [<ffffffff82a3e0e1>] fib6_locate+0x281/0x3c0 net/ipv6/ip6_fib.c:1233 RSP: 0018:ffff8801d01a36a8 EFLAGS: 00010202 RAX: 0000000000000020 RBX: ffff8801bc790e00 RCX: ffffc90002983000 RDX: 0000000000001219 RSI: ffff8801d01a37a0 RDI: 0000000000000100 RBP: ffff8801d01a36f0 R08: 00000000000000ff R09: 0000000000000000 R10: 0000000000000003 R11: 0000000000000000 R12: 0000000000000001 R13: dffffc0000000000 R14: ffff8801d01a37a0 R15: 0000000000000000 FS: 00007f6afd68c700(0000) GS:ffff8801db400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000004c6340 CR3: 00000000ba41f000 CR4: 00000000001426f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Stack: ffff8801d01a37a8 ffff8801d01a3780 ffffed003a0346f5 0000000c82a23ea0 ffff8800b7bd7700 ffff8801d01a3780 ffff8800b6a1c940 ffffffff82a23ea0 ffff8801d01a3920 ffff8801d01a3748 ffffffff82a223d6 ffff8801d7d64988 Call Trace: [<ffffffff82a223d6>] ip6_route_del+0x106/0x570 net/ipv6/route.c:2109 [<ffffffff82a23f9d>] inet6_rtm_delroute+0xfd/0x100 net/ipv6/route.c:3075 [<ffffffff82621359>] rtnetlink_rcv_msg+0x549/0x7a0 net/core/rtnetlink.c:3450 [<ffffffff8274c1d1>] netlink_rcv_skb+0x141/0x370 net/netlink/af_netlink.c:2281 [<ffffffff82613ddf>] rtnetlink_rcv+0x2f/0x40 net/core/rtnetlink.c:3456 [<ffffffff8274ad38>] netlink_unicast_kernel net/netlink/af_netlink.c:1206 [inline] [<ffffffff8274ad38>] netlink_unicast+0x518/0x750 net/netlink/af_netlink.c:1232 [<ffffffff8274b83e>] netlink_sendmsg+0x8ce/0xc30 net/netlink/af_netlink.c:1778 [<ffffffff82564aff>] sock_sendmsg_nosec net/socket.c:609 [inline] [<ffffffff82564aff>] sock_sendmsg+0xcf/0x110 net/socket.c:619 [<ffffffff82564d62>] sock_write_iter+0x222/0x3a0 net/socket.c:834 [<ffffffff8178523d>] new_sync_write+0x1dd/0x2b0 fs/read_write.c:478 [<ffffffff817853f4>] __vfs_write+0xe4/0x110 fs/read_write.c:491 [<ffffffff81786c38>] vfs_write+0x178/0x4b0 fs/read_write.c:538 [<ffffffff817892a9>] SYSC_write fs/read_write.c:585 [inline] [<ffffffff817892a9>] SyS_write+0xd9/0x1b0 fs/read_write.c:577 [<ffffffff82c71e32>] entry_SYSCALL_64_fastpath+0x12/0x17 Note: there is no "Fixes" tag as this seems to be a bug introduced very early. Signed-off-by: Wei Wang <weiwan@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-18ipv6: reset fn->rr_ptr when replacing routeWei Wang
syzcaller reported the following use-after-free issue in rt6_select(): BUG: KASAN: use-after-free in rt6_select net/ipv6/route.c:755 [inline] at addr ffff8800bc6994e8 BUG: KASAN: use-after-free in ip6_pol_route.isra.46+0x1429/0x1470 net/ipv6/route.c:1084 at addr ffff8800bc6994e8 Read of size 4 by task syz-executor1/439628 CPU: 0 PID: 439628 Comm: syz-executor1 Not tainted 4.3.5+ #8 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 0000000000000000 ffff88018fe435b0 ffffffff81ca384d ffff8801d3588c00 ffff8800bc699380 ffff8800bc699500 dffffc0000000000 ffff8801d40a47c0 ffff88018fe435d8 ffffffff81735751 ffff88018fe43660 ffff8800bc699380 Call Trace: [<ffffffff81ca384d>] __dump_stack lib/dump_stack.c:15 [inline] [<ffffffff81ca384d>] dump_stack+0xc1/0x124 lib/dump_stack.c:51 sctp: [Deprecated]: syz-executor0 (pid 439615) Use of struct sctp_assoc_value in delayed_ack socket option. Use struct sctp_sack_info instead [<ffffffff81735751>] kasan_object_err+0x21/0x70 mm/kasan/report.c:158 [<ffffffff817359c4>] print_address_description mm/kasan/report.c:196 [inline] [<ffffffff817359c4>] kasan_report_error+0x1b4/0x4a0 mm/kasan/report.c:285 [<ffffffff81735d93>] kasan_report mm/kasan/report.c:305 [inline] [<ffffffff81735d93>] __asan_report_load4_noabort+0x43/0x50 mm/kasan/report.c:325 [<ffffffff82a28e39>] rt6_select net/ipv6/route.c:755 [inline] [<ffffffff82a28e39>] ip6_pol_route.isra.46+0x1429/0x1470 net/ipv6/route.c:1084 [<ffffffff82a28fb1>] ip6_pol_route_output+0x81/0xb0 net/ipv6/route.c:1203 [<ffffffff82ab0a50>] fib6_rule_action+0x1f0/0x680 net/ipv6/fib6_rules.c:95 [<ffffffff8265cbb6>] fib_rules_lookup+0x2a6/0x7a0 net/core/fib_rules.c:223 [<ffffffff82ab1430>] fib6_rule_lookup+0xd0/0x250 net/ipv6/fib6_rules.c:41 [<ffffffff82a22006>] ip6_route_output+0x1d6/0x2c0 net/ipv6/route.c:1224 [<ffffffff829e83d2>] ip6_dst_lookup_tail+0x4d2/0x890 net/ipv6/ip6_output.c:943 [<ffffffff829e889a>] ip6_dst_lookup_flow+0x9a/0x250 net/ipv6/ip6_output.c:1079 [<ffffffff82a9f7d8>] ip6_datagram_dst_update+0x538/0xd40 net/ipv6/datagram.c:91 [<ffffffff82aa0978>] __ip6_datagram_connect net/ipv6/datagram.c:251 [inline] [<ffffffff82aa0978>] ip6_datagram_connect+0x518/0xe50 net/ipv6/datagram.c:272 [<ffffffff82aa1313>] ip6_datagram_connect_v6_only+0x63/0x90 net/ipv6/datagram.c:284 [<ffffffff8292f790>] inet_dgram_connect+0x170/0x1f0 net/ipv4/af_inet.c:564 [<ffffffff82565547>] SYSC_connect+0x1a7/0x2f0 net/socket.c:1582 [<ffffffff8256a649>] SyS_connect+0x29/0x30 net/socket.c:1563 [<ffffffff82c72032>] entry_SYSCALL_64_fastpath+0x12/0x17 Object at ffff8800bc699380, in cache ip6_dst_cache size: 384 The root cause of it is that in fib6_add_rt2node(), when it replaces an existing route with the new one, it does not update fn->rr_ptr. This commit resets fn->rr_ptr to NULL when it points to a route which is replaced in fib6_add_rt2node(). Fixes: 27596472473a ("ipv6: fix ECMP route replacement") Signed-off-by: Wei Wang <weiwan@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-18datagram: When peeking datagrams with offset < 0 don't skip empty skbsMatthew Dawson
Due to commit e6afc8ace6dd5cef5e812f26c72579da8806f5ac ("udp: remove headers from UDP packets before queueing"), when udp packets are being peeked the requested extra offset is always 0 as there is no need to skip the udp header. However, when the offset is 0 and the next skb is of length 0, it is only returned once. The behaviour can be seen with the following python script: from socket import *; f=socket(AF_INET6, SOCK_DGRAM | SOCK_NONBLOCK, 0); g=socket(AF_INET6, SOCK_DGRAM | SOCK_NONBLOCK, 0); f.bind(('::', 0)); addr=('::1', f.getsockname()[1]); g.sendto(b'', addr) g.sendto(b'b', addr) print(f.recvfrom(10, MSG_PEEK)); print(f.recvfrom(10, MSG_PEEK)); Where the expected output should be the empty string twice. Instead, make sk_peek_offset return negative values, and pass those values to __skb_try_recv_datagram/__skb_try_recv_from_queue. If the passed offset to __skb_try_recv_from_queue is negative, the checked skb is never skipped. __skb_try_recv_from_queue will then ensure the offset is reset back to 0 if a peek is requested without an offset, unless no packets are found. Also simplify the if condition in __skb_try_recv_from_queue. If _off is greater then 0, and off is greater then or equal to skb->len, then (_off || skb->len) must always be true assuming skb->len >= 0 is always true. Also remove a redundant check around a call to sk_peek_offset in af_unix.c, as it double checked if MSG_PEEK was set in the flags. V2: - Moved the negative fixup into __skb_try_recv_from_queue, and remove now redundant checks - Fix peeking in udp{,v6}_recvmsg to report the right value when the offset is 0 V3: - Marked new branch in __skb_try_recv_from_queue as unlikely. Signed-off-by: Matthew Dawson <matthew@mjdsystems.ca> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>