summaryrefslogtreecommitdiff
path: root/include/net/route.h
AgeCommit message (Collapse)Author
2016-04-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts were two cases of simple overlapping changes, nothing serious. In the UDP case, we need to add a hlist_add_tail_rcu() to linux/rculist.h, because we've moved UDP socket handling away from using nulls lists. Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-11net: vrf: Fix dst reference countingDavid Ahern
Vivek reported a kernel exception deleting a VRF with an active connection through it. The root cause is that the socket has a cached reference to a dst that is destroyed. Converting the dst_destroy to dst_release and letting proper reference counting kick in does not work as the dst has a reference to the device which needs to be released as well. I talked to Hannes about this at netdev and he pointed out the ipv4 and ipv6 dst handling has dst_ifdown for just this scenario. Rather than continuing with the reinvented dst wheel in VRF just remove it and leverage the ipv4 and ipv6 versions. Fixes: 193125dbd8eb2 ("net: Introduce VRF device driver") Fixes: 35402e3136634 ("net: Add IPv6 support to VRF device") Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-07net: Checks skb_dst to be NULL in inet_iifTom Herbert
In inet_iif check if skb_rtable is NULL for the skb and return skb->skb_iif if it is. This change allows inet_iif to be called before the dst information has been set in the skb (e.g. when doing socket based UDP GRO). Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-16ipv4: Namespaceify ip_default_ttl sysctl knobNikolay Borisov
Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-04net: Propagate lookup failure in l3mdev_get_saddr to callerDavid Ahern
Commands run in a vrf context are not failing as expected on a route lookup: root@kenny:~# ip ro ls table vrf-red unreachable default root@kenny:~# ping -I vrf-red -c1 -w1 10.100.1.254 ping: Warning: source address might be selected on device other than vrf-red. PING 10.100.1.254 (10.100.1.254) from 0.0.0.0 vrf-red: 56(84) bytes of data. --- 10.100.1.254 ping statistics --- 2 packets transmitted, 0 received, 100% packet loss, time 999ms Since the vrf table does not have a route for 10.100.1.254 the ping should have failed. The saddr lookup causes a full VRF table lookup. Propogating a lookup failure to the user allows the command to fail as expected: root@kenny:~# ping -I vrf-red -c1 -w1 10.100.1.254 connect: No route to host Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-07net: Add source address lookup op for VRFDavid Ahern
Add operation to l3mdev to lookup source address for a given flow. Add support for the operation to VRF driver and convert existing IPv4 hooks to use the new lookup. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-07net: Rename FLOWI_FLAG_VRFSRC to FLOWI_FLAG_L3MDEV_SRCDavid Ahern
Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-05ipv4: ICMP packet inspection for multipathPeter Nørlund
ICMP packets are inspected to let them route together with the flow they belong to, minimizing the chance that a problematic path will affect flows on other paths, and so that anycast environments can work with ECMP. Signed-off-by: Peter Nørlund <pch@ordbogen.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-29net: Move netif_index_is_l3_master to l3mdev.hDavid Ahern
Change CONFIG dependency to CONFIG_NET_L3_MASTER_DEV as well. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-29net: Rename IFF_VRF_MASTER to IFF_L3MDEV_MASTERDavid Ahern
Rename IFF_VRF_MASTER to IFF_L3MDEV_MASTER and update the name of the netif_is_vrf and netif_index_is_vrf macros. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: net/ipv4/arp.c The net/ipv4/arp.c conflict was one commit adding a new local variable while another commit was deleting one. Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-25inet: constify ip_route_output_flow() socket argumentEric Dumazet
Very soon, TCP stack might call inet_csk_route_req(), which calls inet_csk_route_req() with an unlocked listener socket, so we need to make sure ip_route_output_flow() is not trying to change any field from its socket argument. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-17net: Fix vti use case with oif in dst lookupsDavid Ahern
Steffen reported that the recent change to add oif to dst lookups breaks the VTI use case. The problem is that with the oif set in the flow struct the comparison to the nh_oif is triggered. Fix by splitting the FLOWI_FLAG_VRFSRC into 2 flags -- one that triggers the vrf device cache bypass (FLOWI_FLAG_VRFSRC) and another telling the lookup to not compare nh oif (FLOWI_FLAG_SKIP_NH_OIF). Fixes: 42a7b32b73d6 ("xfrm: Add oif to dst lookups") Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Acked-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-15net: Add FIB table id to rtableDavid Ahern
Add the FIB table id to rtable to make the information available for IPv4 as it is for IPv6. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-01net: Make table id type u32David Ahern
A number of VRF patches used 'int' for table id. It should be u32 to be consistent with the rest of the stack. Fixes: 4e3c89920cd3a ("net: Introduce VRF related flags and helpers") 15be405eb2ea9 ("net: Add inet_addr lookup by table") 30bbaa1950055 ("net: Fix up inet_addr_type checks") 021dd3b8a142d ("net: Add routes to the table associated with the device") dc028da54ed35 ("inet: Move VRF table lookup to inlined function") f6d3c19274c74 ("net: FIB tracepoints") Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-20route: move lwtunnel state to dst_entryJiri Benc
Currently, the lwtunnel state resides in per-protocol data. This is a problem if we encapsulate ipv6 traffic in an ipv4 tunnel (or vice versa). The xmit function of the tunnel does not know whether the packet has been routed to it by ipv4 or ipv6, yet it needs the lwtstate data. Moving the lwtstate data to dst_entry makes such inter-protocol tunneling possible. As a bonus, this brings a nice diffstat. Signed-off-by: Jiri Benc <jbenc@redhat.com> Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-13net: Fix up inet_addr_type checksDavid Ahern
Currently inet_addr_type and inet_dev_addr_type expect local addresses to be in the local table. With the VRF device local routes for devices associated with a VRF will be in the table associated with the VRF. Provide an alternate inet_addr lookup to use a specific table rather than defaulting to the local table. inet_addr_type_dev_table keeps the same semantics as inet_addr_type but if the passed in device is enslaved to a VRF then the table for that VRF is used for the lookup. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-13net: Add inet_addr lookup by tableDavid Ahern
Currently inet_addr_type and inet_dev_addr_type expect local addresses to be in the local table. With the VRF device local routes for devices associated with a VRF will be in the table associated with the VRF. Provide an alternate inet_addr lookup to use a specific table rather than defaulting to the local table. Signed-off-by: Shrijeet Mukherjee <shm@cumulusnetworks.com> Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-13net: Use VRF device index for lookups on TXDavid Ahern
As with ingress use the index of VRF master device for route lookups on egress. However, the oif should only be used to direct the lookups to a specific table. Routes in the table are not based on the VRF device but rather interfaces that are part of the VRF so do not consider the oif for lookups within the table. The FLOWI_FLAG_VRFSRC is used to control this latter part. Signed-off-by: Shrijeet Mukherjee <shm@cumulusnetworks.com> Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21ipv4: support for fib route lwtunnel encap attributesRoopa Prabhu
This patch adds support in ipv4 fib functions to parse user provided encap attributes and attach encap state data to fib_nh and rtable. Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-15ipv4: per cpu uncached listEric Dumazet
RAW sockets with hdrinc suffer from contention on rt_uncached_lock spinlock. One solution is to use percpu lists, since most routes are destroyed by the cpu that created them. It is unclear why we even have to put these routes in uncached_list, as all outgoing packets should be freed when a device is dismantled. Signed-off-by: Eric Dumazet <edumazet@google.com> Fixes: caacf05e5ad1 ("ipv4: Properly purge netdev references on uncached routes.") Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-24ipv4: remove ip_rt_dump from route.cLi RongQing
ip_rt_dump do nothing after IPv4 route caches removal, so we can remove it. Signed-off-by: Li RongQing <roy.qing.li@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-13ipv4: introduce ip_dst_mtu_maybe_forward and protect forwarding path against ↵Hannes Frederic Sowa
pmtu spoofing While forwarding we should not use the protocol path mtu to calculate the mtu for a forwarded packet but instead use the interface mtu. We mark forwarded skbs in ip_forward with IPSKB_FORWARDED, which was introduced for multicast forwarding. But as it does not conflict with our usage in unicast code path it is perfect for reuse. I moved the functions ip_sk_accept_pmtu, ip_sk_use_pmtu and ip_skb_dst_mtu along with the new ip_dst_mtu_maybe_forward to net/ip.h to fix circular dependencies because of IPSKB_FORWARDED. Because someone might have written a software which does probe destinations manually and expects the kernel to honour those path mtus I introduced a new per-namespace "ip_forward_use_pmtu" knob so someone can disable this new behaviour. We also still use mtus which are locked on a route for forwarding. The reason for this change is, that path mtus information can be injected into the kernel via e.g. icmp_err protocol handler without verification of local sockets. As such, this could cause the IPv4 forwarding path to wrongfully emit fragmentation needed notifications or start to fragment packets along a path. Tunnel and ipsec output paths clear IPCB again, thus IPSKB_FORWARDED won't be set and further fragmentation logic will use the path mtu to determine the fragmentation size. They also recheck packet size with help of path mtu discovery and report appropriate errors. Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: David Miller <davem@davemloft.net> Cc: John Heffner <johnwheffner@gmail.com> Cc: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-06net: Remove FLOWI_FLAG_CAN_SLEEPSteffen Klassert
FLOWI_FLAG_CAN_SLEEP was used to notify xfrm about the posibility to sleep until the needed states are resolved. This code is gone, so FLOWI_FLAG_CAN_SLEEP is not needed anymore. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2013-11-05ipv4: introduce new IP_MTU_DISCOVER mode IP_PMTUDISC_INTERFACEHannes Frederic Sowa
Sockets marked with IP_PMTUDISC_INTERFACE won't do path mtu discovery, their sockets won't accept and install new path mtu information and they will always use the interface mtu for outgoing packets. It is guaranteed that the packet is not fragmented locally. But we won't set the DF-Flag on the outgoing frames. Florian Weimer had the idea to use this flag to ensure DNS servers are never generating outgoing fragments. They may well be fragmented on the path, but the server never stores or usees path mtu values, which could well be forged in an attack. (The root of the problem with path MTU discovery is that there is no reliable way to authenticate ICMP Fragmentation Needed But DF Set messages because they are sent from intermediate routers with their source addresses, and the IMCP payload will not always contain sufficient information to identify a flow.) Recent research in the DNS community showed that it is possible to implement an attack where DNS cache poisoning is feasible by spoofing fragments. This work was done by Amir Herzberg and Haya Shulman: <https://sites.google.com/site/hayashulman/files/fragmentation-poisoning.pdf> This issue was previously discussed among the DNS community, e.g. <http://www.ietf.org/mail-archive/web/dnsext/current/msg01204.html>, without leading to fixes. This patch depends on the patch "ipv4: fix DO and PROBE pmtu mode regarding local fragmentation with UFO/CORK" for the enforcement of the non-fragmentable checks. If other users than ip_append_page/data should use this semantic too, we have to add a new flag to IPCB(skb)->flags to suppress local fragmentation and check for this in ip_finish_output. Many thanks to Florian Weimer for the idea and feedback while implementing this patch. Cc: David S. Miller <davem@davemloft.net> Suggested-by: Florian Weimer <fweimer@redhat.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-17ipv4: shrink rt_cache_statEric Dumazet
Half of the rt_cache_stat fields are no longer used after IP route cache removal, lets shrink this per cpu area. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-28ipv4: processing ancillary IP_TOS or IP_TTLFrancesco Fusco
If IP_TOS or IP_TTL are specified as ancillary data, then sendmsg() sends out packets with the specified TTL or TOS overriding the socket values specified with the traditional setsockopt(). The struct inet_cork stores the values of TOS, TTL and priority that are passed through the struct ipcm_cookie. If there are user-specified TOS (tos != -1) or TTL (ttl != 0) in the struct ipcm_cookie, these values are used to override the per-socket values. In case of TOS also the priority is changed accordingly. Two helper functions get_rttos and get_rtconn_flags are defined to take into account the presence of a user specified TOS value when computing RT_TOS and RT_CONN_FLAGS. Signed-off-by: Francesco Fusco <ffusco@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-23route.h: Remove extern from function prototypesJoe Perches
There are a mix of function prototypes with and without extern in the kernel sources. Standardize on not using extern for function prototypes. Function prototypes don't need to be written with extern. extern is assumed by the compiler. Its use is as unnecessary as using auto to declare automatic/local variables in a block. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-14xfrm: introduce helper for safe determination of mtuHannes Frederic Sowa
skb->sk socket can be of AF_INET or AF_INET6 address family. Thus we always have to make sure we a referring to the correct interpretation of skb->sk. We only depend on header defines to query the mtu, so we don't introduce a new dependency to ipv6 by this change. Cc: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2012-11-03ipv4: avoid a test in ip_rt_put()Eric Dumazet
We can save a test in ip_rt_put(), considering dst_release() accepts a NULL parameter, and dst is first element in rtable. Add a BUILD_BUG_ON() to catch any change that could break this assertion. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Cong Wang <amwang@redhat.com> Acked-by: Cong Wang <amwang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-10-08ipv4: introduce rt_uses_gatewayJulian Anastasov
Add new flag to remember when route is via gateway. We will use it to allow rt_gateway to contain address of directly connected host for the cases when DST_NOCACHE is used or when the NH exception caches per-destination route without DST_NOCACHE flag, i.e. when routes are not used for other destinations. By this way we force the neighbour resolving to work with the routed destination but we can use different address in the packet, feature needed for IPVS-DR where original packet for virtual IP is routed via route to real IP. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-18ipv4/route: arg delay is useless in rt_cache_flush()Nicolas Dichtel
Since route cache deletion (89aef8921bfbac22f), delay is no more used. Remove it. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-31ipv4: Properly purge netdev references on uncached routes.David S. Miller
When a device is unregistered, we have to purge all of the references to it that may exist in the entire system. If a route is uncached, we currently have no way of accomplishing this. So create a global list that is scanned when a network device goes down. This mirrors the logic in net/core/dst.c's dst_ifdown(). Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-26ipv4: Fix input route performance regression.David S. Miller
With the routing cache removal we lost the "noref" code paths on input, and this can kill some routing workloads. Reinstate the noref path when we hit a cached route in the FIB nexthops. With help from Eric Dumazet. Reported-by: Alexander Duyck <alexander.duyck@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-23ipv4: Change rt->rt_iif encoding.David S. Miller
On input packet processing, rt->rt_iif will be zero if we should use skb->dev->ifindex. Since we access rt->rt_iif consistently via inet_iif(), that is the only spot whose interpretation have to adjust. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-20ipv4: Kill rt->fiDavid S. Miller
It's not really needed. We only grabbed a reference to the fib_info for the sake of fib_info local metrics. However, fib_info objects are freed using RCU, as are therefore their private metrics (if any). We would have triggered a route cache flush if we eliminated a reference to a fib_info object in the routing tables. Therefore, any existing cached routes will first check and see that they have been invalidated before an errant reference to these metric values would occur. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-20ipv4: Turn rt->rt_route_iif into rt->rt_is_input.David S. Miller
That is this value's only use, as a boolean to indicate whether a route is an input route or not. So implement it that way, using a u16 gap present in the struct already. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-20ipv4: Kill rt->rt_oifDavid S. Miller
Never actually used. It was being set on output routes to the original OIF specified in the flow key used for the lookup. Adjust the only user, ipmr_rt_fib_lookup(), for greater correctness of the flowi4_oif and flowi4_iif values, thanks to feedback from Julian Anastasov. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-20ipv4: Adjust semantics of rt->rt_gateway.David S. Miller
In order to allow prefixed routes, we have to adjust how rt_gateway is set and interpreted. The new interpretation is: 1) rt_gateway == 0, destination is on-link, nexthop is iph->daddr 2) rt_gateway != 0, destination requires a nexthop gateway Abstract the fetching of the proper nexthop value using a new inline helper, rt_nexthop(), as suggested by Joe Perches. Signed-off-by: David S. Miller <davem@davemloft.net> Tested-by: Vijay Subramanian <subramanian.vijay@gmail.com>
2012-07-20ipv4: Remove 'rt_dst' from 'struct rtable'David S. Miller
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-20ipv4: Remove 'rt_mark' from 'struct rtable'David Miller
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-20ipv4: Kill 'rt_src' from 'struct rtable'David Miller
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-20ipv4: Remove rt_key_{src,dst,tos} from struct rtable.David Miller
They are always used in contexts where they can be reconstituted, or where the finally resolved rt->rt_{src,dst} is semantically equivalent. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-20ipv4: Kill ip_route_input_noref().David Miller
The "noref" argument to ip_route_input_common() is now always ignored because we do not cache routes, and in that case we must always grab a reference to the resulting 'dst'. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-20ipv4: Delete routing cache.David S. Miller
The ipv4 routing cache is non-deterministic, performance wise, and is subject to reasonably easy to launch denial of service attacks. The routing cache works great for well behaved traffic, and the world was a much friendlier place when the tradeoffs that led to the routing cache's design were considered. What it boils down to is that the performance of the routing cache is a product of the traffic patterns seen by a system rather than being a product of the contents of the routing tables. The former of which is controllable by external entitites. Even for "well behaved" legitimate traffic, high volume sites can see hit rates in the routing cache of only ~%10. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-11ipv4: Kill ip_rt_redirect().David S. Miller
No longer needed, as the protocol handlers now all properly propagate the redirect back into the routing code. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-11ipv4: Add ipv4_redirect() and ipv4_sk_redirect() helper functions.David S. Miller
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-11ipv4: Rearrange arguments to ip_rt_redirect()David S. Miller
Pass in the SKB rather than just the IP addresses, so that policy and other aspects can reside in ip_rt_redirect() rather then icmp_redirect(). Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-10ipv4: Remove inetpeer from routes.David S. Miller
No longer used. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-10ipv4: Maintain redirect and PMTU info in struct rtable again.David S. Miller
Maintaining this in the inetpeer entries was not the right way to do this at all. Signed-off-by: David S. Miller <davem@davemloft.net>