summaryrefslogtreecommitdiff
path: root/net/ipv4/ip_tunnel_core.c
AgeCommit message (Collapse)Author
2021-01-08net: ip_tunnel: clean up endianness conversionsJulian Wiedmann
sparse complains about some harmless endianness issues: > net/ipv4/ip_tunnel_core.c:225:43: warning: cast to restricted __be16 > net/ipv4/ip_tunnel_core.c:225:43: warning: incorrect type in initializer (different base types) > net/ipv4/ip_tunnel_core.c:225:43: expected restricted __be16 [usertype] mtu > net/ipv4/ip_tunnel_core.c:225:43: got unsigned short [usertype] iptunnel_pmtud_build_icmp() uses the wrong flavour of byte-order conversion when storing the MTU into the ICMPv4 packet. Use htons(), just like iptunnel_pmtud_build_icmpv6() does. > net/ipv4/ip_tunnel_core.c:248:35: warning: cast from restricted __be16 > net/ipv4/ip_tunnel_core.c:248:35: warning: incorrect type in argument 3 (different base types) > net/ipv4/ip_tunnel_core.c:248:35: expected unsigned short type > net/ipv4/ip_tunnel_core.c:248:35: got restricted __be16 [usertype] > net/ipv4/ip_tunnel_core.c:341:35: warning: cast from restricted __be16 > net/ipv4/ip_tunnel_core.c:341:35: warning: incorrect type in argument 3 (different base types) > net/ipv4/ip_tunnel_core.c:341:35: expected unsigned short type > net/ipv4/ip_tunnel_core.c:341:35: got restricted __be16 [usertype] eth_header() wants the Ethertype in host-order, use the correct flavour of byte-order conversion. > net/ipv4/ip_tunnel_core.c:600:45: warning: restricted __be16 degrades to integer > net/ipv4/ip_tunnel_core.c:609:30: warning: incorrect type in assignment (different base types) > net/ipv4/ip_tunnel_core.c:609:30: expected int type > net/ipv4/ip_tunnel_core.c:609:30: got restricted __be16 [usertype] > net/ipv4/ip_tunnel_core.c:619:30: warning: incorrect type in assignment (different base types) > net/ipv4/ip_tunnel_core.c:619:30: expected int type > net/ipv4/ip_tunnel_core.c:619:30: got restricted __be16 [usertype] > net/ipv4/ip_tunnel_core.c:629:30: warning: incorrect type in assignment (different base types) > net/ipv4/ip_tunnel_core.c:629:30: expected int type > net/ipv4/ip_tunnel_core.c:629:30: got restricted __be16 [usertype] The TUNNEL_* types are big-endian, so adjust the type of the local variable in ip_tun_parse_opts(). Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Link: https://lore.kernel.org/r/20210107144008.25777-1-jwi@linux.ibm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-12Merge https://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-09net: remove ip_tunnel_get_stats64Heiner Kallweit
After having migrated all users remove ip_tunnel_get_stats64(). Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-09tunnels: Fix off-by-one in lower MTU bounds for ICMP/ICMPv6 repliesStefano Brivio
Jianlin reports that a bridged IPv6 VXLAN endpoint, carrying IPv6 packets over a link with a PMTU estimation of exactly 1350 bytes, won't trigger ICMPv6 Packet Too Big replies when the encapsulated datagrams exceed said PMTU value. VXLAN over IPv6 adds 70 bytes of overhead, so an ICMPv6 reply indicating 1280 bytes as inner MTU would be legitimate and expected. This comes from an off-by-one error I introduced in checks added as part of commit 4cb47a8644cc ("tunnels: PMTU discovery support for directly bridged IP packets"), whose purpose was to prevent sending ICMPv6 Packet Too Big messages with an MTU lower than the smallest permissible IPv6 link MTU, i.e. 1280 bytes. In iptunnel_pmtud_check_icmpv6(), avoid triggering a reply only if the advertised MTU would be less than, and not equal to, 1280 bytes. Also fix the analogous comparison for IPv4, that is, skip the ICMP reply only if the resulting MTU is strictly less than 576 bytes. This becomes apparent while running the net/pmtu.sh bridged VXLAN or GENEVE selftests with adjusted lower-link MTU values. Using e.g. GENEVE, setting ll_mtu to the values reported below, in the test_pmtu_ipvX_over_bridged_vxlanY_or_geneveY_exception() test function, we can see failures on the following tests: test | ll_mtu -------------------------------|-------- pmtu_ipv4_br_geneve4_exception | 626 pmtu_ipv6_br_geneve4_exception | 1330 pmtu_ipv6_br_geneve6_exception | 1350 owing to the different tunneling overheads implied by the corresponding configurations. Reported-by: Jianlin Shi <jishi@redhat.com> Fixes: 4cb47a8644cc ("tunnels: PMTU discovery support for directly bridged IP packets") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Link: https://lore.kernel.org/r/4f5fc2f33bfdf8409549fafd4f952b008bf04d63.1604681709.git.sbrivio@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-13iptunnel: use new function dev_fetch_sw_netstatsHeiner Kallweit
Simplify the code by using new function dev_fetch_sw_netstats(). Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Link: https://lore.kernel.org/r/050f9a83-b195-a3d6-edbd-91a59040be21@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-09-14lwtunnel: only keep the available bits when setting vxlan md->gbpXin Long
As we can see from vxlan_build/parse_gbp_hdr(), when processing metadata on vxlan rx/tx path, only dont_learn/policy_applied/policy_id fields can be set to or parse from the packet for vxlan gbp option. So do the mask when set it in lwtunnel, as it does in act_tunnel_key and cls_flower. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-05ip_tunnel_core: Fix build for archs without _HAVE_ARCH_IPV6_CSUMStefano Brivio
On architectures defining _HAVE_ARCH_IPV6_CSUM, we get csum_ipv6_magic() defined by means of arch checksum.h headers. On other architectures, we actually need to include net/ip6_checksum.h to be able to use it. Without this include, building with defconfig breaks at least for s390. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Fixes: 4cb47a8644cc ("tunnels: PMTU discovery support for directly bridged IP packets") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-04tunnels: PMTU discovery support for directly bridged IP packetsStefano Brivio
It's currently possible to bridge Ethernet tunnels carrying IP packets directly to external interfaces without assigning them addresses and routes on the bridged network itself: this is the case for UDP tunnels bridged with a standard bridge or by Open vSwitch. PMTU discovery is currently broken with those configurations, because the encapsulation effectively decreases the MTU of the link, and while we are able to account for this using PMTU discovery on the lower layer, we don't have a way to relay ICMP or ICMPv6 messages needed by the sender, because we don't have valid routes to it. On the other hand, as a tunnel endpoint, we can't fragment packets as a general approach: this is for instance clearly forbidden for VXLAN by RFC 7348, section 4.3: VTEPs MUST NOT fragment VXLAN packets. Intermediate routers may fragment encapsulated VXLAN packets due to the larger frame size. The destination VTEP MAY silently discard such VXLAN fragments. The same paragraph recommends that the MTU over the physical network accomodates for encapsulations, but this isn't a practical option for complex topologies, especially for typical Open vSwitch use cases. Further, it states that: Other techniques like Path MTU discovery (see [RFC1191] and [RFC1981]) MAY be used to address this requirement as well. Now, PMTU discovery already works for routed interfaces, we get route exceptions created by the encapsulation device as they receive ICMP Fragmentation Needed and ICMPv6 Packet Too Big messages, and we already rebuild those messages with the appropriate MTU and route them back to the sender. Add the missing bits for bridged cases: - checks in skb_tunnel_check_pmtu() to understand if it's appropriate to trigger a reply according to RFC 1122 section 3.2.2 for ICMP and RFC 4443 section 2.4 for ICMPv6. This function is already called by UDP tunnels - a new function generating those ICMP or ICMPv6 replies. We can't reuse icmp_send() and icmp6_send() as we don't see the sender as a valid destination. This doesn't need to be generic, as we don't cover any other type of ICMP errors given that we only provide an encapsulation function to the sender While at it, make the MTU check in skb_tunnel_check_pmtu() accurate: we might receive GSO buffers here, and the passed headroom already includes the inner MAC length, so we don't have to account for it a second time (that would imply three MAC headers on the wire, but there are just two). This issue became visible while bridging IPv6 packets with 4500 bytes of payload over GENEVE using IPv4 with a PMTU of 4000. Given the 50 bytes of encapsulation headroom, we would advertise MTU as 3950, and we would reject fragmented IPv6 datagrams of 3958 bytes size on the wire. We're exclusively dealing with network MTU here, though, so we could get Ethernet frames up to 3964 octets in that case. v2: - moved skb_tunnel_check_pmtu() to ip_tunnel_core.c (David Ahern) - split IPv4/IPv6 functions (David Ahern) Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-30net: ip_tunnel: add header_ops for layer 3 devicesJason A. Donenfeld
Some devices that take straight up layer 3 packets benefit from having a shared header_ops so that AF_PACKET sockets can inject packets that are recognized. This shared infrastructure will be used by other drivers that currently can't inject packets using AF_PACKET. It also exposes the parser function, as it is useful in standalone form too. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29net: add net available in build_stateAlexander Aring
The build_state callback of lwtunnel doesn't contain the net namespace structure yet. This patch will add it so we can check on specific address configuration at creation time of rpl source routes. Signed-off-by: Alexander Aring <alex.aring@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-21lwtunnel: check erspan options before allocating tun_infoXin Long
As Jakub suggested on another patch, it's better to do the check on erspan options before allocating memory. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-21lwtunnel: be STRICT to validate the new LWTUNNEL_IP(6)_OPTSXin Long
LWTUNNEL_IP(6)_OPTS are the new items in ip(6)_tun_policy, which are parsed by nla_parse_nested_deprecated(). We should check it strictly by setting .strict_start_type = LWTUNNEL_IP(6)_OPTS. This patch also adds missing LWTUNNEL_IP6_OPTS in ip6_tun_policy. Fixes: 4ece47787077 ("lwtunnel: add options setting and dumping for geneve") Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-20lwtunnel: add support for multiple geneve optsXin Long
geneve RFC (draft-ietf-nvo3-geneve-14) allows a geneve packet to carry multiple geneve opts, so it's necessary for lwtunnel to support adding multiple geneve opts in one lwtunnel route. But vxlan and erspan opts are still only allowed to add one option. With this patch, iproute2 could make it like: # ip r a 1.1.1.0/24 encap ip id 1 geneve_opts 0:0:12121212,1:2:12121212 \ dst 10.1.0.2 dev geneve1 # ip r a 1.1.1.0/24 encap ip id 1 vxlan_opts 456 \ dst 10.1.0.2 dev erspan1 # ip r a 1.1.1.0/24 encap ip id 1 erspan_opts 1:123:0:0 \ dst 10.1.0.2 dev erspan1 Which are pretty much like cls_flower and act_tunnel_key. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-18lwtunnel: change to use nla_put_u8 for LWTUNNEL_IP_OPT_ERSPAN_VERXin Long
LWTUNNEL_IP_OPT_ERSPAN_VER is u8 type, and nla_put_u8 should have been used instead of nla_put_u32(). This is a copy-paste error. Fixes: b0a21810bd5e ("lwtunnel: add options setting and dumping for erspan") Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-11lwtunnel: ignore any TUNNEL_OPTIONS_PRESENT flags set by usersXin Long
TUNNEL_OPTIONS_PRESENT (TUNNEL_GENEVE_OPT|TUNNEL_VXLAN_OPT| TUNNEL_ERSPAN_OPT) flags should be set only according to tb[LWTUNNEL_IP_OPTS], which is done in ip_tun_parse_opts(). When setting info key.tun_flags, the TUNNEL_OPTIONS_PRESENT bits in tb[LWTUNNEL_IP(6)_FLAGS] passed from users should be ignored. While at it, replace all (TUNNEL_GENEVE_OPT|TUNNEL_VXLAN_OPT| TUNNEL_ERSPAN_OPT) with 'TUNNEL_OPTIONS_PRESENT'. Fixes: 3093fbe7ff4b ("route: Per route IP tunnel metadata via lightweight tunnel") Fixes: 32a2b002ce61 ("ipv6: route: per route IP tunnel metadata via lightweight tunnel") Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-11lwtunnel: get nlsize for erspan options properlyXin Long
erspan v1 has OPT_ERSPAN_INDEX while erspan v2 has OPT_ERSPAN_DIR and OPT_ERSPAN_HWID attributes, and they require different nlsize when dumping. So this patch is to get nlsize for erspan options properly according to erspan version. Fixes: b0a21810bd5e ("lwtunnel: add options setting and dumping for erspan") Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-11lwtunnel: change to use nla_parse_nested on new optionsXin Long
As the new options added in kernel, all should always use strict parsing from the beginning with nla_parse_nested(), instead of nla_parse_nested_deprecated(). Fixes: b0a21810bd5e ("lwtunnel: add options setting and dumping for erspan") Fixes: edf31cbb1502 ("lwtunnel: add options setting and dumping for vxlan") Fixes: 4ece47787077 ("lwtunnel: add options setting and dumping for geneve") Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-06lwtunnel: add options setting and dumping for erspanXin Long
Based on the code framework built on the last patch, to support setting and dumping for vxlan, we only need to add ip_tun_parse_opts_erspan() for .build_state and ip_tun_fill_encap_opts_erspan() for .fill_encap and if (tun_flags & TUNNEL_ERSPAN_OPT) for .get_encap_size. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-06lwtunnel: add options setting and dumping for vxlanXin Long
Based on the code framework built on the last patch, to support setting and dumping for vxlan, we only need to add ip_tun_parse_opts_vxlan() for .build_state and ip_tun_fill_encap_opts_vxlan() for .fill_encap and if (tun_flags & TUNNEL_VXLAN_OPT) for .get_encap_size. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-06lwtunnel: add options setting and dumping for geneveXin Long
To add options setting and dumping, .build_state(), .fill_encap() and .get_encap_size() in ip_tun_lwt_ops needs to be extended: ip_tun_build_state(): ip_tun_parse_opts(): ip_tun_parse_opts_geneve() ip_tun_fill_encap_info(): ip_tun_fill_encap_opts(): ip_tun_fill_encap_opts_geneve() ip_tun_encap_nlsize() ip_tun_opts_nlsize(): if (tun_flags & TUNNEL_GENEVE_OPT) ip_tun_parse_opts(), ip_tun_fill_encap_opts() and ip_tun_opts_nlsize() processes LWTUNNEL_IP_OPTS. ip_tun_parse_opts_geneve(), ip_tun_fill_encap_opts_geneve() and if (tun_flags & TUNNEL_GENEVE_OPT) processes LWTUNNEL_IP_OPTS_GENEVE. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-06lwtunnel: add options process for cmp_encapXin Long
When comparing two tun_info, dst_cache member should have been skipped, as dst_cache is a per cpu pointer and they are always different values even in two tun_info with the same keys. So this patch is to skip dst_cache member and compare the key, mode and options_len only. For the future opts setting support, also to compare options. Fixes: 2d79849903e0 ("lwtunnel: ip tunnel: fix multiple routes with different encap") Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-06lwtunnel: add options process for arp requestXin Long
Without options copied to the dst tun_info in iptunnel_metadata_reply() called by arp_process for handling arp_request, the generated arp_reply packet may be dropped or sent out with wrong options for some tunnels like erspan and vxlan, and the traffic will break. Fixes: 63d008a4e9ee ("ipv4: send arp replies to the correct tunnel") Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-18ip_tunnel: allow not to count pkts on tstats by setting skb's dev to NULLXin Long
iptunnel_xmit() works as a common function, also used by a udp tunnel which doesn't have to have a tunnel device, like how TIPC works with udp media. In these cases, we should allow not to count pkts on dev's tstats, so that udp tunnel can work with no tunnel device safely. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-05treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 269Thomas Gleixner
Based on 1 normalized pattern(s): this program is free software you can redistribute it and or modify it under the terms of version 2 of the gnu general public license as published by the free software foundation this program is distributed in the hope that it will be useful but without any warranty without even the implied warranty of merchantability or fitness for a particular purpose see the gnu general public license for more details you should have received a copy of the gnu general public license along with this program if not write to the free software foundation inc 51 franklin street fifth floor boston ma 02110 1301 usa extracted by the scancode license scanner the SPDX license identifier GPL-2.0-only has been chosen to replace the boilerplate/reference in 21 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexios Zavras <alexios.zavras@intel.com> Reviewed-by: Allison Randal <allison@lohutok.net> Reviewed-by: Richard Fontana <rfontana@redhat.com> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190529141334.228102212@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-04-27netlink: make validation more configurable for future strictnessJohannes Berg
We currently have two levels of strict validation: 1) liberal (default) - undefined (type >= max) & NLA_UNSPEC attributes accepted - attribute length >= expected accepted - garbage at end of message accepted 2) strict (opt-in) - NLA_UNSPEC attributes accepted - attribute length >= expected accepted Split out parsing strictness into four different options: * TRAILING - check that there's no trailing data after parsing attributes (in message or nested) * MAXTYPE - reject attrs > max known type * UNSPEC - reject attributes with NLA_UNSPEC policy entries * STRICT_ATTRS - strictly validate attribute size The default for future things should be *everything*. The current *_strict() is a combination of TRAILING and MAXTYPE, and is renamed to _deprecated_strict(). The current regular parsing has none of this, and is renamed to *_parse_deprecated(). Additionally it allows us to selectively set one of the new flags even on old policies. Notably, the UNSPEC flag could be useful in this case, since it can be arranged (by filling in the policy) to not be an incompatible userspace ABI change, but would then going forward prevent forgetting attribute entries. Similar can apply to the POLICY flag. We end up with the following renames: * nla_parse -> nla_parse_deprecated * nla_parse_strict -> nla_parse_deprecated_strict * nlmsg_parse -> nlmsg_parse_deprecated * nlmsg_parse_strict -> nlmsg_parse_deprecated_strict * nla_parse_nested -> nla_parse_nested_deprecated * nla_validate_nested -> nla_validate_nested_deprecated Using spatch, of course: @@ expression TB, MAX, HEAD, LEN, POL, EXT; @@ -nla_parse(TB, MAX, HEAD, LEN, POL, EXT) +nla_parse_deprecated(TB, MAX, HEAD, LEN, POL, EXT) @@ expression NLH, HDRLEN, TB, MAX, POL, EXT; @@ -nlmsg_parse(NLH, HDRLEN, TB, MAX, POL, EXT) +nlmsg_parse_deprecated(NLH, HDRLEN, TB, MAX, POL, EXT) @@ expression NLH, HDRLEN, TB, MAX, POL, EXT; @@ -nlmsg_parse_strict(NLH, HDRLEN, TB, MAX, POL, EXT) +nlmsg_parse_deprecated_strict(NLH, HDRLEN, TB, MAX, POL, EXT) @@ expression TB, MAX, NLA, POL, EXT; @@ -nla_parse_nested(TB, MAX, NLA, POL, EXT) +nla_parse_nested_deprecated(TB, MAX, NLA, POL, EXT) @@ expression START, MAX, POL, EXT; @@ -nla_validate_nested(START, MAX, POL, EXT) +nla_validate_nested_deprecated(START, MAX, POL, EXT) @@ expression NLH, HDRLEN, MAX, POL, EXT; @@ -nlmsg_validate(NLH, HDRLEN, MAX, POL, EXT) +nlmsg_validate_deprecated(NLH, HDRLEN, MAX, POL, EXT) For this patch, don't actually add the strict, non-renamed versions yet so that it breaks compile if I get it wrong. Also, while at it, make nla_validate and nla_parse go down to a common __nla_validate_parse() function to avoid code duplication. Ultimately, this allows us to have very strict validation for every new caller of nla_parse()/nlmsg_parse() etc as re-introduced in the next patch, while existing things will continue to work as is. In effect then, this adds fully strict validation for any new command. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-24ip_tunnel: Add dst_cache support in lwtunnel_state of ip tunnelwenxu
The lwtunnel_state is not init the dst_cache Which make the ip_md_tunnel_xmit can't use the dst_cache. It will lookup route table every packets. Signed-off-by: wenxu <wenxu@ucloud.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-24Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Pull in bug fixes before respinning my net-next pull request. Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-24iptunnel: Set tun_flags in the iptunnel_metadata_reply from srcwenxu
ip l add tun type gretap external ip r a 10.0.0.2 encap ip id 1000 dst 172.168.0.2 key dev tun ip a a 10.0.0.1/24 dev tun The peer arp request to 10.0.0.1 with tunnel_id, but the arp reply only set the tun_id but not the tun_flags with TUNNEL_KEY. The arp reply packet don't contain tun_id field. Signed-off-by: wenxu <wenxu@ucloud.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-19Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
2018-11-17ip_tunnel: don't force DF when MTU is lockedSabrina Dubroca
The various types of tunnels running over IPv4 can ask to set the DF bit to do PMTU discovery. However, PMTU discovery is subject to the threshold set by the net.ipv4.route.min_pmtu sysctl, and is also disabled on routes with "mtu lock". In those cases, we shouldn't set the DF bit. This patch makes setting the DF bit conditional on the route's MTU locking state. This issue seems to be older than git history. Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-08ipv4/tunnel: use __vlan_hwaccel helpersMichał Mirosław
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-10net/ipv4: Update ip_tunnel_metadata_cnt static key to modern apiDavidlohr Bueso
No changes in refcount semantics -- key init is false; replace static_key_slow_inc|dec with static_branch_inc|dec static_key_false with static_branch_unlikely Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-25net: store port/representator id in metadata_dstJakub Kicinski
Switches and modern SR-IOV enabled NICs may multiplex traffic from Port representators and control messages over single set of hardware queues. Control messages and muxed traffic may need ordered delivery. Those requirements make it hard to comfortably use TC infrastructure today unless we have a way of attaching metadata to skbs at the upper device. Because single set of queues is used for many netdevs stopping TC/sched queues of all of them reliably is impossible and lower device has to retreat to returning NETDEV_TX_BUSY and usually has to take extra locks on the fastpath. This patch attempts to enable port/representative devs to attach metadata to skbs which carry port id. This way representatives can be queueless and all queuing can be performed at the lower netdev in the usual way. Traffic arriving on the port/representative interfaces will be have metadata attached and will subsequently be queued to the lower device for transmission. The lower device should recognize the metadata and translate it to HW specific format which is most likely either a special header inserted before the network headers or descriptor/metadata fields. Metadata is associated with the lower device by storing the netdev pointer along with port id so that if TC decides to redirect or mirror the new netdev will not try to interpret it. This is mostly for SR-IOV devices since switches don't have lower netdevs today. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com> Signed-off-by: Simon Horman <horms@verge.net.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-30net: add extack arg to lwtunnel build stateDavid Ahern
Pass extack arg down to lwtunnel_build_state and the build_state callbacks. Add messages for failures in lwtunnel_build_state, and add the extarg to nla_parse where possible in the build_state callbacks. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-13netlink: pass extended ACK struct to parsing functionsJohannes Berg
Pass the new extended ACK reporting struct to all of the generic netlink parsing functions. For now, pass NULL in almost all callers (except for some in the core.) Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-30lwtunnel: remove device arg to lwtunnel_build_stateDavid Ahern
Nothing about lwt state requires a device reference, so remove the input argument. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-28Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Two trivial overlapping changes conflicts in MPLS and mlx5. Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24net: Specify the owning module for lwtunnel opsRobert Shearman
Modules implementing lwtunnel ops should not be allowed to unload while there is state alive using those ops, so specify the owning module for all lwtunnel ops. Signed-off-by: Robert Shearman <rshearma@brocade.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-08net: make ndo_get_stats64 a void functionstephen hemminger
The network device operation for reading statistics is only called in one place, and it ignores the return value. Having a structure return value is potentially confusing because some future driver could incorrectly assume that the return value was used. Fix all drivers with ndo_get_stats64 to have a void function. Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03ipv4: allow local fragmentation in ip_finish_output_gso()Lance Richardson
Some configurations (e.g. geneve interface with default MTU of 1500 over an ethernet interface with 1500 MTU) result in the transmission of packets that exceed the configured MTU. While this should be considered to be a "bad" configuration, it is still allowed and should not result in the sending of packets that exceed the configured MTU. Fix by dropping the assumption in ip_finish_output_gso() that locally originated gso packets will never need fragmentation. Basic testing using iperf (observing CPU usage and bandwidth) have shown no measurable performance impact for traffic not requiring fragmentation. Fixes: c7ba65d7b649 ("net: ip: push gso skb forwarding handling down the stack") Reported-by: Jan Tluka <jtluka@redhat.com> Signed-off-by: Lance Richardson <lrichard@redhat.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-09ip_tunnel: do not clear l4 hashesEric Dumazet
If skb has a valid l4 hash, there is no point clearing hash and force a further flow dissection when a tunnel encapsulation is added. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-22net: ip_finish_output_gso: Allow fragmenting segments of tunneled skbs if ↵Shmulik Ladkani
their DF is unset In b8247f095e, "net: ip_finish_output_gso: If skb_gso_network_seglen exceeds MTU, allow segmentation for local udp tunneled skbs" gso skbs arriving from an ingress interface that go through UDP tunneling, are allowed to be fragmented if the resulting encapulated segments exceed the dst mtu of the egress interface. This aligned the behavior of gso skbs to non-gso skbs going through udp encapsulation path. However the non-gso vs gso anomaly is present also in the following cases of a GRE tunnel: - ip_gre in collect_md mode, where TUNNEL_DONT_FRAGMENT is not set (e.g. OvS vport-gre with df_default=false) - ip_gre in nopmtudisc mode, where IFLA_GRE_IGNORE_DF is set In both of the above cases, the non-gso skbs get fragmented, whereas the gso skbs (having skb_gso_network_seglen that exceeds dst mtu) get dropped, as they don't go through the segment+fragment code path. Fix: Setting IPSKB_FRAG_SEGS if the tunnel specified IP_DF bit is NOT set. Tunnels that do set IP_DF, will not go to fragmentation of segments. This preserves behavior of ip_gre in (the default) pmtudisc mode. Fixes: b8247f095e ("net: ip_finish_output_gso: If skb_gso_network_seglen exceeds MTU, allow segmentation for local udp tunneled skbs") Reported-by: wenxu <wenxu@ucloud.cn> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com> Tested-by: wenxu <wenxu@ucloud.cn> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-19net: ip_finish_output_gso: If skb_gso_network_seglen exceeds MTU, allow ↵Shmulik Ladkani
segmentation for local udp tunneled skbs Given: - tap0 and vxlan0 are bridged - vxlan0 stacked on eth0, eth0 having small mtu (e.g. 1400) Assume GSO skbs arriving from tap0 having a gso_size as determined by user-provided virtio_net_hdr (e.g. 1460 corresponding to VM mtu of 1500). After encapsulation these skbs have skb_gso_network_seglen that exceed eth0's ip_skb_dst_mtu. These skbs are accidentally passed to ip_finish_output2 AS IS. Alas, each final segment (segmented either by validate_xmit_skb or by hardware UFO) would be larger than eth0 mtu. As a result, those above-mtu segments get dropped on certain networks. This behavior is not aligned with the NON-GSO case: Assume a non-gso 1500-sized IP packet arrives from tap0. After encapsulation, the vxlan datagram is fragmented normally at the ip_finish_output-->ip_fragment code path. The expected behavior for the GSO case would be segmenting the "gso-oversized" skb first, then fragmenting each segment according to dst mtu, and finally passing the resulting fragments to ip_finish_output2. 'ip_finish_output_gso' already supports this "Slowpath" behavior, according to the IPSKB_FRAG_SEGS flag, which is only set during ipv4 forwarding (not set in the bridged case). In order to support the bridged case, we'll mark skbs arriving from an ingress interface that get udp-encaspulated as "allowed to be fragmented", causing their network_seglen to be validated by 'ip_finish_output_gso' (and fragment if needed). Note the TUNNEL_DONT_FRAGMENT tun_flag is still honoured (both in the gso and non-gso cases), which serves users wishing to forbid fragmentation at the udp tunnel endpoint. Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Cc: Florian Westphal <fw@strlen.de> Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-20ip6_tun: Add infrastructure for doing encapsulationTom Herbert
Add encap_hlen and ip_tunnel_encap structure to ip6_tnl. Add functions for getting encap hlen, setting up encap on a tunnel, performing encapsulation operation. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-20net: Cleanup encap items in ip_tunnels.hTom Herbert
Consolidate all the ip_tunnel_encap definitions in one spot in the header file. Also, move ip_encap_hlen and ip_tunnel_encap from ip_tunnel.c to ip_tunnels.h so they call be called without a dependency on ip_tunnel module. Similarly, move iptun_encaps to ip_tunnel_core.c. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-03net: relax expensive skb_unclone() in iptunnel_handle_offloads()Eric Dumazet
Locally generated TCP GSO packets having to go through a GRE/SIT/IPIP tunnel have to go through an expensive skb_unclone() Reallocating skb->head is a lot of work. Test should really check if a 'real clone' of the packet was done. TCP does not care if the original gso_type is changed while the packet travels in the stack. This adds skb_header_unclone() which is a variant of skb_clone() using skb_header_cloned() check instead of skb_cloned(). This variant can probably be used from other points. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-23libnl: nla_put_be64(): align on a 64-bit areaNicolas Dichtel
nla_data() is now aligned on a 64-bit area. A temporary version (nla_put_be64_32bit()) is added for nla_put_net64(). This function is removed in the next patch. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-16ip_tunnel_core: iptunnel_handle_offloads returns int and doesn't free skbAlexander Duyck
This patch updates the IP tunnel core function iptunnel_handle_offloads so that we return an int and do not free the skb inside the function. This actually allows us to clean up several paths in several tunnels so that we can free the skb at one point in the path without having to have a secondary path if we are supporting tunnel offloads. In addition it should resolve some double-free issues I have found in the tunnels paths as I believe it is possible for us to end up triggering such an event in the case of fou or gue. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-06ip_tunnel: implement __iptunnel_pull_headerJiri Benc
Allow calling of iptunnel_pull_header without special casing ETH_P_TEB inner protocol. Signed-off-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-02netlink: use nla_get_in_addr and nla_put_in_addr for ipv4 addressHaishuang Yan
Since nla_get_in_addr and nla_put_in_addr were implemented, so use them appropriately. Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com> Signed-off-by: David S. Miller <davem@davemloft.net>