summaryrefslogtreecommitdiff
path: root/net/ipv6
AgeCommit message (Collapse)Author
2018-05-16ipv6/flowlabel: simplify pid namespace lookupChristoph Hellwig
The code should be using the pid namespace from the procfs mount instead of trying to look it up during open. Suggested-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-05-16ipv{4,6}/raw: simplify ѕeq_file codeChristoph Hellwig
Pass the hashtable to the proc private data instead of copying it into the per-file private data. Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-05-16ipv{4,6}/ping: simplify proc file creationChristoph Hellwig
Remove the pointless ping_seq_afinfo indirection and make the code look like most other protocols. Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-05-16ipv{4,6}/tcp: simplify procfs registrationChristoph Hellwig
Avoid most of the afinfo indirections and just call the proc helpers directly. Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-05-16ipv{4,6}/udp{,lite}: simplify proc registrationChristoph Hellwig
Remove a couple indirections to make the code look like most other protocols. Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-05-16proc: introduce proc_create_single{,_data}Christoph Hellwig
Variants of proc_create{,_data} that directly take a seq_file show callback and drastically reduces the boilerplate code in the callers. All trivial callers converted over. Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-05-14xfrm6: avoid potential infinite loop in _decode_session6()Eric Dumazet
syzbot found a way to trigger an infinitie loop by overflowing @offset variable that has been forced to use u16 for some very obscure reason in the past. We probably want to look at NEXTHDR_FRAGMENT handling which looks wrong, in a separate patch. In net-next, we shall try to use skb_header_pointer() instead of pskb_may_pull(). watchdog: BUG: soft lockup - CPU#1 stuck for 134s! [syz-executor738:4553] Modules linked in: irq event stamp: 13885653 hardirqs last enabled at (13885652): [<ffffffff878009d5>] restore_regs_and_return_to_kernel+0x0/0x2b hardirqs last disabled at (13885653): [<ffffffff87800905>] interrupt_entry+0xb5/0xf0 arch/x86/entry/entry_64.S:625 softirqs last enabled at (13614028): [<ffffffff84df0809>] tun_napi_alloc_frags drivers/net/tun.c:1478 [inline] softirqs last enabled at (13614028): [<ffffffff84df0809>] tun_get_user+0x1dd9/0x4290 drivers/net/tun.c:1825 softirqs last disabled at (13614032): [<ffffffff84df1b6f>] tun_get_user+0x313f/0x4290 drivers/net/tun.c:1942 CPU: 1 PID: 4553 Comm: syz-executor738 Not tainted 4.17.0-rc3+ #40 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:check_kcov_mode kernel/kcov.c:67 [inline] RIP: 0010:__sanitizer_cov_trace_pc+0x20/0x50 kernel/kcov.c:101 RSP: 0018:ffff8801d8cfe250 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13 RAX: ffff8801d88a8080 RBX: ffff8801d7389e40 RCX: 0000000000000006 RDX: 0000000000000000 RSI: ffffffff868da4ad RDI: ffff8801c8a53277 RBP: ffff8801d8cfe250 R08: ffff8801d88a8080 R09: ffff8801d8cfe3e8 R10: ffffed003b19fc87 R11: ffff8801d8cfe43f R12: ffff8801c8a5327f R13: 0000000000000000 R14: ffff8801c8a4e5fe R15: ffff8801d8cfe3e8 FS: 0000000000d88940(0000) GS:ffff8801daf00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffffffff600400 CR3: 00000001acab3000 CR4: 00000000001406e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: _decode_session6+0xc1d/0x14f0 net/ipv6/xfrm6_policy.c:150 __xfrm_decode_session+0x71/0x140 net/xfrm/xfrm_policy.c:2368 xfrm_decode_session_reverse include/net/xfrm.h:1213 [inline] icmpv6_route_lookup+0x395/0x6e0 net/ipv6/icmp.c:372 icmp6_send+0x1982/0x2da0 net/ipv6/icmp.c:551 icmpv6_send+0x17a/0x300 net/ipv6/ip6_icmp.c:43 ip6_input_finish+0x14e1/0x1a30 net/ipv6/ip6_input.c:305 NF_HOOK include/linux/netfilter.h:288 [inline] ip6_input+0xe1/0x5e0 net/ipv6/ip6_input.c:327 dst_input include/net/dst.h:450 [inline] ip6_rcv_finish+0x29c/0xa10 net/ipv6/ip6_input.c:71 NF_HOOK include/linux/netfilter.h:288 [inline] ipv6_rcv+0xeb8/0x2040 net/ipv6/ip6_input.c:208 __netif_receive_skb_core+0x2468/0x3650 net/core/dev.c:4646 __netif_receive_skb+0x2c/0x1e0 net/core/dev.c:4711 netif_receive_skb_internal+0x126/0x7b0 net/core/dev.c:4785 napi_frags_finish net/core/dev.c:5226 [inline] napi_gro_frags+0x631/0xc40 net/core/dev.c:5299 tun_get_user+0x3168/0x4290 drivers/net/tun.c:1951 tun_chr_write_iter+0xb9/0x154 drivers/net/tun.c:1996 call_write_iter include/linux/fs.h:1784 [inline] do_iter_readv_writev+0x859/0xa50 fs/read_write.c:680 do_iter_write+0x185/0x5f0 fs/read_write.c:959 vfs_writev+0x1c7/0x330 fs/read_write.c:1004 do_writev+0x112/0x2f0 fs/read_write.c:1039 __do_sys_writev fs/read_write.c:1112 [inline] __se_sys_writev fs/read_write.c:1109 [inline] __x64_sys_writev+0x75/0xb0 fs/read_write.c:1109 do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x49/0xbe Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com> Reported-by: syzbot+0053c8...@syzkaller.appspotmail.com Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2018-05-13Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter/IPVS fixes for net The following patchset contains Netfilter/IPVS fixes for your net tree, they are: 1) Fix handling of simultaneous open TCP connection in conntrack, from Jozsef Kadlecsik. 2) Insufficient sanitify check of xtables extension names, from Florian Westphal. 3) Skip unnecessary synchronize_rcu() call when transaction log is already empty, from Florian Westphal. 4) Incorrect destination mac validation in ebt_stp, from Stephen Hemminger. 5) xtables module reference counter leak in nft_compat, from Florian Westphal. 6) Incorrect connection reference counting logic in IPVS one-packet scheduler, from Julian Anastasov. 7) Wrong stats for 32-bits CPU in IPVS, also from Julian. 8) Calm down sparse error in netfilter core, also from Florian. 9) Use nla_strlcpy to fix compilation warning in nfnetlink_acct and nfnetlink_cthelper, again from Florian. 10) Missing module alias in icmp and icmp6 xtables extensions, from Florian Westphal. 11) Base chain statistics in nf_tables may be unset/null, from Florian. 12) Fix handling of large matchinfo size in nft_compat, this includes one preparation for before this fix. From Florian. 13) Fix bogus EBUSY error when deleting chains due to incorrect reference counting from the preparation phase of the two-phase commit protocol. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-11Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
The bpf syscall and selftests conflicts were trivial overlapping changes. The r8169 change involved moving the added mdelay from 'net' into a different function. A TLS close bug fix overlapped with the splitting of the TLS state into separate TX and RX parts. I just expanded the tests in the bug fix from "ctx->conf == X" into "ctx->tx_conf == X && ctx->rx_conf == X". Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-11erspan: auto detect truncated ipv6 packets.William Tu
Currently the truncated bit is set only when 1) the mirrored packet is larger than mtu and 2) the ipv4 packet tot_len is larger than the actual skb->len. This patch adds another case for detecting whether ipv6 packet is truncated or not, by checking the ipv6 header payload_len and the skb->len. Reported-by: Xiaoyan Jin <xiaoyanj@vmware.com> Signed-off-by: William Tu <u9012063@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-11net/ipv6: Add fib lookup stubs for use in bpf helperDavid Ahern
Add stubs to retrieve a handle to an IPv6 FIB table, fib6_get_table, a stub to do a lookup in a specific table, fib6_table_lookup, and a stub for a full route lookup. The stubs are needed for core bpf code to handle the case when the IPv6 module is not builtin. Signed-off-by: David Ahern <dsahern@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-11net/ipv6: Update fib6 tracepoint to take fib6_infoDavid Ahern
Similar to IPv4, IPv6 should use the FIB lookup result in the tracepoint. Signed-off-by: David Ahern <dsahern@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-11net/ipv6: Add fib6_lookupDavid Ahern
Add IPv6 equivalent to fib_lookup. Does a fib lookup, including rules, but returns a FIB entry, fib6_info, rather than a dst based rt6_info. fib6_lookup is any where from 140% (MULTIPLE_TABLES config disabled) to 60% faster than any of the dst based lookup methods (without custom rules) and 25% faster with custom rules (e.g., l3mdev rule). Since the lookup function has a completely different signature, fib6_rule_action is split into 2 paths: the existing one is renamed __fib6_rule_action and a new one for the fib6_info path is added. fib6_rule_action decides which to call based on the lookup_ptr. If it is fib6_table_lookup then the new path is taken. Caller must hold rcu lock as no reference is taken on the returned fib entry. Signed-off-by: David Ahern <dsahern@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-11net/ipv6: Refactor fib6_rule_actionDavid Ahern
Move source address lookup from fib6_rule_action to a helper. It will be used in a later patch by a second variant for fib6_rule_action. Signed-off-by: David Ahern <dsahern@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-11net/ipv6: Extract table lookup from ip6_pol_routeDavid Ahern
ip6_pol_route is used for ingress and egress FIB lookups. Refactor it moving the table lookup into a separate fib6_table_lookup that can be invoked separately and export the new function. ip6_pol_route now calls fib6_table_lookup and uses the result to generate a dst based rt6_info. Signed-off-by: David Ahern <dsahern@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-11net/ipv6: Rename rt6_multipath_selectDavid Ahern
Rename rt6_multipath_select to fib6_multipath_select and export it. A later patch wants access to it similar to IPv4's fib_select_path. Signed-off-by: David Ahern <dsahern@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-11net/ipv6: Rename fib6_lookup to fib6_node_lookupDavid Ahern
Rename fib6_lookup to fib6_node_lookup to better reflect what it returns. The fib6_lookup name will be used in a later patch for an IPv6 equivalent to IPv4's fib_lookup. Signed-off-by: David Ahern <dsahern@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-10tcp: Add mark for TIMEWAIT socketsJon Maxwell
This version has some suggestions by Eric Dumazet: - Use a local variable for the mark in IPv6 instead of ctl_sk to avoid SMP races. - Use the more elegant "IP4_REPLY_MARK(net, skb->mark) ?: sk->sk_mark" statement. - Factorize code as sk_fullsock() check is not necessary. Aidan McGurn from Openwave Mobility systems reported the following bug: "Marked routing is broken on customer deployment. Its effects are large increase in Uplink retransmissions caused by the client never receiving the final ACK to their FINACK - this ACK misses the mark and routes out of the incorrect route." Currently marks are added to sk_buffs for replies when the "fwmark_reflect" sysctl is enabled. But not for TW sockets that had sk->sk_mark set via setsockopt(SO_MARK..). Fix this in IPv4/v6 by adding tw->tw_mark for TIME_WAIT sockets. Copy the the original sk->sk_mark in __inet_twsk_hashdance() to the new tw->tw_mark location. Then progate this so that the skb gets sent with the correct mark. Do the same for resets. Give the "fwmark_reflect" sysctl precedence over sk->sk_mark so that netfilter rules are still honored. Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-10net/ipv6: fix lock imbalance in ip6_route_del()Eric Dumazet
WARNING: lock held when returning to user space! 4.17.0-rc3+ #37 Not tainted syz-executor1/27662 is leaving the kernel with locks still held! 1 lock held by syz-executor1/27662: #0: 00000000f661aee7 (rcu_read_lock){....}, at: ip6_route_del+0xea/0x13f0 net/ipv6/route.c:3206 BUG: scheduling while atomic: syz-executor1/27662/0x00000002 INFO: lockdep is turned off. Modules linked in: Kernel panic - not syncing: scheduling while atomic CPU: 1 PID: 27662 Comm: syz-executor1 Not tainted 4.17.0-rc3+ #37 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x1b9/0x294 lib/dump_stack.c:113 panic+0x22f/0x4de kernel/panic.c:184 __schedule_bug.cold.85+0xdf/0xdf kernel/sched/core.c:3290 schedule_debug kernel/sched/core.c:3307 [inline] __schedule+0x139e/0x1e30 kernel/sched/core.c:3412 schedule+0xef/0x430 kernel/sched/core.c:3549 exit_to_usermode_loop+0x220/0x310 arch/x86/entry/common.c:152 prepare_exit_to_usermode arch/x86/entry/common.c:196 [inline] syscall_return_slowpath arch/x86/entry/common.c:265 [inline] do_syscall_64+0x6ac/0x800 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x455979 RSP: 002b:00007fbf4051dc68 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: 0000000000000000 RBX: 00007fbf4051e6d4 RCX: 0000000000455979 RDX: 00000000200001c0 RSI: 000000000000890c RDI: 0000000000000013 RBP: 000000000072bea0 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff R13: 00000000000003c8 R14: 00000000006f9b60 R15: 0000000000000000 Dumping ftrace buffer: (ftrace buffer empty) Kernel Offset: disabled Rebooting in 86400 seconds.. Fixes: 23fb93a4d3f1 ("net/ipv6: Cleanup exception and cache route handling") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: David Ahern <dsahern@gmail.com> Reported-by: syzbot <syzkaller@googlegroups.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-10udp: fix SO_BINDTODEVICEPaolo Abeni
Damir reported a breakage of SO_BINDTODEVICE for UDP sockets. In absence of VRF devices, after commit fb74c27735f0 ("net: ipv4: add second dif to udp socket lookups") the dif mismatch isn't fatal anymore for UDP socket lookup with non null sk_bound_dev_if, breaking SO_BINDTODEVICE semantics. This changeset addresses the issue making the dif match mandatory again in the above scenario. Reported-by: Damir Mansurov <dnman@oktetlabs.ru> Fixes: fb74c27735f0 ("net: ipv4: add second dif to udp socket lookups") Fixes: 1801b570dd2a ("net: ipv6: add second dif to udp socket lookups") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-10net/udp: Update udp_encap_needed static key to modern apiDavidlohr Bueso
No changes in refcount semantics -- key init is false; replace static_key_enable with static_branch_enable static_key_slow_inc|dec with static_branch_inc|dec static_key_false with static_branch_unlikely Added a '_key' suffix to udp and udpv6 encap_needed, for better self documentation. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-08udp: Add support for software checksum and GSO_PARTIAL with GSO offloadAlexander Duyck
This patch adds support for a software provided checksum and GSO_PARTIAL segmentation support. With this we can offload UDP segmentation on devices that only have partial support for tunnels. Since we are no longer needing the hardware checksum we can drop the checks in the segmentation code that were verifying if it was present. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-08udp: Do not pass checksum as a parameter to GSO segmentationAlexander Duyck
This patch is meant to allow us to avoid having to recompute the checksum from scratch and have it passed as a parameter. Instead of taking that approach we can take advantage of the fact that the length that was used to compute the existing checksum is included in the UDP header. Finally to avoid the need to invert the result we can just call csum16_add and csum16_sub directly. By doing this we can avoid a number of instructions in the loop that is handling segmentation. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-08udp: Do not pass MSS as parameter to GSO segmentationAlexander Duyck
There is no point in passing MSS as a parameter for for the GSO segmentation call as it is already available via the shared info for the skb itself. Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-08netfilter: x_tables: add module alias for icmp matchesFlorian Westphal
The icmp matches are implemented in ip_tables and ip6_tables, respectively, so for normal iptables they are always available: those modules are loaded once iptables calls getsockopt() to fetch available module revisions. In iptables-over-nftables case probing occurs via nfnetlink, so these modules might not be loaded. Add aliases so modprobe can load these when icmp/icmp6 is requested. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-05-08trivial: fix inconsistent help textsGeorg Hofmann
This patch removes "experimental" from the help text where depends on CONFIG_EXPERIMENTAL was already removed. Signed-off-by: Georg Hofmann <georg@hofmannsweb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-07Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Minor conflict in ip_output.c, overlapping changes to the body of an if() statement. Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-07Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2018-05-07 1) Always verify length of provided sadb_key to fix a slab-out-of-bounds read in pfkey_add. From Kevin Easton. 2) Make sure that all states are really deleted before we check that the state lists are empty. Otherwise we trigger a warning. 3) Fix MTU handling of the VTI6 interfaces on interfamily tunnels. From Stefano Brivio. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-07net: ipv6/gre: Add GRO supportEran Ben Elisha
Add GRO capability for IPv6 GRE tunnel and ip6erspan tap, via gro_cells infrastructure. Performance testing: 55% higher badwidth. Measuring bandwidth of 1 thread IPv4 TCP traffic over IPv6 GRE tunnel while GRO on the physical interface is disabled. CPU: Intel Xeon E312xx (Sandy Bridge) NIC: Mellanox Technologies MT27700 Family [ConnectX-4] Before (GRO not working in tunnel) : 2.47 Gbits/sec After (GRO working in tunnel) : 3.85 Gbits/sec Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> CC: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-07net: ipv6: Fix typo in ipv6_find_hdr() documentationTariq Toukan
Fix 'an' into 'and', and use a comma instead of a period. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-06Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter/IPVS updates for net-next The following patchset contains Netfilter/IPVS updates for your net-next tree, more relevant updates in this batch are: 1) Add Maglev support to IPVS. Moreover, store lastest server weight in IPVS since this is needed by maglev, patches from from Inju Song. 2) Preparation works to add iptables flowtable support, patches from Felix Fietkau. 3) Hand over flows back to conntrack slow path in case of TCP RST/FIN packet is seen via new teardown state, also from Felix. 4) Add support for extended netlink error reporting for nf_tables. 5) Support for larger timeouts that 23 days in nf_tables, patch from Florian Westphal. 6) Always set an upper limit to dynamic sets, also from Florian. 7) Allow number generator to make map lookups, from Laura Garcia. 8) Use hash_32() instead of opencode hashing in IPVS, from Vicent Bernat. 9) Extend ip6tables SRH match to support previous, next and last SID, from Ahmed Abdelsalam. 10) Move Passive OS fingerprint nf_osf.c, from Fernando Fernandez. 11) Expose nf_conntrack_max through ctnetlink, from Florent Fourcot. 12) Several housekeeping patches for xt_NFLOG, x_tables and ebtables, from Taehee Yoo. 13) Unify meta bridge with core nft_meta, then make nft_meta built-in. Make rt and exthdr built-in too, again from Florian. 14) Missing initialization of tbl->entries in IPVS, from Cong Wang. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-06netfilter: nf_nat: remove unused ct arg from lookup functionsFlorian Westphal
Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-05-06netfilter: ip6t_srh: extend SRH matching for previous, next and last SIDAhmed Abdelsalam
IPv6 Segment Routing Header (SRH) contains a list of SIDs to be crossed by SR encapsulated packet. Each SID is encoded as an IPv6 prefix. When a Firewall receives an SR encapsulated packet, it should be able to identify which node previously processed the packet (previous SID), which node is going to process the packet next (next SID), and which node is the last to process the packet (last SID) which represent the final destination of the packet in case of inline SR mode. An example use-case of using these features could be SID list that includes two firewalls. When the second firewall receives a packet, it can check whether the packet has been processed by the first firewall or not. Based on that check, it decides to apply all rules, apply just subset of the rules, or totally skip all rules and forward the packet to the next SID. This patch extends SRH match to support matching previous SID, next SID, and last SID. Signed-off-by: Ahmed Abdelsalam <amsalam20@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-05-04net/ipv6: rename rt6_next to fib6_nextDavid Ahern
This slipped through the cracks in the followup set to the fib6_info flip. Rename rt6_next to fib6_next. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-04Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Overlapping changes in selftests Makefile. Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-03ip6_gre: correct the function name in ip6gre_tnl_addr_conflict() commentSun Lianwen
The function name is wrong in ip6gre_tnl_addr_conflict() comment, which use ip6_tnl_addr_conflict instead of ip6gre_tnl_addr_conflict. Signed-off-by: Sun Lianwen <sunlw.fnst@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-02ipv6: Revert "ipv6: Allow non-gateway ECMP for IPv6"Ido Schimmel
This reverts commit edd7ceb78296 ("ipv6: Allow non-gateway ECMP for IPv6"). Eric reported a division by zero in rt6_multipath_rebalance() which is caused by above commit that considers identical local routes to be siblings. The division by zero happens because a nexthop weight is not set for local routes. Revert the commit as it does not fix a bug and has side effects. To reproduce: # ip -6 address add 2001:db8::1/64 dev dummy0 # ip -6 address add 2001:db8::1/64 dev dummy1 Fixes: edd7ceb78296 ("ipv6: Allow non-gateway ECMP for IPv6") Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reported-by: Eric Dumazet <eric.dumazet@gmail.com> Tested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-01ipv6: Allow non-gateway ECMP for IPv6Thomas Winter
It is valid to have static routes where the nexthop is an interface not an address such as tunnels. For IPv4 it was possible to use ECMP on these routes but not for IPv6. Signed-off-by: Thomas Winter <Thomas.Winter@alliedtelesis.co.nz> Cc: David Ahern <dsahern@gmail.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-01udp: disable gso with no_check_txWillem de Bruijn
Syzbot managed to send a udp gso packet without checksum offload into the gso stack by disabling tx checksum (UDP_NO_CHECK6_TX). This triggered the skb_warn_bad_offload. RIP: 0010:skb_warn_bad_offload+0x2bc/0x600 net/core/dev.c:2658 skb_gso_segment include/linux/netdevice.h:4038 [inline] validate_xmit_skb+0x54d/0xd90 net/core/dev.c:3120 __dev_queue_xmit+0xbf8/0x34c0 net/core/dev.c:3577 dev_queue_xmit+0x17/0x20 net/core/dev.c:3618 UDP_NO_CHECK6_TX sets skb->ip_summed to CHECKSUM_NONE just after the udp gso integrity checks in udp_(v6_)send_skb. Extend those checks to catch and fail in this case. After the integrity checks jump directly to the CHECKSUM_PARTIAL case to avoid reading the no_check_tx flags again (a TOCTTOU race). Fixes: bec1f6f69736 ("udp: generate gso with UDP_SEGMENT") Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-01ipv6: fix uninit-value in ip6_multipath_l3_keys()Eric Dumazet
syzbot/KMSAN reported an uninit-value in ip6_multipath_l3_keys(), root caused to a bad assumption of ICMP header being already pulled in skb->head ip_multipath_l3_keys() does the correct thing, so it is an IPv6 only bug. BUG: KMSAN: uninit-value in ip6_multipath_l3_keys net/ipv6/route.c:1830 [inline] BUG: KMSAN: uninit-value in rt6_multipath_hash+0x5c4/0x640 net/ipv6/route.c:1858 CPU: 0 PID: 4507 Comm: syz-executor661 Not tainted 4.16.0+ #87 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:17 [inline] dump_stack+0x185/0x1d0 lib/dump_stack.c:53 kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067 __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:683 ip6_multipath_l3_keys net/ipv6/route.c:1830 [inline] rt6_multipath_hash+0x5c4/0x640 net/ipv6/route.c:1858 ip6_route_input+0x65a/0x920 net/ipv6/route.c:1884 ip6_rcv_finish+0x413/0x6e0 net/ipv6/ip6_input.c:69 NF_HOOK include/linux/netfilter.h:288 [inline] ipv6_rcv+0x1e16/0x2340 net/ipv6/ip6_input.c:208 __netif_receive_skb_core+0x47df/0x4a90 net/core/dev.c:4562 __netif_receive_skb net/core/dev.c:4627 [inline] netif_receive_skb_internal+0x49d/0x630 net/core/dev.c:4701 netif_receive_skb+0x230/0x240 net/core/dev.c:4725 tun_rx_batched drivers/net/tun.c:1555 [inline] tun_get_user+0x740f/0x7c60 drivers/net/tun.c:1962 tun_chr_write_iter+0x1d4/0x330 drivers/net/tun.c:1990 call_write_iter include/linux/fs.h:1782 [inline] new_sync_write fs/read_write.c:469 [inline] __vfs_write+0x7fb/0x9f0 fs/read_write.c:482 vfs_write+0x463/0x8d0 fs/read_write.c:544 SYSC_write+0x172/0x360 fs/read_write.c:589 SyS_write+0x55/0x80 fs/read_write.c:581 do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x3d/0xa2 Fixes: 23aebdacb05d ("ipv6: Compute multipath hash for ICMP errors from offending packet") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Cc: Jakub Sitnicki <jkbs@redhat.com> Acked-by: Jakub Sitnicki <jkbs@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-01change the comment of vti6_ioctlSun Lianwen
The comment of vti6_ioctl() is wrong. which use vti6_tnl_ioctl instead of vti6_ioctl. Signed-off-by: Sun Lianwen <sunlw.fnst@cn.fujitsu.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2018-04-30ipv6: sr: extract the right key values for "seg6_make_flowlabel"Ahmed Abdelsalam
The seg6_make_flowlabel() is used by seg6_do_srh_encap() to compute the flowlabel from a given skb. It relies on skb_get_hash() which eventually calls __skb_flow_dissect() to extract the flow_keys struct values from the skb. In case of IPv4 traffic, calling seg6_make_flowlabel() after skb_push(), skb_reset_network_header(), and skb_mac_header_rebuild() will results in flow_keys struct of all key values set to zero. This patch calls seg6_make_flowlabel() before resetting the headers of skb to get the right key values. Extracted Key values are based on the type inner packet as follows: 1) IPv6 traffic: src_IP, dst_IP, L4 proto, and flowlabel of inner packet. 2) IPv4 traffic: src_IP, dst_IP, L4 proto, src_port, and dst_port 3) L2 traffic: depends on what kind of traffic carried into the L2 frame. IPv6 and IPv4 traffic works as discussed 1) and 2) Here a hex_dump of struct flow_keys for IPv4 and IPv6 traffic 10.100.1.100: 47302 > 30.0.0.2: 5001 00000000: 14 00 02 00 00 00 00 00 08 00 11 00 00 00 00 00 00000010: 00 00 00 00 00 00 00 00 13 89 b8 c6 1e 00 00 02 00000020: 0a 64 01 64 fc00:a1:a > b2::2 00000000: 28 00 03 00 00 00 00 00 86 dd 11 00 99 f9 02 00 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 b2 00 00 00000020: 00 00 00 00 00 00 00 00 00 00 00 02 fc 00 00 a1 00000030: 00 00 00 00 00 00 00 00 00 00 00 0a Signed-off-by: Ahmed Abdelsalam <amsalam20@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-30erspan: auto detect truncated packets.William Tu
Currently the truncated bit is set only when the mirrored packet is larger than mtu. For certain cases, the packet might already been truncated before sending to the erspan tunnel. In this case, the patch detect whether the IP header's total length is larger than the actual skb->len. If true, this indicated that the mirrored packet is truncated and set the erspan truncate bit. I tested the patch using bpf_skb_change_tail helper function to shrink the packet size and send to erspan tunnel. Reported-by: Xiaoyan Jin <xiaoyanj@vmware.com> Signed-off-by: William Tu <u9012063@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-29tcp: add TCP_ZEROCOPY_RECEIVE support for zerocopy receiveEric Dumazet
When adding tcp mmap() implementation, I forgot that socket lock had to be taken before current->mm->mmap_sem. syzbot eventually caught the bug. Since we can not lock the socket in tcp mmap() handler we have to split the operation in two phases. 1) mmap() on a tcp socket simply reserves VMA space, and nothing else. This operation does not involve any TCP locking. 2) getsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, ...) implements the transfert of pages from skbs to one VMA. This operation only uses down_read(&current->mm->mmap_sem) after holding TCP lock, thus solving the lockdep issue. This new implementation was suggested by Andy Lutomirski with great details. Benefits are : - Better scalability, in case multiple threads reuse VMAS (without mmap()/munmap() calls) since mmap_sem wont be write locked. - Better error recovery. The previous mmap() model had to provide the expected size of the mapping. If for some reason one part could not be mapped (partial MSS), the whole operation had to be aborted. With the tcp_zerocopy_receive struct, kernel can report how many bytes were successfuly mapped, and how many bytes should be read to skip the problematic sequence. - No more memory allocation to hold an array of page pointers. 16 MB mappings needed 32 KB for this array, potentially using vmalloc() :/ - skbs are freed while mmap_sem has been released Following patch makes the change in tcp_mmap tool to demonstrate one possible use of mmap() and setsockopt(... TCP_ZEROCOPY_RECEIVE ...) Note that memcg might require additional changes. Fixes: 93ab6cc69162 ("tcp: implement mmap() for zero copy receive") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Suggested-by: Andy Lutomirski <luto@kernel.org> Cc: linux-mm@kvack.org Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-27vti6: Change minimum MTU to IPV4_MIN_MTU, vti6 can carry IPv4 tooStefano Brivio
A vti6 interface can carry IPv4 as well, so it makes no sense to enforce a minimum MTU of IPV6_MIN_MTU. If the user sets an MTU below IPV6_MIN_MTU, IPv6 will be disabled on the interface, courtesy of addrconf_notify(). Reported-by: Xin Long <lucien.xin@gmail.com> Fixes: b96f9afee4eb ("ipv4/6: use core net MTU range checking") Fixes: c6741fbed6dc ("vti6: Properly adjust vti6 MTU from MTU of lower device") Fixes: 53c81e95df17 ("ip6_vti: adjust vti mtu according to mtu of lower device") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2018-04-26udp: add gso segment cmsgWillem de Bruijn
Allow specifying segment size in the send call. The new control message performs the same function as socket option UDP_SEGMENT while avoiding the extra system call. [ Export udp_cmsg_send for ipv6. -DaveM ] Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-26udp: paged allocation with gsoWillem de Bruijn
When sending large datagrams that are later segmented, store data in page frags to avoid copying from linear in skb_segment. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-26udp: generate gso with UDP_SEGMENTWillem de Bruijn
Support generic segmentation offload for udp datagrams. Callers can concatenate and send at once the payload of multiple datagrams with the same destination. To set segment size, the caller sets socket option UDP_SEGMENT to the length of each discrete payload. This value must be smaller than or equal to the relevant MTU. A follow-up patch adds cmsg UDP_SEGMENT to specify segment size on a per send call basis. Total byte length may then exceed MTU. If not an exact multiple of segment size, the last segment will be shorter. The implementation adds a gso_size field to the udp socket, ip(v6) cmsg cookie and inet_cork structure to be able to set the value at setsockopt or cmsg time and to work with both lockless and corked paths. Initial benchmark numbers show UDP GSO about as expensive as TCP GSO. tcp tso 3197 MB/s 54232 msg/s 54232 calls/s 6,457,754,262 cycles tcp gso 1765 MB/s 29939 msg/s 29939 calls/s 11,203,021,806 cycles tcp without tso/gso * 739 MB/s 12548 msg/s 12548 calls/s 11,205,483,630 cycles udp 876 MB/s 14873 msg/s 624666 calls/s 11,205,777,429 cycles udp gso 2139 MB/s 36282 msg/s 36282 calls/s 11,204,374,561 cycles [*] after reverting commit 0a6b2a1dc2a2 ("tcp: switch to GSO being always on") Measured total system cycles ('-a') for one core while pinning both the network receive path and benchmark process to that core: perf stat -a -C 12 -e cycles \ ./udpgso_bench_tx -C 12 -4 -D "$DST" -l 4 Note the reduction in calls/s with GSO. Bytes per syscall drops increases from 1470 to 61818. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-26udp: add udp gsoWillem de Bruijn
Implement generic segmentation offload support for udp datagrams. A follow-up patch adds support to the protocol stack to generate such packets. UDP GSO is not UFO. UFO fragments a single large datagram. GSO splits a large payload into a number of discrete UDP datagrams. The implementation adds a GSO type SKB_UDP_GSO_L4 to differentiate it from UFO (SKB_UDP_GSO). IPPROTO_UDPLITE is excluded, as that protocol has no gso handler registered. [ Export __udp_gso_segment for ipv6. -DaveM ] Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-26udp: expose inet cork to udpWillem de Bruijn
UDP segmentation offload needs access to inet_cork in the udp layer. Pass the struct to ip(6)_make_skb instead of allocating it on the stack in that function itself. This patch is a noop otherwise. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>