summaryrefslogtreecommitdiff
path: root/drivers/net/ipvlan/ipvlan_core.c
AgeCommit message (Collapse)Author
2021-01-28ipvlan: remove h from printk format specifierTom Rix
This change fixes the checkpatch warning described in this commit commit cbacb5ab0aa0 ("docs: printk-formats: Stop encouraging use of unnecessary %h[xudi] and %hh[xudi]") Standard integer promotion is already done and %hx and %hhx is useless so do not encourage the use of %hh[xudi] or %h[xudi]. Cleanup output to use __func__ over explicit function strings. Signed-off-by: Tom Rix <trix@redhat.com> Link: https://lore.kernel.org/r/20210124190804.1964580-1-trix@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-03-09ipvlan: do not use cond_resched_rcu() in ipvlan_process_multicast()Eric Dumazet
Commit e18b353f102e ("ipvlan: add cond_resched_rcu() while processing muticast backlog") added a cond_resched_rcu() in a loop using rcu protection to iterate over slaves. This is breaking rcu rules, so lets instead use cond_resched() at a point we can reschedule Fixes: e18b353f102e ("ipvlan: add cond_resched_rcu() while processing muticast backlog") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Mahesh Bandewar <maheshb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-09ipvlan: add cond_resched_rcu() while processing muticast backlogMahesh Bandewar
If there are substantial number of slaves created as simulated by Syzbot, the backlog processing could take much longer and result into the issue found in the Syzbot report. INFO: rcu_sched detected stalls on CPUs/tasks: (detected by 1, t=10502 jiffies, g=5049, c=5048, q=752) All QSes seen, last rcu_sched kthread activity 10502 (4294965563-4294955061), jiffies_till_next_fqs=1, root ->qsmask 0x0 syz-executor.1 R running task on cpu 1 10984 11210 3866 0x30020008 179034491270 Call Trace: <IRQ> [<ffffffff81497163>] _sched_show_task kernel/sched/core.c:8063 [inline] [<ffffffff81497163>] _sched_show_task.cold+0x2fd/0x392 kernel/sched/core.c:8030 [<ffffffff8146a91b>] sched_show_task+0xb/0x10 kernel/sched/core.c:8073 [<ffffffff815c931b>] print_other_cpu_stall kernel/rcu/tree.c:1577 [inline] [<ffffffff815c931b>] check_cpu_stall kernel/rcu/tree.c:1695 [inline] [<ffffffff815c931b>] __rcu_pending kernel/rcu/tree.c:3478 [inline] [<ffffffff815c931b>] rcu_pending kernel/rcu/tree.c:3540 [inline] [<ffffffff815c931b>] rcu_check_callbacks.cold+0xbb4/0xc29 kernel/rcu/tree.c:2876 [<ffffffff815e3962>] update_process_times+0x32/0x80 kernel/time/timer.c:1635 [<ffffffff816164f0>] tick_sched_handle+0xa0/0x180 kernel/time/tick-sched.c:161 [<ffffffff81616ae4>] tick_sched_timer+0x44/0x130 kernel/time/tick-sched.c:1193 [<ffffffff815e75f7>] __run_hrtimer kernel/time/hrtimer.c:1393 [inline] [<ffffffff815e75f7>] __hrtimer_run_queues+0x307/0xd90 kernel/time/hrtimer.c:1455 [<ffffffff815e90ea>] hrtimer_interrupt+0x2ea/0x730 kernel/time/hrtimer.c:1513 [<ffffffff844050f4>] local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1031 [inline] [<ffffffff844050f4>] smp_apic_timer_interrupt+0x144/0x5e0 arch/x86/kernel/apic/apic.c:1056 [<ffffffff84401cbe>] apic_timer_interrupt+0x8e/0xa0 arch/x86/entry/entry_64.S:778 RIP: 0010:do_raw_read_lock+0x22/0x80 kernel/locking/spinlock_debug.c:153 RSP: 0018:ffff8801dad07ab8 EFLAGS: 00000a02 ORIG_RAX: ffffffffffffff12 RAX: 0000000000000000 RBX: ffff8801c4135680 RCX: 0000000000000000 RDX: 1ffff10038826afe RSI: ffff88019d816bb8 RDI: ffff8801c41357f0 RBP: ffff8801dad07ac0 R08: 0000000000004b15 R09: 0000000000310273 R10: ffff88019d816bb8 R11: 0000000000000001 R12: ffff8801c41357e8 R13: 0000000000000000 R14: ffff8801cfb19850 R15: ffff8801cfb198b0 [<ffffffff8101460e>] __raw_read_lock_bh include/linux/rwlock_api_smp.h:177 [inline] [<ffffffff8101460e>] _raw_read_lock_bh+0x3e/0x50 kernel/locking/spinlock.c:240 [<ffffffff840d78ca>] ipv6_chk_mcast_addr+0x11a/0x6f0 net/ipv6/mcast.c:1006 [<ffffffff84023439>] ip6_mc_input+0x319/0x8e0 net/ipv6/ip6_input.c:482 [<ffffffff840211c8>] dst_input include/net/dst.h:449 [inline] [<ffffffff840211c8>] ip6_rcv_finish+0x408/0x610 net/ipv6/ip6_input.c:78 [<ffffffff840214de>] NF_HOOK include/linux/netfilter.h:292 [inline] [<ffffffff840214de>] NF_HOOK include/linux/netfilter.h:286 [inline] [<ffffffff840214de>] ipv6_rcv+0x10e/0x420 net/ipv6/ip6_input.c:278 [<ffffffff83a29efa>] __netif_receive_skb_one_core+0x12a/0x1f0 net/core/dev.c:5303 [<ffffffff83a2a15c>] __netif_receive_skb+0x2c/0x1b0 net/core/dev.c:5417 [<ffffffff83a2f536>] process_backlog+0x216/0x6c0 net/core/dev.c:6243 [<ffffffff83a30d1b>] napi_poll net/core/dev.c:6680 [inline] [<ffffffff83a30d1b>] net_rx_action+0x47b/0xfb0 net/core/dev.c:6748 [<ffffffff846002c8>] __do_softirq+0x2c8/0x99a kernel/softirq.c:317 [<ffffffff813e656a>] invoke_softirq kernel/softirq.c:399 [inline] [<ffffffff813e656a>] irq_exit+0x16a/0x1a0 kernel/softirq.c:439 [<ffffffff84405115>] exiting_irq arch/x86/include/asm/apic.h:561 [inline] [<ffffffff84405115>] smp_apic_timer_interrupt+0x165/0x5e0 arch/x86/kernel/apic/apic.c:1058 [<ffffffff84401cbe>] apic_timer_interrupt+0x8e/0xa0 arch/x86/entry/entry_64.S:778 </IRQ> RIP: 0010:__sanitizer_cov_trace_pc+0x26/0x50 kernel/kcov.c:102 RSP: 0018:ffff880196033bd8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff12 RAX: ffff88019d8161c0 RBX: 00000000ffffffff RCX: ffffc90003501000 RDX: 0000000000000002 RSI: ffffffff816236d1 RDI: 0000000000000005 RBP: ffff880196033bd8 R08: ffff88019d8161c0 R09: 0000000000000000 R10: 1ffff10032c067f0 R11: 0000000000000000 R12: 0000000000000000 R13: 0000000000000080 R14: 0000000000000000 R15: 0000000000000000 [<ffffffff816236d1>] do_futex+0x151/0x1d50 kernel/futex.c:3548 [<ffffffff816260f0>] C_SYSC_futex kernel/futex_compat.c:201 [inline] [<ffffffff816260f0>] compat_SyS_futex+0x270/0x3b0 kernel/futex_compat.c:175 [<ffffffff8101da17>] do_syscall_32_irqs_on arch/x86/entry/common.c:353 [inline] [<ffffffff8101da17>] do_fast_syscall_32+0x357/0xe1c arch/x86/entry/common.c:415 [<ffffffff84401a9b>] entry_SYSENTER_compat+0x8b/0x9d arch/x86/entry/entry_64_compat.S:139 RIP: 0023:0xf7f23c69 RSP: 002b:00000000f5d1f12c EFLAGS: 00000282 ORIG_RAX: 00000000000000f0 RAX: ffffffffffffffda RBX: 000000000816af88 RCX: 0000000000000080 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000000816af8c RBP: 00000000f5d1f228 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 rcu_sched kthread starved for 10502 jiffies! g5049 c5048 f0x2 RCU_GP_WAIT_FQS(3) ->state=0x0 ->cpu=1 rcu_sched R running task on cpu 1 13048 8 2 0x90000000 179099587640 Call Trace: [<ffffffff8147321f>] context_switch+0x60f/0xa60 kernel/sched/core.c:3209 [<ffffffff8100095a>] __schedule+0x5aa/0x1da0 kernel/sched/core.c:3934 [<ffffffff810021df>] schedule+0x8f/0x1b0 kernel/sched/core.c:4011 [<ffffffff8101116d>] schedule_timeout+0x50d/0xee0 kernel/time/timer.c:1803 [<ffffffff815c13f1>] rcu_gp_kthread+0xda1/0x3b50 kernel/rcu/tree.c:2327 [<ffffffff8144b318>] kthread+0x348/0x420 kernel/kthread.c:246 [<ffffffff84400266>] ret_from_fork+0x56/0x70 arch/x86/entry/entry_64.S:393 Fixes: ba35f8588f47 (“ipvlan: Defer multicast / broadcast processing to a work-queue”) Signed-off-by: Mahesh Bandewar <maheshb@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-09ipvlan: don't deref eth hdr before checking it's setMahesh Bandewar
IPvlan in L3 mode discards outbound multicast packets but performs the check before ensuring the ether-header is set or not. This is an error that Eric found through code browsing. Fixes: 2ad7bf363841 (“ipvlan: Initial check-in of the IPVLAN driver.”) Signed-off-by: Mahesh Bandewar <maheshb@google.com> Reported-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-30treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152Thomas Gleixner
Based on 1 normalized pattern(s): this program is free software you can redistribute it and or modify it under the terms of the gnu general public license as published by the free software foundation either version 2 of the license or at your option any later version extracted by the scancode license scanner the SPDX license identifier GPL-2.0-or-later has been chosen to replace the boilerplate/reference in 3029 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Allison Randal <allison@lohutok.net> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-08ipvlan: decouple l3s mode dependencies from other modesDaniel Borkmann
Right now ipvlan has a hard dependency on CONFIG_NETFILTER and otherwise it cannot be built. However, the only ipvlan operation mode that actually depends on netfilter is l3s, everything else is independent of it. Break this hard dependency such that users are able to use ipvlan l3 mode on systems where netfilter is not compiled in. Therefore, this adds a hidden CONFIG_IPVLAN_L3S bool which is defaulting to y when CONFIG_NETFILTER is set in order to retain existing behavior for l3s. All l3s related code is refactored into ipvlan_l3s.c that is compiled in when enabled. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Mahesh Bandewar <maheshb@google.com> Cc: Florian Westphal <fw@strlen.de> Cc: Martynas Pumputis <m@lambda.lt> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-04net/ipv6: Pass skb to route lookupDavid Ahern
IPv6 does path selection for multipath routes deep in the lookup functions. The next patch adds L4 hash option and needs the skb for the forward path. To get the skb to the relevant FIB lookup functions it needs to go through the fib rules layer, so add a lookup_data argument to the fib_lookup_arg struct. Signed-off-by: David Ahern <dsahern@gmail.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-28ipvlan: use per device spinlock to protect addrs list updatesPaolo Abeni
This changeset moves ipvlan address under RCU protection, using a per ipvlan device spinlock to protect list mutation and RCU read access to protect list traversal. Also explicitly use RCU read lock to traverse the per port ipvlans list, so that we can now perform a full address lookup without asserting the RTNL lock. Overall this allows the ipvlan driver to check fully for duplicate addresses - before this commit ipv6 addresses assigned by autoconf via prefix delegation where accepted without any check - and avoid the following rntl assertion failure still in the same code path: RTNL: assertion failed at drivers/net/ipvlan/ipvlan_core.c (124) WARNING: CPU: 15 PID: 0 at drivers/net/ipvlan/ipvlan_core.c:124 ipvlan_addr_busy+0x97/0xa0 [ipvlan] Modules linked in: ipvlan(E) ixgbe CPU: 15 PID: 0 Comm: swapper/15 Tainted: G E 4.16.0-rc2.ipvlan+ #1782 Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.1.7 06/16/2016 RIP: 0010:ipvlan_addr_busy+0x97/0xa0 [ipvlan] RSP: 0018:ffff881ff9e03768 EFLAGS: 00010286 RAX: 0000000000000000 RBX: ffff881fdf2a9000 RCX: 0000000000000000 RDX: 0000000000000001 RSI: 00000000000000f6 RDI: 0000000000000300 RBP: ffff881fdf2a8000 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000001 R11: ffff881ff9e034c0 R12: ffff881fe07bcc00 R13: 0000000000000001 R14: ffffffffa02002b0 R15: 0000000000000001 FS: 0000000000000000(0000) GS:ffff881ff9e00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fc5c1a4f248 CR3: 000000207e012005 CR4: 00000000001606e0 Call Trace: <IRQ> ipvlan_addr6_event+0x6c/0xd0 [ipvlan] notifier_call_chain+0x49/0x90 atomic_notifier_call_chain+0x6a/0x100 ipv6_add_addr+0x5f9/0x720 addrconf_prefix_rcv_add_addr+0x244/0x3c0 addrconf_prefix_rcv+0x2f3/0x790 ndisc_router_discovery+0x633/0xb70 ndisc_rcv+0x155/0x180 icmpv6_rcv+0x4ac/0x5f0 ip6_input_finish+0x138/0x6a0 ip6_input+0x41/0x1f0 ipv6_rcv+0x4db/0x8d0 __netif_receive_skb_core+0x3d5/0xe40 netif_receive_skb_internal+0x89/0x370 napi_gro_receive+0x14f/0x1e0 ixgbe_clean_rx_irq+0x4ce/0x1020 [ixgbe] ixgbe_poll+0x31a/0x7a0 [ixgbe] net_rx_action+0x296/0x4f0 __do_softirq+0xcf/0x4f5 irq_exit+0xf5/0x110 do_IRQ+0x62/0x110 common_interrupt+0x91/0x91 </IRQ> v1 -> v2: drop unneeded in_softirq check in ipvlan_addr6_validator_event() Fixes: e9997c2938b2 ("ipvlan: fix check for IP addresses in control path") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-28ipvlan: egress mcast packets are not exceptionalPaolo Abeni
Currently, if IPv6 is enabled on top of an ipvlan device in l3 mode, the following warning message: Dropped {multi|broad}cast of type= [86dd] is emitted every time that a RS is generated and dmseg is soon filled with irrelevant messages. Replace pr_warn with pr_debug, to preserve debuggability, without scaring the sysadmin. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-21ipvlan: drop ipv6 dependencyMatteo Croce
IPVlan has an hard dependency on IPv6, refactor the ipvlan code to allow compiling it with IPv6 disabled, move duplicate code into addr_equal() and refactor series of if-else into a switch. Signed-off-by: Matteo Croce <mcroce@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-15ipvlan: remove excessive packet scrubbingMahesh Bandewar
IPvlan currently scrubs packets at every location where packets may be crossing namespace boundary. Though this is desirable, currently IPvlan does it more than necessary. e.g. packets that are going to take dev_forward_skb() path will get scrubbed so no point in scrubbing them before forwarding. Another side-effect of scrubbing is that pkt-type gets set to PACKET_HOST which overrides what was already been set by the earlier path making erroneous delivery of the packets. Also scrubbing packets just before calling dev_queue_xmit() has detrimental effects since packets lose skb->sk and because of that miss prio updates, incorrect socket back-pressure and would even break TSQ. Fixes: b93dd49c1a35 ('ipvlan: Scrub skb before crossing the namespace boundary') Signed-off-by: Mahesh Bandewar <maheshb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-15Revert "ipvlan: add L2 check for packets arriving via virtual devices"Mahesh Bandewar
This reverts commit 92ff42645028fa6f9b8aa767718457b9264316b4. Even though the check added is not that taxing, it's not really needed. First of all this will be per packet cost and second thing is that the eth_type_trans() already does this correctly. The excessive scrubbing in IPvlan was changing the pkt-type skb metadata of the packet which made it necessary to re-check the mac. The subsequent patch in this series removes the faulty packet-scrub. Signed-off-by: Mahesh Bandewar <maheshb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-11ipvlan: add L2 check for packets arriving via virtual devicesMahesh Bandewar
Packets that don't have dest mac as the mac of the master device should not be entertained by the IPvlan rx-handler. This is mostly true as the packet path mostly takes care of that, except when the master device is a virtual device. As demonstrated in the following case - ip netns add ns1 ip link add ve1 type veth peer name ve2 ip link add link ve2 name iv1 type ipvlan mode l2 ip link set dev iv1 netns ns1 ip link set ve1 up ip link set ve2 up ip -n ns1 link set iv1 up ip addr add 192.168.10.1/24 dev ve1 ip -n ns1 addr 192.168.10.2/24 dev iv1 ping -c2 192.168.10.2 <Works!> ip neigh show dev ve1 ip neigh show 192.168.10.2 lladdr <random> dev ve1 ping -c2 192.168.10.2 <Still works! Wrong!!> This patch adds that missing check in the IPvlan rx-handler. Reported-by: Amit Sikka <amit.sikka@ericsson.com> Signed-off-by: Mahesh Bandewar <maheshb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-06ipvlan: Eliminate duplicated codes with existing functionGao Feng
The recv flow of ipvlan l2 mode performs as same as l3 mode for non-multicast packet, so use the existing func ipvlan_handle_mode_l3 instead of these duplicated statements in non-multicast case. Signed-off-by: Gao Feng <gfree.wind@vip.163.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-03ipvlan: Add the skb->mark as flow4's member to lookup routeGao Feng
Current codes don't use skb->mark to assign flowi4_mark, it would make the policy route rule with fwmark doesn't work as expected. Signed-off-by: Gao Feng <gfree.wind@vip.163.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-24ipvlan: Fix insufficient skb linear check for ipv6 icmpGao Feng
In the function ipvlan_get_L3_hdr, current codes use pskb_may_pull to make sure the skb header has enough linear room for ipv6 header. But it would use the latter memory directly without linear check when it is icmp. So it still may access the unepxected memory in ipvlan_addr_lookup. Now invoke the pskb_may_pull again if it is ipv6 icmp. Signed-off-by: Gao Feng <gfree.wind@vip.163.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-24ipvlan: Fix insufficient skb linear check for arpGao Feng
In the function ipvlan_get_L3_hdr, current codes use pskb_may_pull to make sure the skb header has enough linear room for arp header. But it would access the arp payload in func ipvlan_addr_lookup. So it still may access the unepxected memory. Now use arp_hdr_len(port->dev) instead of the arp header as the param. Signed-off-by: Gao Feng <gfree.wind@vip.163.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-11ipvlan: fix ipv6 outbound deviceKeefe Liu
When process the outbound packet of ipv6, we should assign the master device to output device other than input device. Signed-off-by: Keefe Liu <liuqifa@huawei.com> Acked-by: Mahesh Bandewar <maheshb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-29ipvlan: implement VEPA modeMahesh Bandewar
This is very similar to the Macvlan VEPA mode, however, there is some difference. IPvlan uses the mac-address of the lower device, so the VEPA mode has implications of ICMP-redirects for packets destined for its immediate neighbors sharing same master since the packets will have same source and dest mac. The external switch/router will send redirect msg. Having said that, this will be useful tool in terms of debugging since IPvlan will not switch packets within its slaves and rely completely on the external entity as intended in 802.1Qbg. Signed-off-by: Mahesh Bandewar <maheshb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-29ipvlan: introduce 'private' attribute for all existing modes.Mahesh Bandewar
IPvlan has always operated in bridge mode. However there are scenarios where each slave should be able to talk through the master device but not necessarily across each other. Think of an environment where each of a namespace is a private and independant customer. In this scenario the machine which is hosting these namespaces neither want to tell who their neighbor is nor the individual namespaces care to talk to neighbor on short-circuited network path. This patch implements the mode that is very similar to the 'private' mode in macvlan where individual slaves can send and receive traffic through the master device, just that they can not talk among slave devices. Signed-off-by: Mahesh Bandewar <maheshb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-11ipvtap: IP-VLAN based tap driverSainath Grandhi
This patch adds a tap character device driver that is based on the IP-VLAN network interface, called ipvtap. An ipvtap device can be created in the same way as an ipvlan device, using 'type ipvtap', and then accessed using the tap user space interface. Signed-off-by: Sainath Grandhi <sainath.grandhi@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-28driver: ipvlan: Remove unnecessary ipvlan NULL check in ipvlan_count_rxGao Feng
There are three functions which would invoke the ipvlan_count_rx. They are ipvlan_process_multicast, ipvlan_rcv_frame, and ipvlan_nf_input. The former two functions already use the ipvlan directly before ipvlan_count_rx, and ipvlan_nf_input gets the ipvlan from ipvl_addr->master, it is not possible to be NULL too. So the ipvlan pointer check is unnecessary in ipvlan_count_rx. Signed-off-by: Gao Feng <fgao@ikuai8.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-23ipvlan: fix multicast processingMahesh Bandewar
In an IPvlan setup when master is set in loopback mode e.g. ethtool -K eth0 set loopback on where eth0 is master device for IPvlan setup. The failure is caused by the faulty logic that determines if the packet is from TX-path vs. RX-path by just looking at the mac- addresses on the packet while processing multicast packets. In the loopback-mode where this crash was happening, the packets that are sent out are reflected by the NIC and are processed on the RX path, but mac-address check tricks into thinking this packet is from TX path and falsely uses dev_forward_skb() to pass packets to the slave (virtual) devices. This patch records the path while queueing packets and eliminates logic of looking at mac-addresses for the same decision. ------------[ cut here ]------------ kernel BUG at include/linux/skbuff.h:1737! Call Trace: [<ffffffff921fbbc2>] dev_forward_skb+0x92/0xd0 [<ffffffffc031ac65>] ipvlan_process_multicast+0x395/0x4c0 [ipvlan] [<ffffffffc031a9a7>] ? ipvlan_process_multicast+0xd7/0x4c0 [ipvlan] [<ffffffff91cdfea7>] ? process_one_work+0x147/0x660 [<ffffffff91cdff09>] process_one_work+0x1a9/0x660 [<ffffffff91cdfea7>] ? process_one_work+0x147/0x660 [<ffffffff91ce086d>] worker_thread+0x11d/0x360 [<ffffffff91ce0750>] ? rescuer_thread+0x350/0x350 [<ffffffff91ce960b>] kthread+0xdb/0xe0 [<ffffffff91c05c70>] ? _raw_spin_unlock_irq+0x30/0x50 [<ffffffff91ce9530>] ? flush_kthread_worker+0xc0/0xc0 [<ffffffff92348b7a>] ret_from_fork+0x9a/0xd0 [<ffffffff91ce9530>] ? flush_kthread_worker+0xc0/0xc0 Fixes: ba35f8588f47 ("ipvlan: Defer multicast / broadcast processing to a work-queue") Signed-off-by: Mahesh Bandewar <maheshb@google.com> CC: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-23ipvlan: fix various issues in ipvlan_process_multicast()Eric Dumazet
1) netif_rx() / dev_forward_skb() should not be called from process context. 2) ipvlan_count_rx() should be called with preemption disabled. 3) We should check if ipvlan->dev is up before feeding packets to netif_rx() 4) We need to prevent device from disappearing if some packets are in the multicast backlog. 5) One kfree_skb() should be a consume_skb() eventually Fixes: ba35f8588f47 ("ipvlan: Defer multicast / broadcast processing to a work-queue") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Mahesh Bandewar <maheshb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-19ipvlan: Introduce l3s modeMahesh Bandewar
In a typical IPvlan L3 setup where master is in default-ns and each slave is into different (slave) ns. In this setup egress packet processing for traffic originating from slave-ns will hit all NF_HOOKs in slave-ns as well as default-ns. However same is not true for ingress processing. All these NF_HOOKs are hit only in the slave-ns skipping them in the default-ns. IPvlan in L3 mode is restrictive and if admins want to deploy iptables rules in default-ns, this asymmetric data path makes it impossible to do so. This patch makes use of the l3_rcv() (added as part of l3mdev enhancements) to perform input route lookup on RX packets without changing the skb->dev and then uses nf_hook at NF_INET_LOCAL_IN to change the skb->dev just before handing over skb to L4. Signed-off-by: Mahesh Bandewar <maheshb@google.com> CC: David Ahern <dsa@cumulusnetworks.com> Reviewed-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-25ipvlan: Scrub skb before crossing the namespace boundryMahesh Bandewar
The earlier patch c3aaa06d5a63 (ipvlan: scrub skb before routing in L3 mode.) did this but only for TX path in L3 mode. This patch extends it for both the modes for TX/RX path. Signed-off-by: Mahesh Bandewar <maheshb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-21ipvlan: misc changesMahesh Bandewar
1. scope correction for few functions that are used in single file. 2. Adjust variables that are used in fast-path to fit into single cacheline 3. Update rcv_frame() to skip shared check for frames coming over wire Signed-off-by: Mahesh Bandewar <maheshb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-21ipvlan: scrub skb before routing in L3 mode.Mahesh Bandewar
Scrub skb before hitting the iptable hooks to ensure packets hit these hooks. Set the xnet param only when the packet is crossing the ns boundry so if the IPvlan slave and master belong to the same ns, the param will be set to false. Signed-off-by: Mahesh Bandewar <maheshb@google.com> CC: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-17ipvlan: fix use after free of skbSabrina Dubroca
ipvlan_handle_frame is a rx_handler, and when it returns a value other than RX_HANDLER_CONSUMED (here, NET_RX_DROP aka RX_HANDLER_ANOTHER), __netif_receive_skb_core expects that the skb still exists and will process it further, but we just freed it. Fixes: 2ad7bf363841 ("ipvlan: Initial check-in of the IPVLAN driver.") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-17ipvlan: fix leak in ipvlan_rcv_frameSabrina Dubroca
Pass a **skb to ipvlan_rcv_frame so that if skb_share_check returns a new skb, we actually use it during further processing. It's safe to ignore the new skb in the ipvlan_xmit_* functions, because they call ipvlan_rcv_frame with local == true, so that dev_forward_skb is called and always takes ownership of the skb. Fixes: 2ad7bf363841 ("ipvlan: Initial check-in of the IPVLAN driver.") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-22ipvlan: read direct ifindex instead of iflinkBrenden Blanco
In the ipv4 outbound path of an ipvlan device in l3 mode, the ifindex is being grabbed from dev_get_iflink. This works for the physical device case, since as the documentation of that function notes: "Physical interfaces have the same 'ifindex' and 'iflink' values.". However, if the master device is a veth, and the pairs are in separate net namespaces, the route lookup will fail with -ENODEV due to outer veth pair being in a separate namespace from the ipvlan master/routing namespace. ns0 | ns1 | ns2 veth0a--|--veth0b--|--ipvl0 In ipvlan_process_v4_outbound(), a packet sent from ipvl0 in the above configuration will pass fl.flowi4_oif == veth0a to ip_route_output_flow(), but *net == ns1. Notice also that ipv6 processing is not using iflink. Since there is a discrepancy in usage, fixup both v4 and v6 case to use local dev variable. Tested this with l3 ipvlan on top of veth, as well as with single physical interface in the top namespace. Signed-off-by: Brenden Blanco <bblanco@plumgrid.com> Reviewed-by: Jiri Benc <jbenc@redhat.com> Acked-by: Mahesh Bandewar <maheshb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-08ipv4, ipv6: Pass net into ip_local_out and ip6_local_outEric W. Biederman
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-08ipvlan: Cache net in ipvlan_process_v4_outbound and ipvlan_process_v6_outboundEric W. Biederman
Compute net once in ipvlan_process_v4_outbound and ipvlan_process_v6_outbound and store it in a variable so that net does not need to be recomputed next time it is used. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-08ipv6: Merge ip6_local_out and ip6_local_out_skEric W. Biederman
Stop hidding the sk parameter with an inline helper function and make all of the callers pass it, so that it is clear what the function is doing. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-08ipv4: Merge ip_local_out and ip_local_out_skEric W. Biederman
It is confusing and silly hiding a parameter so modify all of the callers to pass in the appropriate socket or skb->sk if no socket is known. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-15ipvlan: use rcu_deference_bh() in ipvlan_queue_xmit()WANG Cong
In tx path rcu_read_lock_bh() is held, so we need rcu_deference_bh(). This fixes the following warning: =============================== [ INFO: suspicious RCU usage. ] 4.1.0-rc1+ #1007 Not tainted ------------------------------- drivers/net/ipvlan/ipvlan.h:106 suspicious rcu_dereference_check() usage! other info that might help us debug this: rcu_scheduler_active = 1, debug_locks = 0 1 lock held by dhclient/1076: #0: (rcu_read_lock_bh){......}, at: [<ffffffff817e8d84>] rcu_lock_acquire+0x0/0x26 stack backtrace: CPU: 2 PID: 1076 Comm: dhclient Not tainted 4.1.0-rc1+ #1007 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 0000000000000001 ffff8800d381bac8 ffffffff81a4154f 000000003c1a3c19 ffff8800d4d0a690 ffff8800d381baf8 ffffffff810b849f ffff880117d41148 ffff880117d40000 ffff880117d40068 0000000000000156 ffff8800d381bb18 Call Trace: [<ffffffff81a4154f>] dump_stack+0x4c/0x65 [<ffffffff810b849f>] lockdep_rcu_suspicious+0x107/0x110 [<ffffffff8165a522>] ipvlan_port_get_rcu+0x47/0x4e [<ffffffff8165ad14>] ipvlan_queue_xmit+0x35/0x450 [<ffffffff817ea45d>] ? rcu_read_unlock+0x3e/0x5f [<ffffffff810a20bf>] ? local_clock+0x19/0x22 [<ffffffff810b4781>] ? __lock_is_held+0x39/0x52 [<ffffffff8165b64c>] ipvlan_start_xmit+0x1b/0x44 [<ffffffff817edf7f>] dev_hard_start_xmit+0x2ae/0x467 [<ffffffff817ee642>] __dev_queue_xmit+0x50a/0x60c [<ffffffff817ee7a7>] dev_queue_xmit_sk+0x13/0x15 [<ffffffff81997596>] dev_queue_xmit+0x10/0x12 [<ffffffff8199b41c>] packet_sendmsg+0xb6b/0xbdf [<ffffffff810b5ea7>] ? mark_lock+0x2e/0x226 [<ffffffff810a1fcc>] ? sched_clock_cpu+0x9e/0xb7 [<ffffffff817d56f9>] sock_sendmsg_nosec+0x12/0x1d [<ffffffff817d7257>] sock_sendmsg+0x29/0x2e [<ffffffff817d72cc>] sock_write_iter+0x70/0x91 [<ffffffff81199563>] __vfs_write+0x7e/0xa7 [<ffffffff811996bc>] vfs_write+0x92/0xe8 [<ffffffff811997d7>] SyS_write+0x47/0x7e [<ffffffff81a4d517>] system_call_fastpath+0x12/0x6f Fixes: 2ad7bf363841 ("ipvlan: Initial check-in of the IPVLAN driver.") Cc: Mahesh Bandewar <maheshb@google.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Mahesh Bandewar <maheshb@google.com> Acked-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-15ipvlan: unhash addresses without synchronize_rcuKonstantin Khlebnikov
All structures used in traffic forwarding are rcu-protected: ipvl_addr, ipvl_dev and ipvl_port. Thus we can unhash addresses without synchronization. We'll anyway hash it back into the same bucket: in worst case lockless lookup will scan hash once again. Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-05ipvlan: Defer multicast / broadcast processing to a work-queueMahesh Bandewar
Processing multicast / broadcast in fast path is performance draining and having more links means more cloning and bringing performance down further. Broadcast; in particular, need to be given to all the virtual links. Earlier tricks of enabling broadcast bit for IPv4 only interfaces are not really working since it fails autoconf. Which means enabling broadcast for all the links if protocol specific hacks do not have to be added into the driver. This patch defers all (incoming as well as outgoing) multicast traffic to a work-queue leaving only the unicast traffic in the fast-path. Now if we need to apply any additional tricks to further reduce the impact of this (multicast / broadcast) type of traffic, it can be implemented while processing this work without affecting the fast-path. Signed-off-by: Mahesh Bandewar <maheshb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-02Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/usb/asix_common.c drivers/net/usb/sr9800.c drivers/net/usb/usbnet.c include/linux/usb/usbnet.h net/ipv4/tcp_ipv4.c net/ipv6/tcp_ipv6.c The TCP conflicts were overlapping changes. In 'net' we added a READ_ONCE() to the socket cached RX route read, whilst in 'net-next' Eric Dumazet touched the surrounding code dealing with how mini sockets are handled. With USB, it's a case of the same bug fix first going into net-next and then I cherry picked it back into net. Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-02dev: introduce dev_get_iflink()Nicolas Dichtel
The goal of this patch is to prepare the removal of the iflink field. It introduces a new ndo function, which will be implemented by virtual interfaces. There is no functional change into this patch. All readers of iflink field now call dev_get_iflink(). Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-31ipvlan: fix check for IP addresses in control pathJiri Benc
When an ipvlan interface is down, its addresses are not on the hash list. Fix checks for existence of addresses not to depend on the hash list, walk through all interface addresses instead. Signed-off-by: Jiri Benc <jbenc@redhat.com> Acked-by: Mahesh Bandewar <maheshb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-31ipvlan: protect against concurrent link removalJiri Benc
Adding and removing to the 'ipvlans' list is already done using _rcu list operations. Signed-off-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-31ipvlan: fix addr hash list corruptionJiri Benc
When ipvlan interface with IP addresses attached is brought down and then deleted, the assigned addresses are deleted twice from the address hash list, first on the interface down and second on the link deletion. Similarly, when an address is added while the interface is down, it is added second time once the interface is brought up. When the interface is down, the addresses should be kept off the hash list for performance reasons. Ensure this is true, which also fixes the double add problem. To fix the double free, check whether the address is hashed before removing it. Reported-by: Dan Williams <dcbw@redhat.com> Signed-off-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: Mahesh Bandewar <maheshb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-30net: mark some potential candidates __read_mostlyDaniel Borkmann
They are all either written once or extremly rarely (e.g. from init code), so we can move them to the .data..read_mostly section. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-25ipvlan: fix incorrect usage of IS_ERR() macro in IPv6 code path.Mahesh Bandewar
The ip6_route_output() always returns a valid dst pointer unlike in IPv4 case. So the validation has to be different from the IPv4 path. Correcting that error in this patch. This was picked up by a static checker with a following warning - drivers/net/ipvlan/ipvlan_core.c:380 ipvlan_process_v6_outbound() warn: 'dst' isn't an ERR_PTR Signed-off-by: Mahesh Bandewar <maheshb@google.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-24ipvlan: Initial check-in of the IPVLAN driver.Mahesh Bandewar
This driver is very similar to the macvlan driver except that it uses L3 on the frame to determine the logical interface while functioning as packet dispatcher. It inherits L2 of the master device hence the packets on wire will have the same L2 for all the packets originating from all virtual devices off of the same master device. This driver was developed keeping the namespace use-case in mind. Hence most of the examples given here take that as the base setup where main-device belongs to the default-ns and virtual devices are assigned to the additional namespaces. The device operates in two different modes and the difference in these two modes in primarily in the TX side. (a) L2 mode : In this mode, the device behaves as a L2 device. TX processing upto L2 happens on the stack of the virtual device associated with (namespace). Packets are switched after that into the main device (default-ns) and queued for xmit. RX processing is simple and all multicast, broadcast (if applicable), and unicast belonging to the address(es) are delivered to the virtual devices. (b) L3 mode : In this mode, the device behaves like a L3 device. TX processing upto L3 happens on the stack of the virtual device associated with (namespace). Packets are switched to the main-device (default-ns) for the L2 processing. Hence the routing table of the default-ns will be used in this mode. RX processins is somewhat similar to the L2 mode except that in this mode only Unicast packets are delivered to the virtual device while main-dev will handle all other packets. The devices can be added using the "ip" command from the iproute2 package - ip link add link <master> <virtual> type ipvlan mode [ l2 | l3 ] Signed-off-by: Mahesh Bandewar <maheshb@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Maciej Żenczykowski <maze@google.com> Cc: Laurent Chavey <chavey@google.com> Cc: Tim Hockin <thockin@google.com> Cc: Brandon Philips <brandon.philips@coreos.com> Cc: Pavel Emelianov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>