summaryrefslogtreecommitdiff
path: root/drivers/net/tun.c
AgeCommit message (Collapse)Author
2019-07-25tun: mark small packets as owned by the tap sockAlexis Bauvin
- v1 -> v2: Move skb_set_owner_w to __tun_build_skb to reduce patch size Small packets going out of a tap device go through an optimized code path that uses build_skb() rather than sock_alloc_send_pskb(). The latter calls skb_set_owner_w(), but the small packet code path does not. The net effect is that small packets are not owned by the userland application's socket (e.g. QEMU), while large packets are. This can be seen with a TCP session, where packets are not owned when the window size is small enough (around PAGE_SIZE), while they are once the window grows (note that this requires the host to support virtio tso for the guest to offload segmentation). All this leads to inconsistent behaviour in the kernel, especially on netfilter modules that uses sk->socket (e.g. xt_owner). Fixes: 66ccbc9c87c2 ("tap: use build_skb() for small packet") Signed-off-by: Alexis Bauvin <abauvin@scaleway.com> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-08coallocate socket_wq with socket itselfAl Viro
socket->wq is assign-once, set when we are initializing both struct socket it's in and struct socket_wq it points to. As the matter of fact, the only reason for separate allocation was the ability to RCU-delay freeing of socket_wq. RCU-delaying the freeing of socket itself gets rid of that need, so we can just fold struct socket_wq into the end of struct socket and simplify the life both for sock_alloc_inode() (one allocation instead of two) and for tun/tap oddballs, where we used to embed struct socket and struct socket_wq into the same structure (now - embedding just the struct socket). Note that reference to struct socket_wq in struct sock does remain a reference - that's unchanged. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-18tun: wake up waitqueues after IFF_UP is setFei Li
Currently after setting tap0 link up, the tun code wakes tx/rx waited queues up in tun_net_open() when .ndo_open() is called, however the IFF_UP flag has not been set yet. If there's already a wait queue, it would fail to transmit when checking the IFF_UP flag in tun_sendmsg(). Then the saving vhost_poll_start() will add the wq into wqh until it is waken up again. Although this works when IFF_UP flag has been set when tun_chr_poll detects; this is not true if IFF_UP flag has not been set at that time. Sadly the latter case is a fatal error, as the wq will never be waken up in future unless later manually setting link up on purpose. Fix this by moving the wakeup process into the NETDEV_UP event notifying process, this makes sure IFF_UP has been set before all waited queues been waken up. Signed-off-by: Fei Li <lifei.shirley@bytedance.com> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-30treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 157Thomas Gleixner
Based on 3 normalized pattern(s): this program is free software you can redistribute it and or modify it under the terms of the gnu general public license as published by the free software foundation either version 2 of the license or at your option any later version this program is distributed in the hope that it will be useful but without any warranty without even the implied warranty of merchantability or fitness for a particular purpose see the gnu general public license for more details this program is free software you can redistribute it and or modify it under the terms of the gnu general public license as published by the free software foundation either version 2 of the license or at your option any later version [author] [kishon] [vijay] [abraham] [i] [kishon]@[ti] [com] this program is distributed in the hope that it will be useful but without any warranty without even the implied warranty of merchantability or fitness for a particular purpose see the gnu general public license for more details this program is free software you can redistribute it and or modify it under the terms of the gnu general public license as published by the free software foundation either version 2 of the license or at your option any later version [author] [graeme] [gregory] [gg]@[slimlogic] [co] [uk] [author] [kishon] [vijay] [abraham] [i] [kishon]@[ti] [com] [based] [on] [twl6030]_[usb] [c] [author] [hema] [hk] [hemahk]@[ti] [com] this program is distributed in the hope that it will be useful but without any warranty without even the implied warranty of merchantability or fitness for a particular purpose see the gnu general public license for more details extracted by the scancode license scanner the SPDX license identifier GPL-2.0-or-later has been chosen to replace the boilerplate/reference in 1105 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Allison Randal <allison@lohutok.net> Reviewed-by: Richard Fontana <rfontana@redhat.com> Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190527070033.202006027@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-09tuntap: synchronize through tfiles array instead of tun->numqueuesJason Wang
When a queue(tfile) is detached through __tun_detach(), we move the last enabled tfile to the position where detached one sit but don't NULL out last position. We expect to synchronize the datapath through tun->numqueues. Unfortunately, this won't work since we're lacking sufficient mechanism to order or synchronize the access to tun->numqueues. To fix this, NULL out the last position during detaching and check RCU protected tfile against NULL instead of checking tun->numqueues in datapath. Cc: YueHaibing <yuehaibing@huawei.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: weiyongjun (A) <weiyongjun1@huawei.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Fixes: c8d68e6be1c3b ("tuntap: multiqueue support") Signed-off-by: Jason Wang <jasowang@redhat.com> Reviewed-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-09tuntap: fix dividing by zero in ebpf queue selectionJason Wang
We need check if tun->numqueues is zero (e.g for the persist device) before trying to use it for modular arithmetic. Reported-by: Eric Dumazet <eric.dumazet@gmail.com> Fixes: 96f84061620c6("tun: add eBPF based queue selection method") Signed-off-by: Jason Wang <jasowang@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-23net: pass net_device argument to the eth_get_headlenStanislav Fomichev
Update all users of eth_get_headlen to pass network device, fetch network namespace from it and pass it down to the flow dissector. This commit is a noop until administrator inserts BPF flow dissector program. Cc: Maxim Krasnyansky <maxk@qti.qualcomm.com> Cc: Saeed Mahameed <saeedm@mellanox.com> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Cc: intel-wired-lan@lists.osuosl.org Cc: Yisen Zhuang <yisen.zhuang@huawei.com> Cc: Salil Mehta <salil.mehta@huawei.com> Cc: Michael Chan <michael.chan@broadcom.com> Cc: Igor Russkikh <igor.russkikh@aquantia.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-03-23net: convert rps_needed and rfs_needed to new static branch apiEric Dumazet
We prefer static_branch_unlikely() over static_key_false() these days. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-21tun: Remove unused first parameter of tun_get_iff()Kirill Tkhai
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-21tun: Add ioctl() TUNGETDEVNETNS cmd to allow obtaining real net ns of tun deviceKirill Tkhai
In commit f2780d6d7475 "tun: Add ioctl() SIOCGSKNS cmd to allow obtaining net ns of tun device" it was missed that tun may change its net ns, while net ns of socket remains the same as it was created initially. SIOCGSKNS returns net ns of socket, so it is not suitable for obtaining net ns of device. We may have two tun devices with the same names in two net ns, and in this case it's not possible to determ, which of them fd refers to (TUNGETIFF will return the same name). This patch adds new ioctl() cmd for obtaining net ns of a device. Reported-by: Harald Albrecht <harald.albrecht@gmx.net> Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-20net: remove 'fallback' argument from dev->ndo_select_queue()Paolo Abeni
After the previous patch, all the callers of ndo_select_queue() provide as a 'fallback' argument netdev_pick_tx. The only exceptions are nested calls to ndo_select_queue(), which pass down the 'fallback' available in the current scope - still netdev_pick_tx. We can drop such argument and replace fallback() invocation with netdev_pick_tx(). This avoids an indirect call per xmit packet in some scenarios (TCP syn, UDP unconnected, XDP generic, pktgen) with device drivers implementing such ndo. It also clean the code a bit. Tested with ixgbe and CONFIG_FCOE=m With pktgen using queue xmit: threads vanilla patched (kpps) (kpps) 1 2334 2428 2 4166 4278 4 7895 8100 v1 -> v2: - rebased after helper's name change Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-16tun: add a missing rcu_read_unlock() in error pathEric Dumazet
In my latest patch I missed one rcu_read_unlock(), in case device is down. Fixes: 4477138fa0ae ("tun: properly test for IFF_UP") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-15tun: properly test for IFF_UPEric Dumazet
Same reasons than the ones explained in commit 4179cb5a4c92 ("vxlan: test dev->flags & IFF_UP before calling netif_rx()") netif_rx_ni() or napi_gro_frags() must be called under a strict contract. At device dismantle phase, core networking clears IFF_UP and flush_all_backlogs() is called after rcu grace period to make sure no incoming packet might be in a cpu backlog and still referencing the device. A similar protocol is used for gro layer. Most drivers call netif_rx() from their interrupt handler, and since the interrupts are disabled at device dismantle, netif_rx() does not have to check dev->flags & IFF_UP Virtual drivers do not have this guarantee, and must therefore make the check themselves. Fixes: 1bd4978a88ac ("tun: honor IFF_UP in tun_get_user()") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-02Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
2019-02-25tun: remove unnecessary memory barrierTimur Celik
Replace set_current_state with __set_current_state since no memory barrier is needed at this point. Signed-off-by: Timur Celik <mail@timurcelik.de> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-24tun: fix blocking readTimur Celik
This patch moves setting of the current state into the loop. Otherwise the task may end up in a busy wait loop if none of the break conditions are met. Signed-off-by: Timur Celik <mail@timurcelik.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-22net: Don't set transport offset to invalid valueMaxim Mikityanskiy
If the socket was created with socket(AF_PACKET, SOCK_RAW, 0), skb->protocol will be unset, __skb_flow_dissect() will fail, and skb_probe_transport_header() will fall back to the offset_hint, making the resulting skb_transport_offset incorrect. If, however, there is no transport header in the packet, transport_header shouldn't be set to an arbitrary value. Fix it by leaving the transport offset unset if it couldn't be found, to be explicit rather than to fill it with some wrong value. It changes the behavior, but if some code relied on the old behavior, it would be broken anyway, as the old one is incorrect. Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-30tun: move the call to tun_set_real_num_queuesGeorge Amanakis
Call tun_set_real_num_queues() after the increment of tun->numqueues since the former depends on it. Otherwise, the number of queues is not correctly accounted for, which results to warnings similar to: "vnet0 selects TX queue 11, but real number of TX queues is 11". Fixes: 0b7959b62573 ("tun: publish tfile after it's fully initialized") Reported-and-tested-by: George Amanakis <gamanakis@gmail.com> Signed-off-by: George Amanakis <gamanakis@gmail.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-10tun: publish tfile after it's fully initializedStanislav Fomichev
BUG: unable to handle kernel NULL pointer dereference at 00000000000000d1 Call Trace: ? napi_gro_frags+0xa7/0x2c0 tun_get_user+0xb50/0xf20 tun_chr_write_iter+0x53/0x70 new_sync_write+0xff/0x160 vfs_write+0x191/0x1e0 __x64_sys_write+0x5e/0xd0 do_syscall_64+0x47/0xf0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 I think there is a subtle race between sending a packet via tap and attaching it: CPU0: CPU1: tun_chr_ioctl(TUNSETIFF) tun_set_iff tun_attach rcu_assign_pointer(tfile->tun, tun); tun_fops->write_iter() tun_chr_write_iter tun_napi_alloc_frags napi_get_frags napi->skb = napi_alloc_skb tun_napi_init netif_napi_add napi->skb = NULL napi->skb is NULL here napi_gro_frags napi_frags_skb skb = napi->skb skb_reset_mac_header(skb) panic() Move rcu_assign_pointer(tfile->tun) and rcu_assign_pointer(tun->tfiles) to be the last thing we do in tun_attach(); this should guarantee that when we call tun_get() we always get an initialized object. v2 changes: * remove extra napi_mutex locks/unlocks for napi operations Reported-by: syzbot <syzkaller@googlegroups.com> Fixes: 90e33d459407 ("tun: enable napi_gro_frags() for TUN/TAP driver") Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-14tun: replace get_cpu_ptr with this_cpu_ptr when bh disabledPrashant Bhole
tun_xdp_one() runs with local bh disabled. So there is no need to disable preemption by calling get_cpu_ptr while updating stats. This patch replaces the use of get_cpu_ptr() with this_cpu_ptr() as a micro-optimization. Also removes related put_cpu_ptr call. Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-13net: dev: Add extack argument to dev_set_mac_address()Petr Machata
A follow-up patch will add a notifier type NETDEV_PRE_CHANGEADDR, which allows vetoing of MAC address changes. One prominent path to that notification is through dev_set_mac_address(). Therefore give this function an extack argument, so that it can be packed together with the notification. Thus a textual reason for rejection (or a warning) can be communicated back to the user. Signed-off-by: Petr Machata <petrm@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-09Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Several conflicts, seemingly all over the place. I used Stephen Rothwell's sample resolutions for many of these, if not just to double check my own work, so definitely the credit largely goes to him. The NFP conflict consisted of a bug fix (moving operations past the rhashtable operation) while chaning the initial argument in the function call in the moved code. The net/dsa/master.c conflict had to do with a bug fix intermixing of making dsa_master_set_mtu() static with the fixing of the tagging attribute location. cls_flower had a conflict because the dup reject fix from Or overlapped with the addition of port range classifiction. __set_phy_supported()'s conflict was relatively easy to resolve because Andrew fixed it in both trees, so it was just a matter of taking the net-next copy. Or at least I think it was :-) Joe Stringer's fix to the handling of netns id 0 in bpf_sk_lookup() intermixed with changes on how the sdif and caller_net are calculated in these code paths in net-next. The remaining BPF conflicts were largely about the addition of the __bpf_md_ptr stuff in 'net' overlapping with adjustments and additions to the relevant data structure where the MD pointer macros are used. Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-06tun: remove unnecessary check in tun_flow_updateLi RongQing
caller has guaranted that rxhash is not zero Signed-off-by: Li RongQing <lirongqing@baidu.com> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-06tun: align write-heavy flow entry members to a cache lineLi RongQing
tun flow entry 'updated' fields are written when receive every packet. Thus if a flow is receiving packets from a particular flow entry, it'll cause false-sharing with all the other who has looked it up, so move it in its own cache line and update 'queue_index' and 'update' field only when they are changed to reduce the cache false-sharing. Signed-off-by: Zhang Yu <zhangyu31@baidu.com> Signed-off-by: Wang Li <wangli39@baidu.com> Signed-off-by: Li RongQing <lirongqing@baidu.com> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-03tun: remove skb access after netif_receive_skbPrashant Bhole
In tun.c skb->len was accessed while doing stats accounting after a call to netif_receive_skb. We can not access skb after this call because buffers may be dropped. The fix for this bug would be to store skb->len in local variable and then use it after netif_receive_skb(). IMO using xdp data size for accounting bytes will be better because input for tun_xdp_one() is xdp_buff. Hence this patch: - fixes a bug by removing skb access after netif_receive_skb() - uses xdp data size for accounting bytes [613.019057] BUG: KASAN: use-after-free in tun_sendmsg+0x77c/0xc50 [tun] [613.021062] Read of size 4 at addr ffff8881da9ab7c0 by task vhost-1115/1155 [613.023073] [613.024003] CPU: 0 PID: 1155 Comm: vhost-1115 Not tainted 4.20.0-rc3-vm+ #232 [613.026029] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 [613.029116] Call Trace: [613.031145] dump_stack+0x5b/0x90 [613.032219] print_address_description+0x6c/0x23c [613.034156] ? tun_sendmsg+0x77c/0xc50 [tun] [613.036141] kasan_report.cold.5+0x241/0x308 [613.038125] tun_sendmsg+0x77c/0xc50 [tun] [613.040109] ? tun_get_user+0x1960/0x1960 [tun] [613.042094] ? __isolate_free_page+0x270/0x270 [613.045173] vhost_tx_batch.isra.14+0xeb/0x1f0 [vhost_net] [613.047127] ? peek_head_len.part.13+0x90/0x90 [vhost_net] [613.049096] ? get_tx_bufs+0x5a/0x2c0 [vhost_net] [613.051106] ? vhost_enable_notify+0x2d8/0x420 [vhost] [613.053139] handle_tx_copy+0x2d0/0x8f0 [vhost_net] [613.053139] ? vhost_net_buf_peek+0x340/0x340 [vhost_net] [613.053139] ? __mutex_lock+0x8d9/0xb30 [613.053139] ? finish_task_switch+0x8f/0x3f0 [613.053139] ? handle_tx+0x32/0x120 [vhost_net] [613.053139] ? mutex_trylock+0x110/0x110 [613.053139] ? finish_task_switch+0xcf/0x3f0 [613.053139] ? finish_task_switch+0x240/0x3f0 [613.053139] ? __switch_to_asm+0x34/0x70 [613.053139] ? __switch_to_asm+0x40/0x70 [613.053139] ? __schedule+0x506/0xf10 [613.053139] handle_tx+0xc7/0x120 [vhost_net] [613.053139] vhost_worker+0x166/0x200 [vhost] [613.053139] ? vhost_dev_init+0x580/0x580 [vhost] [613.053139] ? __kthread_parkme+0x77/0x90 [613.053139] ? vhost_dev_init+0x580/0x580 [vhost] [613.053139] kthread+0x1b1/0x1d0 [613.053139] ? kthread_park+0xb0/0xb0 [613.053139] ret_from_fork+0x35/0x40 [613.088705] [613.088705] Allocated by task 1155: [613.088705] kasan_kmalloc+0xbf/0xe0 [613.088705] kmem_cache_alloc+0xdc/0x220 [613.088705] __build_skb+0x2a/0x160 [613.088705] build_skb+0x14/0xc0 [613.088705] tun_sendmsg+0x4f0/0xc50 [tun] [613.088705] vhost_tx_batch.isra.14+0xeb/0x1f0 [vhost_net] [613.088705] handle_tx_copy+0x2d0/0x8f0 [vhost_net] [613.088705] handle_tx+0xc7/0x120 [vhost_net] [613.088705] vhost_worker+0x166/0x200 [vhost] [613.088705] kthread+0x1b1/0x1d0 [613.088705] ret_from_fork+0x35/0x40 [613.088705] [613.088705] Freed by task 1155: [613.088705] __kasan_slab_free+0x12e/0x180 [613.088705] kmem_cache_free+0xa0/0x230 [613.088705] ip6_mc_input+0x40f/0x5a0 [613.088705] ipv6_rcv+0xc9/0x1e0 [613.088705] __netif_receive_skb_one_core+0xc1/0x100 [613.088705] netif_receive_skb_internal+0xc4/0x270 [613.088705] br_pass_frame_up+0x2b9/0x2e0 [613.088705] br_handle_frame_finish+0x2fb/0x7a0 [613.088705] br_handle_frame+0x30f/0x6c0 [613.088705] __netif_receive_skb_core+0x61a/0x15b0 [613.088705] __netif_receive_skb_one_core+0x8e/0x100 [613.088705] netif_receive_skb_internal+0xc4/0x270 [613.088705] tun_sendmsg+0x738/0xc50 [tun] [613.088705] vhost_tx_batch.isra.14+0xeb/0x1f0 [vhost_net] [613.088705] handle_tx_copy+0x2d0/0x8f0 [vhost_net] [613.088705] handle_tx+0xc7/0x120 [vhost_net] [613.088705] vhost_worker+0x166/0x200 [vhost] [613.088705] kthread+0x1b1/0x1d0 [613.088705] ret_from_fork+0x35/0x40 [613.088705] [613.088705] The buggy address belongs to the object at ffff8881da9ab740 [613.088705] which belongs to the cache skbuff_head_cache of size 232 Fixes: 043d222f93ab ("tuntap: accept an array of XDP buffs through sendmsg()") Reviewed-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Signed-off-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp> Acked-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-30tun: forbid iface creation with rtnl opsNicolas Dichtel
It's not supported right now (the goal of the initial patch was to support 'ip link del' only). Before the patch: $ ip link add foo type tun [ 239.632660] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000 [snip] [ 239.636410] RIP: 0010:register_netdevice+0x8e/0x3a0 This panic occurs because dev->netdev_ops is not set by tun_setup(). But to have something usable, it will require more than just setting netdev_ops. Fixes: f019a7a594d9 ("tun: Implement ip link del tunXXX") CC: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-30tun: implement carrier changeNicolas Dichtel
The userspace may need to control the carrier state. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: Didier Pallard <didier.pallard@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-19Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
2018-11-18tuntap: fix multiqueue rxMatthew Cover
When writing packets to a descriptor associated with a combined queue, the packets should end up on that queue. Before this change all packets written to any descriptor associated with a tap interface end up on rx-0, even when the descriptor is associated with a different queue. The rx traffic can be generated by either of the following. 1. a simple tap program which spins up multiple queues and writes packets to each of the file descriptors 2. tx from a qemu vm with a tap multiqueue netdev The queue for rx traffic can be observed by either of the following (done on the hypervisor in the qemu case). 1. a simple netmap program which opens and reads from per-queue descriptors 2. configuring RPS and doing per-cpu captures with rxtxcpu Alternatively, if you printk() the return value of skb_get_rx_queue() just before each instance of netif_receive_skb() in tun.c, you will get 65535 for every skb. Calling skb_record_rx_queue() to set the rx queue to the queue_index fixes the association between descriptor and rx queue. Signed-off-by: Matthew Cover <matthew.cover@stackpath.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-18tun: use netdev_alloc_frag() in tun_napi_alloc_frags()Eric Dumazet
In order to cook skbs in the same way than Ethernet drivers, it is probably better to not use GFP_KERNEL, but rather use the GFP_ATOMIC and PFMEMALLOC mechanisms provided by netdev_alloc_frag(). This would allow to use tun driver even in memory stress situations, especially if swap is used over this tun channel. Fixes: 90e33d459407 ("tun: enable napi_gro_frags() for TUN/TAP driver") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Petar Penkov <peterpenkov96@gmail.com> Cc: Mahesh Bandewar <maheshb@google.com> Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-17tun: Adjust on-stack tun_page initialization.David S. Miller
Instead of constantly playing with the struct initializer syntax trying to make gcc and CLang both happy, just clear it out using memset(). >> drivers/net/tun.c:2503:42: warning: Using plain integer as NULL pointer Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-17tuntap: free XDP dropped packets in a batchJason Wang
Thanks to the batched XDP buffs through msg_control. Instead of calling put_page() for each page which involves a atomic operation, let's batch them by record the last page that needs to be freed and its refcnt count and free them in a batch. Testpmd(virtio-user + vhost_net) + XDP_DROP shows 3.8% improvement. Before: 4.71Mpps After : 4.89Mpps Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-07tun: compute the RFS hash only if needed.Paolo Abeni
The tun XDP sendmsg code path, unconditionally computes the symmetric hash of each packet for RFS's sake, even when we could skip it. e.g. when the device has a single queue. This change adds the check already in-place for the skb sendmsg path to avoid unneeded hashing. The above gives small, but measurable, performance gain for VM xmit path when zerocopy is not enabled. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-15tun: Consistently configure generic netdev params via rtnetlinkSerhey Popovych
Configuring generic network device parameters on tun will fail in presence of IFLA_INFO_KIND attribute in IFLA_LINKINFO nested attribute since tun_validate() always return failure. This can be visualized with following ip-link(8) command sequences: # ip link set dev tun0 group 100 # ip link set dev tun0 group 100 type tun RTNETLINK answers: Invalid argument with contrast to dummy and veth drivers: # ip link set dev dummy0 group 100 # ip link set dev dummy0 type dummy # ip link set dev veth0 group 100 # ip link set dev veth0 group 100 type veth Fix by returning zero in tun_validate() when @data is NULL that is always in case since rtnl_link_ops->maxtype is zero in tun driver. Fixes: f019a7a594d9 ("tun: Implement ip link del tunXXX") Signed-off-by: Serhey Popovych <serhe.popovych@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-10net: tun: remove useless codes of tun_automq_select_queueWang Li
Because the function __skb_get_hash_symmetric always returns non-zero. Signed-off-by: Zhang Yu <zhangyu31@baidu.com> Signed-off-by: Wang Li <wangli39@baidu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Minor conflict in net/core/rtnetlink.c, David Ahern's bug fix in 'net' overlapped the renaming of a netlink attribute in net-next. Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-01tun: napi flags belong to tfileEric Dumazet
Since tun->flags might be shared by multiple tfile structures, it is better to make sure tun_get_user() is using the flags for the current tfile. Presence of the READ_ONCE() in tun_napi_frags_enabled() gave a hint of what could happen, but we need something stronger to please syzbot. kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] PREEMPT SMP KASAN CPU: 0 PID: 13647 Comm: syz-executor5 Not tainted 4.19.0-rc5+ #59 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:dev_gro_receive+0x132/0x2720 net/core/dev.c:5427 Code: 48 c1 ea 03 80 3c 02 00 0f 85 6e 20 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8b 6e 10 49 8d bd d0 00 00 00 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 59 20 00 00 4d 8b a5 d0 00 00 00 31 ff 41 81 e4 RSP: 0018:ffff8801c400f410 EFLAGS: 00010202 RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffff8618d325 RDX: 000000000000001a RSI: ffffffff86189f97 RDI: 00000000000000d0 RBP: ffff8801c400f608 R08: ffff8801c8fb4300 R09: 0000000000000000 R10: ffffed0038801ed7 R11: 0000000000000003 R12: ffff8801d327d358 R13: 0000000000000000 R14: ffff8801c16dd8c0 R15: 0000000000000004 FS: 00007fe003615700(0000) GS:ffff8801dac00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fe1f3c43db8 CR3: 00000001bebb2000 CR4: 00000000001406f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: napi_gro_frags+0x3f4/0xc90 net/core/dev.c:5715 tun_get_user+0x31d5/0x42a0 drivers/net/tun.c:1922 tun_chr_write_iter+0xb9/0x154 drivers/net/tun.c:1967 call_write_iter include/linux/fs.h:1808 [inline] new_sync_write fs/read_write.c:474 [inline] __vfs_write+0x6b8/0x9f0 fs/read_write.c:487 vfs_write+0x1fc/0x560 fs/read_write.c:549 ksys_write+0x101/0x260 fs/read_write.c:598 __do_sys_write fs/read_write.c:610 [inline] __se_sys_write fs/read_write.c:607 [inline] __x64_sys_write+0x73/0xb0 fs/read_write.c:607 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x457579 Code: 1d b4 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 eb b3 fb ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007fe003614c78 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 0000000000457579 RDX: 0000000000000012 RSI: 0000000020000000 RDI: 000000000000000a RBP: 000000000072c040 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00007fe0036156d4 R13: 00000000004c5574 R14: 00000000004d8e98 R15: 00000000ffffffff Modules linked in: RIP: 0010:dev_gro_receive+0x132/0x2720 net/core/dev.c:5427 Code: 48 c1 ea 03 80 3c 02 00 0f 85 6e 20 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8b 6e 10 49 8d bd d0 00 00 00 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 59 20 00 00 4d 8b a5 d0 00 00 00 31 ff 41 81 e4 RSP: 0018:ffff8801c400f410 EFLAGS: 00010202 RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffff8618d325 RDX: 000000000000001a RSI: ffffffff86189f97 RDI: 00000000000000d0 RBP: ffff8801c400f608 R08: ffff8801c8fb4300 R09: 0000000000000000 R10: ffffed0038801ed7 R11: 0000000000000003 R12: ffff8801d327d358 R13: 0000000000000000 R14: ffff8801c16dd8c0 R15: 0000000000000004 FS: 00007fe003615700(0000) GS:ffff8801dac00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fe1f3c43db8 CR3: 00000001bebb2000 CR4: 00000000001406f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Fixes: 90e33d459407 ("tun: enable napi_gro_frags() for TUN/TAP driver") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-01tun: initialize napi_mutex unconditionallyEric Dumazet
This is the first part to fix following syzbot report : console output: https://syzkaller.appspot.com/x/log.txt?x=145378e6400000 kernel config: https://syzkaller.appspot.com/x/.config?x=443816db871edd66 dashboard link: https://syzkaller.appspot.com/bug?extid=e662df0ac1d753b57e80 Following patch is fixing the race condition, but it seems safer to initialize this mutex at tfile creation anyway. Fixes: 90e33d459407 ("tun: enable napi_gro_frags() for TUN/TAP driver") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot+e662df0ac1d753b57e80@syzkaller.appspotmail.com Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-01tun: remove unused parametersEric Dumazet
tun_napi_disable() and tun_napi_del() do not need a pointer to the tun_struct Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-25Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Version bump conflict in batman-adv, take what's in net-next. iavf conflict, adjustment of netdev_ops in net-next conflicting with poll controller method removal in net. Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-23tun: remove ndo_poll_controllerEric Dumazet
As diagnosed by Song Liu, ndo_poll_controller() can be very dangerous on loaded hosts, since the cpu calling ndo_poll_controller() might steal all NAPI contexts (for all RX/TX queues of the NIC). This capture can last for unlimited amount of time, since one cpu is generally not able to drain all the queues under load. tun uses NAPI for TX completions, so we better let core networking stack call the napi->poll() to avoid the capture. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-13tuntap: accept an array of XDP buffs through sendmsg()Jason Wang
This patch implement TUN_MSG_PTR msg_control type. This type allows the caller to pass an array of XDP buffs to tuntap through ptr field of the tun_msg_control. If an XDP program is attached, tuntap can run XDP program directly. If not, tuntap will build skb and do a fast receiving since part of the work has been done by vhost_net. This will avoid lots of indirect calls thus improves the icache utilization and allows to do XDP batched flushing when doing XDP redirection. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-13tun: switch to new type of msg_controlJason Wang
This patch introduces to a new tun/tap specific msg_control: #define TUN_MSG_UBUF 1 #define TUN_MSG_PTR 2 struct tun_msg_ctl { int type; void *ptr; }; This allows us to pass different kinds of msg_control through sendmsg(). The first supported type is ubuf (TUN_MSG_UBUF) which will be used by the existed vhost_net zerocopy code. The second is XDP buff, which allows vhost_net to pass XDP buff to TUN. This could be used to implement accepting an array of XDP buffs from vhost_net in the following patches. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-13tuntap: move XDP flushing out of tun_do_xdp()Jason Wang
This will allow adding batch flushing on top. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-13tuntap: split out XDP logicJason Wang
This patch split out XDP logic into a single function. This make it to be reused by XDP batching path in the following patch. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-13tuntap: tweak on the path of skb XDP case in tun_build_skb()Jason Wang
If we're sure not to go native XDP, there's no need for several things like bh and rcu stuffs. So this patch introduces a helper to build skb and hold page refcnt. When we found we will go through skb path, build skb directly. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-13tuntap: simplify error handling in tun_build_skb()Jason Wang
There's no need to duplicate page get logic in each action. So this patch tries to get page and calculate the offset before processing XDP actions (except for XDP_DROP), and undo them when meet errors (we don't care the performance on errors). This will be used for factoring out XDP logic. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-13tuntap: enable bh early during processing XDPJason Wang
This patch move the bh enabling a little bit earlier, this will be used for factoring out the core XDP logic of tuntap. Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-13tuntap: switch to use XDP_PACKET_HEADROOMJason Wang
Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-13net: sock: introduce SOCK_XDPJason Wang
This patch introduces a new sock flag - SOCK_XDP. This will be used for notifying the upper layer that XDP program is attached on the lower socket, and requires for extra headroom. TUN will be the first user. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>