summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2020-06-22xfrm: bail early on slave pass over skbJarod Wilson
This is prep work for initial support of bonding hardware encryption pass-through support. The bonding driver will fill in the slave_dev pointer, and we use that to know not to skb_push() again on a given skb that was already processed on the bond device. CC: Jay Vosburgh <j.vosburgh@gmail.com> CC: Veaceslav Falico <vfalico@gmail.com> CC: Andy Gospodarek <andy@greyhouse.net> CC: "David S. Miller" <davem@davemloft.net> CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com> CC: Jakub Kicinski <kuba@kernel.org> CC: Steffen Klassert <steffen.klassert@secunet.com> CC: Herbert Xu <herbert@gondor.apana.org.au> CC: netdev@vger.kernel.org CC: intel-wired-lan@lists.osuosl.org Signed-off-by: Jarod Wilson <jarod@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-22net/devlink: Support setting hardware address of port functionParav Pandit
PCI PF and VF devlink port can manage the function represented by a devlink port. Allow users to set port function's hardware address. Example of a PCI VF port which supports a port function: $ devlink port show pci/0000:06:00.0/2 pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1 function: hw_addr 00:00:00:00:00:00 $ devlink port function set pci/0000:06:00.0/2 hw_addr 00:11:22:33:44:55 $ devlink port show pci/0000:06:00.0/2 pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1 function: hw_addr 00:11:22:33:44:55 Signed-off-by: Parav Pandit <parav@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-22net/devlink: Support querying hardware address of port functionParav Pandit
PCI PF and VF devlink port can manage the function represented by a devlink port. Enable users to query port function's hardware address. Example of a PCI VF port which supports a port function: $ devlink port show pci/0000:06:00.0/2 pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1 function: hw_addr 00:11:22:33:44:66 $ devlink port show pci/0000:06:00.0/2 -jp { "port": { "pci/0000:06:00.0/2": { "type": "eth", "netdev": "enp6s0pf0vf1", "flavour": "pcivf", "pfnum": 0, "vfnum": 1, "function": { "hw_addr": "00:11:22:33:44:66" } } } } Signed-off-by: Parav Pandit <parav@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-22net/devlink: Prepare devlink port functions to fill extackParav Pandit
Prepare devlink port related functions to optionally fill up the extack information which will be used in subsequent patch by port function attribute(s). Signed-off-by: Parav Pandit <parav@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-22bpf: Set map_btf_{name, id} for all map typesAndrey Ignatov
Set map_btf_name and map_btf_id for all map types so that map fields can be accessed by bpf programs. Signed-off-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/a825f808f22af52b018dbe82f1c7d29dab5fc978.1592600985.git.rdna@fb.com
2020-06-22bpf: Rename bpf_htab to bpf_shtab in sock_mapAndrey Ignatov
There are two different `struct bpf_htab` in bpf code in the following files: - kernel/bpf/hashtab.c - net/core/sock_map.c It makes it impossible to find proper btf_id by name = "bpf_htab" and kind = BTF_KIND_STRUCT what is needed to support access to map ptr so that bpf program can access `struct bpf_htab` fields. To make it possible one of the struct-s should be renamed, sock_map.c looks like a better candidate for rename since it's specialized version of hashtab. Rename it to bpf_shtab ("sh" stands for Sock Hash). Signed-off-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/c006a639e03c64ca50fc87c4bb627e0bfba90f4e.1592600985.git.rdna@fb.com
2020-06-22xprtrdma: Fix handling of RDMA_ERROR repliesChuck Lever
The RPC client currently doesn't handle ERR_CHUNK replies correctly. rpcrdma_complete_rqst() incorrectly passes a negative number to xprt_complete_rqst() as the number of bytes copied. Instead, set task->tk_status to the error value, and return zero bytes copied. In these cases, return -EIO rather than -EREMOTEIO. The RPC client's finite state machine doesn't know what to do with -EREMOTEIO. Additional clean ups: - Don't double-count RDMA_ERROR replies - Remove a stale comment Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Cc: <stable@kernel.vger.org> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-06-22xprtrdma: Clean up disconnectChuck Lever
1. Ensure that only rpcrdma_cm_event_handler() modifies ep->re_connect_status to avoid racy changes to that field. 2. Ensure that xprt_force_disconnect() is invoked only once as a transport is closed or destroyed. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-06-22xprtrdma: Clean up synopsis of rpcrdma_flush_disconnect()Chuck Lever
Refactor: Pass struct rpcrdma_xprt instead of an IB layer object. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-06-22xprtrdma: Use re_connect_status safely in rpcrdma_xprt_connect()Chuck Lever
Clean up: Sometimes creating a fresh rpcrdma_ep can fail. That's why xprt_rdma_connect() always checks if the r_xprt->rx_ep pointer is valid before dereferencing it. Instead, xprt_rdma_connect() can simply check rpcrdma_xprt_connect()'s return value. Also, there's no need to set re_connect_status to zero just after the rpcrdma_ep is created, since it is allocated with kzalloc. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-06-22xprtrdma: Prevent dereferencing r_xprt->rx_ep after it is freedChuck Lever
r_xprt->rx_ep is known to be good while the transport's send lock is held. Otherwise additional references on rx_ep must be held when it is used outside of that lock's critical sections. For now, bump the rx_ep reference count once whenever there is at least one outstanding Receive WR. This avoids the memory bandwidth overhead of taking and releasing the reference count for every ib_post_recv() and Receive completion. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-06-20net: Add MODULE_DESCRIPTION entries to network modulesRob Gill
The user tool modinfo is used to get information on kernel modules, including a description where it is available. This patch adds a brief MODULE_DESCRIPTION to the following modules: 9p drop_monitor esp4_offload esp6_offload fou fou6 ila sch_fq sch_fq_codel sch_hhf Signed-off-by: Rob Gill <rrobgill@protonmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-20rxrpc: Fix notification call on completion of discarded callsDavid Howells
When preallocated service calls are being discarded, they're passed to ->discard_new_call() to have the caller clean up any attached higher-layer preallocated pieces before being marked completed. However, the act of marking them completed now invokes the call's notification function - which causes a problem because that function might assume that the previously freed pieces of memory are still there. Fix this by setting a dummy notification function on the socket after calling ->discard_new_call(). This results in the following kasan message when the kafs module is removed. ================================================================== BUG: KASAN: use-after-free in afs_wake_up_async_call+0x6aa/0x770 fs/afs/rxrpc.c:707 Write of size 1 at addr ffff8880946c39e4 by task kworker/u4:1/21 CPU: 0 PID: 21 Comm: kworker/u4:1 Not tainted 5.8.0-rc1-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Workqueue: netns cleanup_net Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x18f/0x20d lib/dump_stack.c:118 print_address_description.constprop.0.cold+0xd3/0x413 mm/kasan/report.c:383 __kasan_report mm/kasan/report.c:513 [inline] kasan_report.cold+0x1f/0x37 mm/kasan/report.c:530 afs_wake_up_async_call+0x6aa/0x770 fs/afs/rxrpc.c:707 rxrpc_notify_socket+0x1db/0x5d0 net/rxrpc/recvmsg.c:40 __rxrpc_set_call_completion.part.0+0x172/0x410 net/rxrpc/recvmsg.c:76 __rxrpc_call_completed net/rxrpc/recvmsg.c:112 [inline] rxrpc_call_completed+0xca/0xf0 net/rxrpc/recvmsg.c:111 rxrpc_discard_prealloc+0x781/0xab0 net/rxrpc/call_accept.c:233 rxrpc_listen+0x147/0x360 net/rxrpc/af_rxrpc.c:245 afs_close_socket+0x95/0x320 fs/afs/rxrpc.c:110 afs_net_exit+0x1bc/0x310 fs/afs/main.c:155 ops_exit_list.isra.0+0xa8/0x150 net/core/net_namespace.c:186 cleanup_net+0x511/0xa50 net/core/net_namespace.c:603 process_one_work+0x965/0x1690 kernel/workqueue.c:2269 worker_thread+0x96/0xe10 kernel/workqueue.c:2415 kthread+0x3b5/0x4a0 kernel/kthread.c:291 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:293 Allocated by task 6820: save_stack+0x1b/0x40 mm/kasan/common.c:48 set_track mm/kasan/common.c:56 [inline] __kasan_kmalloc mm/kasan/common.c:494 [inline] __kasan_kmalloc.constprop.0+0xbf/0xd0 mm/kasan/common.c:467 kmem_cache_alloc_trace+0x153/0x7d0 mm/slab.c:3551 kmalloc include/linux/slab.h:555 [inline] kzalloc include/linux/slab.h:669 [inline] afs_alloc_call+0x55/0x630 fs/afs/rxrpc.c:141 afs_charge_preallocation+0xe9/0x2d0 fs/afs/rxrpc.c:757 afs_open_socket+0x292/0x360 fs/afs/rxrpc.c:92 afs_net_init+0xa6c/0xe30 fs/afs/main.c:125 ops_init+0xaf/0x420 net/core/net_namespace.c:151 setup_net+0x2de/0x860 net/core/net_namespace.c:341 copy_net_ns+0x293/0x590 net/core/net_namespace.c:482 create_new_namespaces+0x3fb/0xb30 kernel/nsproxy.c:110 unshare_nsproxy_namespaces+0xbd/0x1f0 kernel/nsproxy.c:231 ksys_unshare+0x43d/0x8e0 kernel/fork.c:2983 __do_sys_unshare kernel/fork.c:3051 [inline] __se_sys_unshare kernel/fork.c:3049 [inline] __x64_sys_unshare+0x2d/0x40 kernel/fork.c:3049 do_syscall_64+0x60/0xe0 arch/x86/entry/common.c:359 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Freed by task 21: save_stack+0x1b/0x40 mm/kasan/common.c:48 set_track mm/kasan/common.c:56 [inline] kasan_set_free_info mm/kasan/common.c:316 [inline] __kasan_slab_free+0xf7/0x140 mm/kasan/common.c:455 __cache_free mm/slab.c:3426 [inline] kfree+0x109/0x2b0 mm/slab.c:3757 afs_put_call+0x585/0xa40 fs/afs/rxrpc.c:190 rxrpc_discard_prealloc+0x764/0xab0 net/rxrpc/call_accept.c:230 rxrpc_listen+0x147/0x360 net/rxrpc/af_rxrpc.c:245 afs_close_socket+0x95/0x320 fs/afs/rxrpc.c:110 afs_net_exit+0x1bc/0x310 fs/afs/main.c:155 ops_exit_list.isra.0+0xa8/0x150 net/core/net_namespace.c:186 cleanup_net+0x511/0xa50 net/core/net_namespace.c:603 process_one_work+0x965/0x1690 kernel/workqueue.c:2269 worker_thread+0x96/0xe10 kernel/workqueue.c:2415 kthread+0x3b5/0x4a0 kernel/kthread.c:291 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:293 The buggy address belongs to the object at ffff8880946c3800 which belongs to the cache kmalloc-1k of size 1024 The buggy address is located 484 bytes inside of 1024-byte region [ffff8880946c3800, ffff8880946c3c00) The buggy address belongs to the page: page:ffffea000251b0c0 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 flags: 0xfffe0000000200(slab) raw: 00fffe0000000200 ffffea0002546508 ffffea00024fa248 ffff8880aa000c40 raw: 0000000000000000 ffff8880946c3000 0000000100000002 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff8880946c3880: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff8880946c3900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >ffff8880946c3980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff8880946c3a00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff8880946c3a80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ================================================================== Reported-by: syzbot+d3eccef36ddbd02713e9@syzkaller.appspotmail.com Fixes: 5ac0d62226a0 ("rxrpc: Fix missing notification") Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-20Remove redundant skb null checkGaurav Singh
Remove the redundant null check for skb. Signed-off-by: Gaurav Singh <gaurav1086@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-20tcp: remove indirect calls for icsk->icsk_af_ops->send_checkEric Dumazet
Mitigate RETPOLINE costs in __tcp_transmit_skb() by using INDIRECT_CALL_INET() wrapper. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-20tcp: remove indirect calls for icsk->icsk_af_ops->queue_xmitEric Dumazet
Mitigate RETPOLINE costs in __tcp_transmit_skb() by using INDIRECT_CALL_INET() wrapper. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-20Remove redundant condition in qdisc_graftGaurav Singh
parent cannot be NULL here since its in the else part of the if (parent == NULL) condition. Remove the extra check on parent pointer. Signed-off-by: Gaurav Singh <gaurav1086@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-20l3mdev: add infrastructure for table to VRF mappingAndrea Mayer
Add infrastructure to l3mdev (the core code for Layer 3 master devices) in order to find out the corresponding VRF device for a given table id. Therefore, the l3mdev implementations: - can register a callback that returns the device index of the l3mdev associated with a given table id; - can offer the lookup function (table to VRF device). Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-19net/sched: cls_u32: Use struct_size() in kzalloc()Gustavo A. R. Silva
Make use of the struct_size() helper instead of an open-coded version in order to avoid any potential type mistakes. This code was detected with the help of Coccinelle and, audited and fixed manually. Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-19taprio: Use struct_size() in kzalloc()Gustavo A. R. Silva
Make use of the struct_size() helper instead of an open-coded version in order to avoid any potential type mistakes. Also, remove unnecessary variable _size_. This code was detected with the help of Coccinelle and, audited and fixed manually. Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-19tipc: Use struct_size() helperGustavo A. R. Silva
Make use of the struct_size() helper instead of an open-coded version in order to avoid any potential type mistakes. This code was detected with the help of Coccinelle and, audited and fixed manually. Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-19net/sched: cls_api: fix nooffloaddevcnt warning dmesg logwenxu
The block->nooffloaddevcnt should always count for indr block. even the indr block offload successful. The representor maybe gone away and the ingress qdisc can work in software mode. block->nooffloaddevcnt warning with following dmesg log: [ 760.667058] ##################################################### [ 760.668186] ## TEST test-ecmp-add-vxlan-encap-disable-sriov.sh ## [ 760.669179] ##################################################### [ 761.780655] :test: Fedora 30 (Thirty) [ 761.783794] :test: Linux reg-r-vrt-018-180 5.7.0+ [ 761.822890] :test: NIC ens1f0 FW 16.26.6000 PCI 0000:81:00.0 DEVICE 0x1019 ConnectX-5 Ex [ 761.860244] mlx5_core 0000:81:00.0 ens1f0: Link up [ 761.880693] IPv6: ADDRCONF(NETDEV_CHANGE): ens1f0: link becomes ready [ 762.059732] mlx5_core 0000:81:00.1 ens1f1: Link up [ 762.234341] :test: unbind vfs of ens1f0 [ 762.257825] :test: Change ens1f0 eswitch (0000:81:00.0) mode to switchdev [ 762.291363] :test: unbind vfs of ens1f1 [ 762.306914] :test: Change ens1f1 eswitch (0000:81:00.1) mode to switchdev [ 762.309237] mlx5_core 0000:81:00.1: E-Switch: Disable: mode(LEGACY), nvfs(2), active vports(3) [ 763.282598] mlx5_core 0000:81:00.1: E-Switch: Supported tc offload range - chains: 4294967294, prios: 4294967295 [ 763.362825] mlx5_core 0000:81:00.1: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0) [ 763.444465] mlx5_core 0000:81:00.1 ens1f1: renamed from eth0 [ 763.460088] mlx5_core 0000:81:00.1: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0) [ 763.502586] mlx5_core 0000:81:00.1: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0) [ 763.552429] ens1f1_0: renamed from eth0 [ 763.569569] mlx5_core 0000:81:00.1: E-Switch: Enable: mode(OFFLOADS), nvfs(2), active vports(3) [ 763.629694] ens1f1_1: renamed from eth1 [ 764.631552] IPv6: ADDRCONF(NETDEV_CHANGE): ens1f1_0: link becomes ready [ 764.670841] :test: unbind vfs of ens1f0 [ 764.681966] :test: unbind vfs of ens1f1 [ 764.726762] mlx5_core 0000:81:00.0 ens1f0: Link up [ 764.766511] mlx5_core 0000:81:00.1 ens1f1: Link up [ 764.797325] :test: Add multipath vxlan encap rule and disable sriov [ 764.798544] :test: config multipath route [ 764.812732] mlx5_core 0000:81:00.0: lag map port 1:2 port 2:2 [ 764.874556] mlx5_core 0000:81:00.0: modify lag map port 1:1 port 2:2 [ 765.603681] :test: OK [ 765.659048] IPv6: ADDRCONF(NETDEV_CHANGE): ens1f1_1: link becomes ready [ 765.675085] :test: verify rule in hw [ 765.694237] IPv6: ADDRCONF(NETDEV_CHANGE): ens1f0: link becomes ready [ 765.711892] IPv6: ADDRCONF(NETDEV_CHANGE): ens1f1: link becomes ready [ 766.979230] :test: OK [ 768.125419] :test: OK [ 768.127519] :test: - disable sriov ens1f1 [ 768.131160] pci 0000:81:02.2: Removing from iommu group 75 [ 768.132646] pci 0000:81:02.3: Removing from iommu group 76 [ 769.179749] mlx5_core 0000:81:00.1: E-Switch: Disable: mode(OFFLOADS), nvfs(2), active vports(3) [ 769.455627] mlx5_core 0000:81:00.0: modify lag map port 1:1 port 2:1 [ 769.703990] mlx5_core 0000:81:00.1: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0) [ 769.988637] mlx5_core 0000:81:00.1 ens1f1: renamed from eth0 [ 769.990022] :test: - disable sriov ens1f0 [ 769.994922] pci 0000:81:00.2: Removing from iommu group 73 [ 769.997048] pci 0000:81:00.3: Removing from iommu group 74 [ 771.035813] mlx5_core 0000:81:00.0: E-Switch: Disable: mode(OFFLOADS), nvfs(2), active vports(3) [ 771.339091] ------------[ cut here ]------------ [ 771.340812] WARNING: CPU: 6 PID: 3448 at net/sched/cls_api.c:749 tcf_block_offload_unbind.isra.0+0x5c/0x60 [ 771.341728] Modules linked in: act_mirred act_tunnel_key cls_flower dummy vxlan ip6_udp_tunnel udp_tunnel sch_ingress nfsv3 nfs_acl nfs lockd grace fscache tun bridge stp llc sunrpc rdma_ucm rdma_cm iw_cm ib_cm mlx5_ib ib_uverbs ib_core mlx5_core intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp mlxfw act_ct nf_flow_table kvm_intel nf_nat kvm nf_conntrack irqbypass crct10dif_pclmul igb crc32_pclmul nf_defrag_ipv6 libcrc32c nf_defrag_ipv4 crc32c_intel ghash_clmulni_intel ptp ipmi_ssif intel_cstate pps_c ore ses intel_uncore mei_me iTCO_wdt joydev ipmi_si iTCO_vendor_support i2c_i801 enclosure mei ioatdma dca lpc_ich wmi ipmi_devintf pcspkr acpi_power_meter ipmi_msghandler acpi_pad ast i2c_algo_bit drm_vram_helper drm_kms_helper drm_ttm_helper ttm drm mpt3sas raid_class scsi_transport_sas [ 771.347818] CPU: 6 PID: 3448 Comm: test-ecmp-add-v Not tainted 5.7.0+ #1146 [ 771.348727] Hardware name: Supermicro SYS-2028TP-DECR/X10DRT-P, BIOS 2.0b 03/30/2017 [ 771.349646] RIP: 0010:tcf_block_offload_unbind.isra.0+0x5c/0x60 [ 771.350553] Code: 4a fd ff ff 83 f8 a1 74 0e 5b 4c 89 e7 5d 41 5c 41 5d e9 07 93 89 ff 8b 83 a0 00 00 00 8d 50 ff 89 93 a0 00 00 00 85 c0 75 df <0f> 0b eb db 0f 1f 44 00 00 41 57 41 56 41 55 41 89 cd 41 54 49 89 [ 771.352420] RSP: 0018:ffffb33144cd3b00 EFLAGS: 00010246 [ 771.353353] RAX: 0000000000000000 RBX: ffff8b37cf4b2800 RCX: 0000000000000000 [ 771.354294] RDX: 00000000ffffffff RSI: ffff8b3b9aad0000 RDI: ffffffff8d5c6e20 [ 771.355245] RBP: ffff8b37eb546948 R08: ffffffffc0b7a348 R09: ffff8b3b9aad0000 [ 771.356189] R10: 0000000000000001 R11: ffff8b3ba7a0a1c0 R12: ffff8b37cf4b2850 [ 771.357123] R13: ffff8b3b9aad0000 R14: ffff8b37cf4b2820 R15: ffff8b37cf4b2820 [ 771.358039] FS: 00007f8a19b6e740(0000) GS:ffff8b3befa00000(0000) knlGS:0000000000000000 [ 771.358965] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 771.359885] CR2: 00007f3afb91c1a0 CR3: 000000045133c004 CR4: 00000000001606e0 [ 771.360825] Call Trace: [ 771.361764] __tcf_block_put+0x84/0x150 [ 771.362712] ingress_destroy+0x1b/0x20 [sch_ingress] [ 771.363658] qdisc_destroy+0x3e/0xc0 [ 771.364594] dev_shutdown+0x7a/0xa5 [ 771.365522] rollback_registered_many+0x20d/0x530 [ 771.366458] ? netdev_upper_dev_unlink+0x15d/0x1c0 [ 771.367387] unregister_netdevice_many.part.0+0xf/0x70 [ 771.368310] vxlan_netdevice_event+0xa4/0x110 [vxlan] [ 771.369454] notifier_call_chain+0x4c/0x70 [ 771.370579] rollback_registered_many+0x2f5/0x530 [ 771.371719] rollback_registered+0x56/0x90 [ 771.372843] unregister_netdevice_queue+0x73/0xb0 [ 771.373982] unregister_netdev+0x18/0x20 [ 771.375168] mlx5e_vport_rep_unload+0x56/0xc0 [mlx5_core] [ 771.376327] esw_offloads_disable+0x81/0x90 [mlx5_core] [ 771.377512] mlx5_eswitch_disable_locked.cold+0xcb/0x1af [mlx5_core] [ 771.378679] mlx5_eswitch_disable+0x44/0x60 [mlx5_core] [ 771.379822] mlx5_device_disable_sriov+0xad/0xb0 [mlx5_core] [ 771.380968] mlx5_core_sriov_configure+0xc1/0xe0 [mlx5_core] [ 771.382087] sriov_numvfs_store+0xfc/0x130 [ 771.383195] kernfs_fop_write+0xce/0x1b0 [ 771.384302] vfs_write+0xb6/0x1a0 [ 771.385410] ksys_write+0x5f/0xe0 [ 771.386500] do_syscall_64+0x5b/0x1d0 [ 771.387569] entry_SYSCALL_64_after_hwframe+0x44/0xa9 Fixes: 0fdcf78d5973 ("net: use flow_indr_dev_setup_offload()") Signed-off-by: wenxu <wenxu@ucloud.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-19net: flow_offload: fix flow_indr_dev_unregister pathwenxu
If the representor is removed, then identify the indirect flow_blocks that need to be removed by the release callback and the port representor structure. To identify the port representor structure, a new indr.cb_priv field needs to be introduced. The flow_block also needs to be removed from the driver list from the cleanup path. Fixes: 1fac52da5942 ("net: flow_offload: consolidate indirect flow_block infrastructure") Signed-off-by: wenxu <wenxu@ucloud.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-19flow_offload: use flow_indr_block_cb_alloc/remove functionwenxu
Prepare fix the bug in the next patch. use flow_indr_block_cb_alloc/remove function and remove the __flow_block_indr_binding. Signed-off-by: wenxu <wenxu@ucloud.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-19flow_offload: add flow_indr_block_cb_alloc/remove functionwenxu
Add flow_indr_block_cb_alloc/remove function for next fix patch. Signed-off-by: wenxu <wenxu@ucloud.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-19Merge tag 'rxrpc-fixes-20200618' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs David Howells says: ==================== rxrpc: Performance drop fix and other fixes Here are three fixes for rxrpc: (1) Fix a trace symbol mapping. It doesn't seem to let you map to "". (2) Fix the handling of the remote receive window size when it increases beyond the size we can support for our transmit window. (3) Fix a performance drop caused by retransmitted packets being accidentally marked as already ACK'd. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-19ipv6: icmp6: avoid indirect call for icmpv6_send()Eric Dumazet
If IPv6 is builtin, we do not need an expensive indirect call to reach icmp6_send(). v2: put inline keyword before the type to avoid sparse warnings. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-19Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2020-06-19 1) Fix double ESP trailer insertion in IPsec crypto offload if netif_xmit_frozen_or_stopped is true. From Huy Nguyen. 2) Merge fixup for "remove output_finish indirection from xfrm_state_afinfo". From Stephen Rothwell. 3) Select CRYPTO_SEQIV for ESP as this is needed for GCM and several other encryption algorithms. Also modernize the crypto algorithm selections for ESP and AH, remove those that are maked as "MUST NOT" and add those that are marked as "MUST" be implemented in RFC 8221. From Eric Biggers. Please note the merge conflict between commit: a7f7f6248d97 ("treewide: replace '---help---' in Kconfig files with 'help'") from Linus' tree and commits: 7d4e39195925 ("esp, ah: consolidate the crypto algorithm selections") be01369859b8 ("esp, ah: modernize the crypto algorithm selections") from the ipsec tree. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-19net: qos offload add flow status with dropped countPo Liu
This patch adds a drop frames counter to tc flower offloading. Reporting h/w dropped frames is necessary for some actions. Some actions like police action and the coming introduced stream gate action would produce dropped frames which is necessary for user. Status update shows how many filtered packets increasing and how many dropped in those packets. v2: Changes - Update commit comments suggest by Jiri Pirko. Signed-off-by: Po Liu <Po.Liu@nxp.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Reviewed-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-19Merge tag 'ceph-for-5.8-rc2' of git://github.com/ceph/ceph-clientLinus Torvalds
Pull ceph fixes from Ilya Dryomov: "An important follow-up for replica reads support that went into -rc1 and two target_copy() fixups" * tag 'ceph-for-5.8-rc2' of git://github.com/ceph/ceph-client: libceph: don't omit used_replica in target_copy() libceph: don't omit recovery_deletes in target_copy() libceph: move away from global osd_req_flags
2020-06-18net: increment xmit_recursion level in dev_direct_xmit()Eric Dumazet
Back in commit f60e5990d9c1 ("ipv6: protect skb->sk accesses from recursive dereference inside the stack") Hannes added code so that IPv6 stack would not trust skb->sk for typical cases where packet goes through 'standard' xmit path (__dev_queue_xmit()) Alas af_packet had a dev_direct_xmit() path that was not dealing yet with xmit_recursion level. Also change sk_mc_loop() to dump a stack once only. Without this patch, syzbot was able to trigger : [1] [ 153.567378] WARNING: CPU: 7 PID: 11273 at net/core/sock.c:721 sk_mc_loop+0x51/0x70 [ 153.567378] Modules linked in: nfnetlink ip6table_raw ip6table_filter iptable_raw iptable_nat nf_nat nf_conntrack nf_defrag_ipv4 nf_defrag_ipv6 iptable_filter macsec macvtap tap macvlan 8021q hsr wireguard libblake2s blake2s_x86_64 libblake2s_generic udp_tunnel ip6_udp_tunnel libchacha20poly1305 poly1305_x86_64 chacha_x86_64 libchacha curve25519_x86_64 libcurve25519_generic netdevsim batman_adv dummy team bridge stp llc w1_therm wire i2c_mux_pca954x i2c_mux cdc_acm ehci_pci ehci_hcd mlx4_en mlx4_ib ib_uverbs ib_core mlx4_core [ 153.567386] CPU: 7 PID: 11273 Comm: b159172088 Not tainted 5.8.0-smp-DEV #273 [ 153.567387] RIP: 0010:sk_mc_loop+0x51/0x70 [ 153.567388] Code: 66 83 f8 0a 75 24 0f b6 4f 12 b8 01 00 00 00 31 d2 d3 e0 a9 bf ef ff ff 74 07 48 8b 97 f0 02 00 00 0f b6 42 3a 83 e0 01 5d c3 <0f> 0b b8 01 00 00 00 5d c3 0f b6 87 18 03 00 00 5d c0 e8 04 83 e0 [ 153.567388] RSP: 0018:ffff95c69bb93990 EFLAGS: 00010212 [ 153.567388] RAX: 0000000000000011 RBX: ffff95c6e0ee3e00 RCX: 0000000000000007 [ 153.567389] RDX: ffff95c69ae50000 RSI: ffff95c6c30c3000 RDI: ffff95c6c30c3000 [ 153.567389] RBP: ffff95c69bb93990 R08: ffff95c69a77f000 R09: 0000000000000008 [ 153.567389] R10: 0000000000000040 R11: 00003e0e00026128 R12: ffff95c6c30c3000 [ 153.567390] R13: ffff95c6cc4fd500 R14: ffff95c6f84500c0 R15: ffff95c69aa13c00 [ 153.567390] FS: 00007fdc3a283700(0000) GS:ffff95c6ff9c0000(0000) knlGS:0000000000000000 [ 153.567390] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 153.567391] CR2: 00007ffee758e890 CR3: 0000001f9ba20003 CR4: 00000000001606e0 [ 153.567391] Call Trace: [ 153.567391] ip6_finish_output2+0x34e/0x550 [ 153.567391] __ip6_finish_output+0xe7/0x110 [ 153.567391] ip6_finish_output+0x2d/0xb0 [ 153.567392] ip6_output+0x77/0x120 [ 153.567392] ? __ip6_finish_output+0x110/0x110 [ 153.567392] ip6_local_out+0x3d/0x50 [ 153.567392] ipvlan_queue_xmit+0x56c/0x5e0 [ 153.567393] ? ksize+0x19/0x30 [ 153.567393] ipvlan_start_xmit+0x18/0x50 [ 153.567393] dev_direct_xmit+0xf3/0x1c0 [ 153.567393] packet_direct_xmit+0x69/0xa0 [ 153.567394] packet_sendmsg+0xbf0/0x19b0 [ 153.567394] ? plist_del+0x62/0xb0 [ 153.567394] sock_sendmsg+0x65/0x70 [ 153.567394] sock_write_iter+0x93/0xf0 [ 153.567394] new_sync_write+0x18e/0x1a0 [ 153.567395] __vfs_write+0x29/0x40 [ 153.567395] vfs_write+0xb9/0x1b0 [ 153.567395] ksys_write+0xb1/0xe0 [ 153.567395] __x64_sys_write+0x1a/0x20 [ 153.567395] do_syscall_64+0x43/0x70 [ 153.567396] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 153.567396] RIP: 0033:0x453549 [ 153.567396] Code: Bad RIP value. [ 153.567396] RSP: 002b:00007fdc3a282cc8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 [ 153.567397] RAX: ffffffffffffffda RBX: 00000000004d32d0 RCX: 0000000000453549 [ 153.567397] RDX: 0000000000000020 RSI: 0000000020000300 RDI: 0000000000000003 [ 153.567398] RBP: 00000000004d32d8 R08: 0000000000000000 R09: 0000000000000000 [ 153.567398] R10: 0000000000000000 R11: 0000000000000246 R12: 00000000004d32dc [ 153.567398] R13: 00007ffee742260f R14: 00007fdc3a282dc0 R15: 00007fdc3a283700 [ 153.567399] ---[ end trace c1d5ae2b1059ec62 ]--- f60e5990d9c1 ("ipv6: protect skb->sk accesses from recursive dereference inside the stack") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-18net: tso: add UDP segmentation supportEric Dumazet
Note that like TCP, we do not support additional encapsulations, and that checksums must be offloaded to the NIC. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-18net: tso: cache transport header lengthEric Dumazet
Add tlen field into struct tso_t, and change tso_start() to return skb_transport_offset(skb) + tso->tlen This removes from callers the need to use tcp_hdrlen(skb) and will ease UDP segmentation offload addition. v2: calls tso_start() earlier in otx2_sq_append_tso() [Jakub] Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-18net: tso: constify tso_count_descs() and friendsEric Dumazet
skb argument of tso_count_descs(), tso_build_hdr() and tso_build_data() can be const. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-18net: ethtool: add missing NETIF_F_GSO_FRAGLIST feature stringAlexander Lobakin
Commit 3b33583265ed ("net: Add fraglist GRO/GSO feature flags") missed an entry for NETIF_F_GSO_FRAGLIST in netdev_features_strings array. As a result, fraglist GSO feature is not shown in 'ethtool -k' output and can't be toggled on/off. The fix is trivial. Fixes: 3b33583265ed ("net: Add fraglist GRO/GSO feature flags") Signed-off-by: Alexander Lobakin <alobakin@pm.me> Reviewed-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-18net: napi: remove useless stack traceEric Dumazet
Whenever a buggy NAPI driver returns more than its budget, we emit a stack trace that is of no use, since it does not tell which driver is buggy. Instead, emit a message giving the function name, and a descriptive message. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-18mptcp: drop MP_JOIN request sock on syn cookiesPaolo Abeni
Currently any MPTCP socket using syn cookies will fallback to TCP at 3rd ack time. In case of MP_JOIN requests, the RFC mandate closing the child and sockets, but the existing error paths do not handle the syncookie scenario correctly. Address the issue always forcing the child shutdown in case of MP_JOIN fallback. Fixes: ae2dd7164943 ("mptcp: handle tcp fallback when using syn cookies") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-18mptcp: cache msk on MP_JOIN init_reqPaolo Abeni
The msk ownership is transferred to the child socket at 3rd ack time, so that we avoid more lookups later. If the request does not reach the 3rd ack, the MSK reference is dropped at request sock release time. As a side effect, fallback is now tracked by a NULL msk reference instead of zeroed 'mp_join' field. This will simplify the next patch. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-18net: Fix the arp error in some casesguodeqing
ie., $ ifconfig eth0 6.6.6.6 netmask 255.255.255.0 $ ip rule add from 6.6.6.6 table 6666 $ ip route add 9.9.9.9 via 6.6.6.6 $ ping -I 6.6.6.6 9.9.9.9 PING 9.9.9.9 (9.9.9.9) from 6.6.6.6 : 56(84) bytes of data. 3 packets transmitted, 0 received, 100% packet loss, time 2079ms $ arp Address HWtype HWaddress Flags Mask Iface 6.6.6.6 (incomplete) eth0 The arp request address is error, this is because fib_table_lookup in fib_check_nh lookup the destnation 9.9.9.9 nexthop, the scope of the fib result is RT_SCOPE_LINK,the correct scope is RT_SCOPE_HOST. Here I add a check of whether this is RT_TABLE_MAIN to solve this problem. Fixes: 3bfd847203c6 ("net: Use passed in table for nexthop lookups") Signed-off-by: guodeqing <geffrey.guo@huawei.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-18net/sched: act_gate: fix configuration of the periodic timerDavide Caratti
assigning a dummy value of 'clock_id' to avoid cancellation of the cycle timer before its initialization was a temporary solution, and we still need to handle the case where act_gate timer parameters are changed by commands like the following one: # tc action replace action gate <parameters> the fix consists in the following items: 1) remove the workaround assignment of 'clock_id', and init the list of entries before the first error path after IDR atomic check/allocation 2) validate 'clock_id' earlier: there is no need to do IDR atomic check/allocation if we know that 'clock_id' is a bad value 3) use a dedicated function, 'gate_setup_timer()', to ensure that the timer is cancelled and re-initialized on action overwrite, and also ensure we initialize the timer in the error path of tcf_gate_init() v3: improve comment in the error path of tcf_gate_init() (thanks to Vladimir Oltean) v2: avoid 'goto' in gate_setup_timer (thanks to Cong Wang) CC: Ivan Vecera <ivecera@redhat.com> Fixes: a01c245438c5 ("net/sched: fix a couple of splats in the error path of tfc_gate_init()") Fixes: a51c328df310 ("net: qos: introduce a gate control flow action") Signed-off-by: Davide Caratti <dcaratti@redhat.com> Acked-by: Vladimir Oltean <vladimir.oltean@nxp.com> Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-18net/sched: act_gate: fix NULL dereference in tcf_gate_init()Davide Caratti
it is possible to see a KASAN use-after-free, immediately followed by a NULL dereference crash, with the following command: # tc action add action gate index 3 cycle-time 100000000ns \ > cycle-time-ext 100000000ns clockid CLOCK_TAI BUG: KASAN: use-after-free in tcf_action_init_1+0x8eb/0x960 Write of size 1 at addr ffff88810a5908bc by task tc/883 CPU: 0 PID: 883 Comm: tc Not tainted 5.7.0+ #188 Hardware name: Red Hat KVM, BIOS 1.11.1-4.module+el8.1.0+4066+0f1aadab 04/01/2014 Call Trace: dump_stack+0x75/0xa0 print_address_description.constprop.6+0x1a/0x220 kasan_report.cold.9+0x37/0x7c tcf_action_init_1+0x8eb/0x960 tcf_action_init+0x157/0x2a0 tcf_action_add+0xd9/0x2f0 tc_ctl_action+0x2a3/0x39d rtnetlink_rcv_msg+0x5f3/0x920 netlink_rcv_skb+0x120/0x380 netlink_unicast+0x439/0x630 netlink_sendmsg+0x714/0xbf0 sock_sendmsg+0xe2/0x110 ____sys_sendmsg+0x5b4/0x890 ___sys_sendmsg+0xe9/0x160 __sys_sendmsg+0xd3/0x170 do_syscall_64+0x9a/0x370 entry_SYSCALL_64_after_hwframe+0x44/0xa9 [...] KASAN: null-ptr-deref in range [0x0000000000000070-0x0000000000000077] CPU: 0 PID: 883 Comm: tc Tainted: G B 5.7.0+ #188 Hardware name: Red Hat KVM, BIOS 1.11.1-4.module+el8.1.0+4066+0f1aadab 04/01/2014 RIP: 0010:tcf_action_fill_size+0xa3/0xf0 [....] RSP: 0018:ffff88813a48f250 EFLAGS: 00010212 RAX: dffffc0000000000 RBX: 0000000000000094 RCX: ffffffffa47c3eb6 RDX: 000000000000000e RSI: 0000000000000008 RDI: 0000000000000070 RBP: ffff88810a590800 R08: 0000000000000004 R09: ffffed1027491e03 R10: 0000000000000003 R11: ffffed1027491e03 R12: 0000000000000000 R13: 0000000000000000 R14: dffffc0000000000 R15: ffff88810a590800 FS: 00007f62cae8ce40(0000) GS:ffff888147c00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f62c9d20a10 CR3: 000000013a52a000 CR4: 0000000000340ef0 Call Trace: tcf_action_init+0x172/0x2a0 tcf_action_add+0xd9/0x2f0 tc_ctl_action+0x2a3/0x39d rtnetlink_rcv_msg+0x5f3/0x920 netlink_rcv_skb+0x120/0x380 netlink_unicast+0x439/0x630 netlink_sendmsg+0x714/0xbf0 sock_sendmsg+0xe2/0x110 ____sys_sendmsg+0x5b4/0x890 ___sys_sendmsg+0xe9/0x160 __sys_sendmsg+0xd3/0x170 do_syscall_64+0x9a/0x370 entry_SYSCALL_64_after_hwframe+0x44/0xa9 this is caused by the test on 'cycletime_ext', that is still unassigned when the action is newly created. This makes the action .init() return 0 without calling tcf_idr_insert(), hence the UAF + crash. rework the logic that prevents zero values of cycle-time, as follows: 1) 'tcfg_cycletime_ext' seems to be unused in the action software path, and it was already possible by other means to obtain non-zero cycletime and zero cycletime-ext. So, removing that test should not cause any damage. 2) while at it, we must prevent overwriting configuration data with wrong ones: use a temporary variable for 'tcfg_cycletime', and validate it preserving the original semantic (that allowed computing the cycle time as the sum of all intervals, when not specified by TCA_GATE_CYCLE_TIME). 3) remove the test on 'tcfg_cycletime', no more useful, and avoid returning -EFAULT, which did not seem an appropriate return value for a wrong netlink attribute. v3: fix uninitialized 'cycletime' (thanks to Vladimir Oltean) v2: remove useless 'return;' at the end of void gate_get_start_time() Fixes: a51c328df310 ("net: qos: introduce a gate control flow action") CC: Ivan Vecera <ivecera@redhat.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Acked-by: Vladimir Oltean <vladimir.oltean@nxp.com> Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-18ip_tunnel: fix use-after-free in ip_tunnel_lookup()Taehee Yoo
In the datapath, the ip_tunnel_lookup() is used and it internally uses fallback tunnel device pointer, which is fb_tunnel_dev. This pointer variable should be set to NULL when a fb interface is deleted. But there is no routine to set fb_tunnel_dev pointer to NULL. So, this pointer will be still used after interface is deleted and it eventually results in the use-after-free problem. Test commands: ip netns add A ip netns add B ip link add eth0 type veth peer name eth1 ip link set eth0 netns A ip link set eth1 netns B ip netns exec A ip link set lo up ip netns exec A ip link set eth0 up ip netns exec A ip link add gre1 type gre local 10.0.0.1 \ remote 10.0.0.2 ip netns exec A ip link set gre1 up ip netns exec A ip a a 10.0.100.1/24 dev gre1 ip netns exec A ip a a 10.0.0.1/24 dev eth0 ip netns exec B ip link set lo up ip netns exec B ip link set eth1 up ip netns exec B ip link add gre1 type gre local 10.0.0.2 \ remote 10.0.0.1 ip netns exec B ip link set gre1 up ip netns exec B ip a a 10.0.100.2/24 dev gre1 ip netns exec B ip a a 10.0.0.2/24 dev eth1 ip netns exec A hping3 10.0.100.2 -2 --flood -d 60000 & ip netns del B Splat looks like: [ 77.793450][ C3] ================================================================== [ 77.794702][ C3] BUG: KASAN: use-after-free in ip_tunnel_lookup+0xcc4/0xf30 [ 77.795573][ C3] Read of size 4 at addr ffff888060bd9c84 by task hping3/2905 [ 77.796398][ C3] [ 77.796664][ C3] CPU: 3 PID: 2905 Comm: hping3 Not tainted 5.8.0-rc1+ #616 [ 77.797474][ C3] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 [ 77.798453][ C3] Call Trace: [ 77.798815][ C3] <IRQ> [ 77.799142][ C3] dump_stack+0x9d/0xdb [ 77.799605][ C3] print_address_description.constprop.7+0x2cc/0x450 [ 77.800365][ C3] ? ip_tunnel_lookup+0xcc4/0xf30 [ 77.800908][ C3] ? ip_tunnel_lookup+0xcc4/0xf30 [ 77.801517][ C3] ? ip_tunnel_lookup+0xcc4/0xf30 [ 77.802145][ C3] kasan_report+0x154/0x190 [ 77.802821][ C3] ? ip_tunnel_lookup+0xcc4/0xf30 [ 77.803503][ C3] ip_tunnel_lookup+0xcc4/0xf30 [ 77.804165][ C3] __ipgre_rcv+0x1ab/0xaa0 [ip_gre] [ 77.804862][ C3] ? rcu_read_lock_sched_held+0xc0/0xc0 [ 77.805621][ C3] gre_rcv+0x304/0x1910 [ip_gre] [ 77.806293][ C3] ? lock_acquire+0x1a9/0x870 [ 77.806925][ C3] ? gre_rcv+0xfe/0x354 [gre] [ 77.807559][ C3] ? erspan_xmit+0x2e60/0x2e60 [ip_gre] [ 77.808305][ C3] ? rcu_read_lock_sched_held+0xc0/0xc0 [ 77.809032][ C3] ? rcu_read_lock_held+0x90/0xa0 [ 77.809713][ C3] gre_rcv+0x1b8/0x354 [gre] [ ... ] Suggested-by: Eric Dumazet <eric.dumazet@gmail.com> Fixes: c54419321455 ("GRE: Refactor GRE tunneling code.") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-18ip6_gre: fix use-after-free in ip6gre_tunnel_lookup()Taehee Yoo
In the datapath, the ip6gre_tunnel_lookup() is used and it internally uses fallback tunnel device pointer, which is fb_tunnel_dev. This pointer variable should be set to NULL when a fb interface is deleted. But there is no routine to set fb_tunnel_dev pointer to NULL. So, this pointer will be still used after interface is deleted and it eventually results in the use-after-free problem. Test commands: ip netns add A ip netns add B ip link add eth0 type veth peer name eth1 ip link set eth0 netns A ip link set eth1 netns B ip netns exec A ip link set lo up ip netns exec A ip link set eth0 up ip netns exec A ip link add ip6gre1 type ip6gre local fc:0::1 \ remote fc:0::2 ip netns exec A ip -6 a a fc:100::1/64 dev ip6gre1 ip netns exec A ip link set ip6gre1 up ip netns exec A ip -6 a a fc:0::1/64 dev eth0 ip netns exec A ip link set ip6gre0 up ip netns exec B ip link set lo up ip netns exec B ip link set eth1 up ip netns exec B ip link add ip6gre1 type ip6gre local fc:0::2 \ remote fc:0::1 ip netns exec B ip -6 a a fc:100::2/64 dev ip6gre1 ip netns exec B ip link set ip6gre1 up ip netns exec B ip -6 a a fc:0::2/64 dev eth1 ip netns exec B ip link set ip6gre0 up ip netns exec A ping fc:100::2 -s 60000 & ip netns del B Splat looks like: [ 73.087285][ C1] BUG: KASAN: use-after-free in ip6gre_tunnel_lookup+0x1064/0x13f0 [ip6_gre] [ 73.088361][ C1] Read of size 4 at addr ffff888040559218 by task ping/1429 [ 73.089317][ C1] [ 73.089638][ C1] CPU: 1 PID: 1429 Comm: ping Not tainted 5.7.0+ #602 [ 73.090531][ C1] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 [ 73.091725][ C1] Call Trace: [ 73.092160][ C1] <IRQ> [ 73.092556][ C1] dump_stack+0x96/0xdb [ 73.093122][ C1] print_address_description.constprop.6+0x2cc/0x450 [ 73.094016][ C1] ? ip6gre_tunnel_lookup+0x1064/0x13f0 [ip6_gre] [ 73.094894][ C1] ? ip6gre_tunnel_lookup+0x1064/0x13f0 [ip6_gre] [ 73.095767][ C1] ? ip6gre_tunnel_lookup+0x1064/0x13f0 [ip6_gre] [ 73.096619][ C1] kasan_report+0x154/0x190 [ 73.097209][ C1] ? ip6gre_tunnel_lookup+0x1064/0x13f0 [ip6_gre] [ 73.097989][ C1] ip6gre_tunnel_lookup+0x1064/0x13f0 [ip6_gre] [ 73.098750][ C1] ? gre_del_protocol+0x60/0x60 [gre] [ 73.099500][ C1] gre_rcv+0x1c5/0x1450 [ip6_gre] [ 73.100199][ C1] ? ip6gre_header+0xf00/0xf00 [ip6_gre] [ 73.100985][ C1] ? rcu_read_lock_sched_held+0xc0/0xc0 [ 73.101830][ C1] ? ip6_input_finish+0x5/0xf0 [ 73.102483][ C1] ip6_protocol_deliver_rcu+0xcbb/0x1510 [ 73.103296][ C1] ip6_input_finish+0x5b/0xf0 [ 73.103920][ C1] ip6_input+0xcd/0x2c0 [ 73.104473][ C1] ? ip6_input_finish+0xf0/0xf0 [ 73.105115][ C1] ? rcu_read_lock_held+0x90/0xa0 [ 73.105783][ C1] ? rcu_read_lock_sched_held+0xc0/0xc0 [ 73.106548][ C1] ipv6_rcv+0x1f1/0x300 [ ... ] Suggested-by: Eric Dumazet <eric.dumazet@gmail.com> Fixes: c12b395a4664 ("gre: Support GRE over IPv6") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-18net: fix memleak in register_netdevice()Yang Yingliang
I got a memleak report when doing some fuzz test: unreferenced object 0xffff888112584000 (size 13599): comm "ip", pid 3048, jiffies 4294911734 (age 343.491s) hex dump (first 32 bytes): 74 61 70 30 00 00 00 00 00 00 00 00 00 00 00 00 tap0............ 00 ee d9 19 81 88 ff ff 00 00 00 00 00 00 00 00 ................ backtrace: [<000000002f60ba65>] __kmalloc_node+0x309/0x3a0 [<0000000075b211ec>] kvmalloc_node+0x7f/0xc0 [<00000000d3a97396>] alloc_netdev_mqs+0x76/0xfc0 [<00000000609c3655>] __tun_chr_ioctl+0x1456/0x3d70 [<000000001127ca24>] ksys_ioctl+0xe5/0x130 [<00000000b7d5e66a>] __x64_sys_ioctl+0x6f/0xb0 [<00000000e1023498>] do_syscall_64+0x56/0xa0 [<000000009ec0eb12>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 unreferenced object 0xffff888111845cc0 (size 8): comm "ip", pid 3048, jiffies 4294911734 (age 343.491s) hex dump (first 8 bytes): 74 61 70 30 00 88 ff ff tap0.... backtrace: [<000000004c159777>] kstrdup+0x35/0x70 [<00000000d8b496ad>] kstrdup_const+0x3d/0x50 [<00000000494e884a>] kvasprintf_const+0xf1/0x180 [<0000000097880a2b>] kobject_set_name_vargs+0x56/0x140 [<000000008fbdfc7b>] dev_set_name+0xab/0xe0 [<000000005b99e3b4>] netdev_register_kobject+0xc0/0x390 [<00000000602704fe>] register_netdevice+0xb61/0x1250 [<000000002b7ca244>] __tun_chr_ioctl+0x1cd1/0x3d70 [<000000001127ca24>] ksys_ioctl+0xe5/0x130 [<00000000b7d5e66a>] __x64_sys_ioctl+0x6f/0xb0 [<00000000e1023498>] do_syscall_64+0x56/0xa0 [<000000009ec0eb12>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 unreferenced object 0xffff88811886d800 (size 512): comm "ip", pid 3048, jiffies 4294911734 (age 343.491s) hex dump (first 32 bytes): 00 00 00 00 ad 4e ad de ff ff ff ff 00 00 00 00 .....N.......... ff ff ff ff ff ff ff ff c0 66 3d a3 ff ff ff ff .........f=..... backtrace: [<0000000050315800>] device_add+0x61e/0x1950 [<0000000021008dfb>] netdev_register_kobject+0x17e/0x390 [<00000000602704fe>] register_netdevice+0xb61/0x1250 [<000000002b7ca244>] __tun_chr_ioctl+0x1cd1/0x3d70 [<000000001127ca24>] ksys_ioctl+0xe5/0x130 [<00000000b7d5e66a>] __x64_sys_ioctl+0x6f/0xb0 [<00000000e1023498>] do_syscall_64+0x56/0xa0 [<000000009ec0eb12>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 If call_netdevice_notifiers() failed, then rollback_registered() calls netdev_unregister_kobject() which holds the kobject. The reference cannot be put because the netdev won't be add to todo list, so it will leads a memleak, we need put the reference to avoid memleak. Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-18bpf: sk_storage: Prefer to get a free cache_idxMartin KaFai Lau
The cache_idx is currently picked by RR. There is chance that the same cache_idx will be picked by multiple sk_storage_maps while other cache_idx is still unused. e.g. It could happen when the sk_storage_map is recreated during the restart of the user space process. This patch tracks the usage count for each cache_idx. There is 16 of them now (defined in BPF_SK_STORAGE_CACHE_SIZE). It will try to pick the free cache_idx. If none was found, it would pick one with the minimal usage count. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20200617174226.2301909-1-kafai@fb.com
2020-06-17ethtool: ioctl: Use array_size() in copy_to_user()Gustavo A. R. Silva
Use array_size() helper instead of the open-coded version in copy_to_user(). These sorts of multiplication factors need to be wrapped in array_size(). This issue was found with the help of Coccinelle and, audited and fixed manually. Addresses-KSPP-ID: https://github.com/KSPP/linux/issues/83 Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-17rxrpc: Fix afs large storage transmission performance dropDavid Howells
Commit 2ad6691d988c, which moved the modification of the status annotation for a packet in the Tx buffer prior to the retransmission moved the state clearance, but managed to lose the bit that set it to UNACK. Consequently, if a retransmission occurs, the packet is accidentally changed to the ACK state (ie. 0) by masking it off, which means that the packet isn't counted towards the tally of newly-ACK'd packets if it gets hard-ACK'd. This then prevents the congestion control algorithm from recovering properly. Fix by reinstating the change of state to UNACK. Spotted by the generic/460 xfstest. Fixes: 2ad6691d988c ("rxrpc: Fix race between incoming ACK parser and retransmitter") Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-17rxrpc: Fix handling of rwind from an ACK packetDavid Howells
The handling of the receive window size (rwind) from a received ACK packet is not correct. The rxrpc_input_ackinfo() function currently checks the current Tx window size against the rwind from the ACK to see if it has changed, but then limits the rwind size before storing it in the tx_winsize member and, if it increased, wake up the transmitting process. This means that if rwind > RXRPC_RXTX_BUFF_SIZE - 1, this path will always be followed. Fix this by limiting rwind before we compare it to tx_winsize. The effect of this can be seen by enabling the rxrpc_rx_rwind_change tracepoint. Fixes: 702f2ac87a9a ("rxrpc: Wake up the transmitter if Rx window size increases on the peer") Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller
Alexei Starovoitov says: ==================== pull-request: bpf 2020-06-17 The following pull-request contains BPF updates for your *net* tree. We've added 10 non-merge commits during the last 2 day(s) which contain a total of 14 files changed, 158 insertions(+), 59 deletions(-). The main changes are: 1) Important fix for bpf_probe_read_kernel_str() return value, from Andrii. 2) [gs]etsockopt fix for large optlen, from Stanislav. 3) devmap allocation fix, from Toke. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-17xdp: Handle frame_sz in xdp_convert_zc_to_xdp_frame()Hangbin Liu
In commit 34cc0b338a61 we only handled the frame_sz in convert_to_xdp_frame(). This patch will also handle frame_sz in xdp_convert_zc_to_xdp_frame(). Fixes: 34cc0b338a61 ("xdp: Xdp_frame add member frame_sz and handle in convert_to_xdp_frame") Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20200616103518.2963410-1-liuhangbin@gmail.com