summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2019-10-24xprtrdma: Report the computed connect delayChuck Lever
For debugging, the op_connect trace point should report the computed connect delay. We can then ensure that the delay is computed at the proper times, for example. As a further clean-up, remove a few low-value "heartbeat" trace points in the connect path. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-10-24xprtrdma: Wake tasks after connect worker failsChuck Lever
Pending tasks are currently never awoken when the connect worker fails. The reason is that XPRT_CONNECTED is always clear after a failure return of rpcrdma_ep_connect, thus the xprt_test_and_clear_connected() check in xprt_rdma_connect_worker() always fails. - xprt_rdma_close always clears XPRT_CONNECTED. - rpcrdma_ep_connect always clears XPRT_CONNECTED. After reviewing the TCP connect worker, it appears that there's no need for extra test_and_set paranoia in xprt_rdma_connect_worker. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-10-24xprtrdma: Pull up sometimesChuck Lever
On some platforms, DMA mapping part of a page is more costly than copying bytes. Restore the pull-up code and use that when we think it's going to be faster. The heuristic for now is to pull-up when the size of the RPC message body fits in the buffer underlying the head iovec. Indeed, not involving the I/O MMU can help the RPC/RDMA transport scale better for tiny I/Os across more RDMA devices. This is because interaction with the I/O MMU is eliminated, as is handling a Send completion, for each of these small I/Os. Without the explicit unmapping, the NIC no longer needs to do a costly internal TLB shoot down for buffers that are just a handful of bytes. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-10-24xprtrdma: Refactor rpcrdma_prepare_msg_sges()Chuck Lever
Refactor: Replace spaghetti with code that makes it plain what needs to be done for each rtype. This makes it easier to add features and optimizations. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-10-24xprtrdma: Move the rpcrdma_sendctx::sc_wr fieldChuck Lever
Clean up: This field is not needed in the Send completion handler, so it can be moved to struct rpcrdma_req to reduce the size of struct rpcrdma_sendctx, and to reduce the amount of memory that is sloshed between the sending process and the Send completion process. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-10-24xprtrdma: Remove rpcrdma_sendctx::sc_deviceChuck Lever
Micro-optimization: Save eight bytes in a frequently allocated structure. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-10-24xprtrdma: Remove rpcrdma_sendctx::sc_xprtChuck Lever
Micro-optimization: Save eight bytes in a frequently allocated structure. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-10-24xprtrdma: Ensure ri_id is stable during MR recyclingChuck Lever
ia->ri_id is replaced during a reconnect. The connect_worker runs with the transport send lock held to prevent ri_id from being dereferenced by the send_request path during this process. Currently, however, there is no guarantee that ia->ri_id is stable in the MR recycling worker, which operates in the background and is not serialized with the connect_worker in any way. But now that Local_Inv completions are being done in process context, we can handle the recycling operation there instead of deferring the recycling work to another process. Because the disconnect path drains all work before allowing tear down to proceed, it is guaranteed that Local Invalidations complete only while the ri_id pointer is stable. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-10-24xprtrdma: Manage MRs in context of a single connectionChuck Lever
MRs are now allocated on demand so we can safely throw them away on disconnect. This way an idle transport can disconnect and it won't pin hardware MR resources. Two additional changes: - Now that all MRs are destroyed on disconnect, there's no need to check during header marshaling if a req has MRs to recycle. Each req is sent only once per connection, and now rl_registered is guaranteed to be empty when rpcrdma_marshal_req is invoked. - Because MRs are now destroyed in a WQ_MEM_RECLAIM context, they also must be allocated in a WQ_MEM_RECLAIM context. This reduces the likelihood that device driver memory allocation will trigger memory reclaim during NFS writeback. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-10-24xprtrdma: Fix MR list handlingChuck Lever
Close some holes introduced by commit 6dc6ec9e04c4 ("xprtrdma: Cache free MRs in each rpcrdma_req") that could result in list corruption. In addition, the result that is tabulated in @count is no longer used, so @count is removed. Fixes: 6dc6ec9e04c4 ("xprtrdma: Cache free MRs in each rpcrdma_req") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-10-24xprtrdma: Close window between waking RPC senders and posting ReceivesChuck Lever
A recent clean up attempted to separate Receive handling and RPC Reply processing, in the name of clean layering. Unfortunately, we can't do this because the Receive Queue has to be refilled _after_ the most recent credit update from the responder is parsed from the transport header, but _before_ we wake up the next RPC sender. That is right in the middle of rpcrdma_reply_handler(). Usually this isn't a problem because current responder implementations don't vary their credit grant. The one exception is when a connection is established: the grant goes from one to a much larger number on the first Receive. The requester MUST post enough Receives right then so that any outstanding requests can be sent without risking RNR and connection loss. Fixes: 6ceea36890a0 ("xprtrdma: Refactor Receive accounting") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-10-24xprtrdma: Initialize rb_credits in one placeChuck Lever
Clean up/code de-duplication. Nit: RPC_CWNDSHIFT is incorrect as the initial value for xprt->cwnd. This mistake does not appear to have operational consequences, since the cwnd value is replaced with a valid value upon the first Receive completion. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-10-24xprtrdma: Connection becomes unstable after a reconnectChuck Lever
This is because xprt_request_get_cong() is allowing more than one RPC Call to be transmitted before the first Receive on the new connection. The first Receive fills the Receive Queue based on the server's credit grant. Before that Receive, there is only a single Receive WR posted because the client doesn't know the server's credit grant. Solution is to clear rq_cong on all outstanding rpc_rqsts when the the cwnd is reset. This is because an RPC/RDMA credit is good for one connection instance only. Fixes: 75891f502f5f ("SUNRPC: Support for congestion control ... ") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-10-24xprtrdma: Add unique trace points for posting Local Invalidate WRsChuck Lever
When adding frwr_unmap_async way back when, I re-used the existing trace_xprtrdma_post_send() trace point to record the return code of ib_post_send. Unfortunately there are some cases where re-using that trace point causes a crash. Instead, construct a trace point specific to posting Local Invalidate WRs that will always be safe to use in that context, and will act as a trace log eye-catcher for Local Invalidation. Fixes: 847568942f93 ("xprtrdma: Remove fr_state") Fixes: d8099feda483 ("xprtrdma: Reduce context switching due ... ") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Bill Baker <bill.baker@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-10-24SUNRPC: Add trace points to observe transport congestion controlChuck Lever
To help debug problems with RPC/RDMA credit management, replace dprintk() call sites in the transport send lock paths with trace events. Similar trace points are defined for the non-congestion paths. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-10-24SUNRPC: Eliminate log noise in call_reserveresultChuck Lever
Sep 11 16:35:20 manet kernel: call_reserveresult: unrecognized error -512, exiting Diagnostic error messages such as this likely have no value for NFS client administrators. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-10-24netfilter: nft_payload: fix missing check for matching length in offloadswenxu
Payload offload rule should also check the length of the match. Moreover, check for unsupported link-layer fields: nft --debug=netlink add rule firewall zones vlan id 100 ... [ payload load 2b @ link header + 0 => reg 1 ] this loads 2byte base on ll header and offset 0. This also fixes unsupported raw payload match. Fixes: 92ad6325cb89 ("netfilter: nf_tables: add hardware offload support") Signed-off-by: wenxu <wenxu@ucloud.cn> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-10-24ipvs: move old_secure_tcp into struct netns_ipvsEric Dumazet
syzbot reported the following issue : BUG: KCSAN: data-race in update_defense_level / update_defense_level read to 0xffffffff861a6260 of 4 bytes by task 3006 on cpu 1: update_defense_level+0x621/0xb30 net/netfilter/ipvs/ip_vs_ctl.c:177 defense_work_handler+0x3d/0xd0 net/netfilter/ipvs/ip_vs_ctl.c:225 process_one_work+0x3d4/0x890 kernel/workqueue.c:2269 worker_thread+0xa0/0x800 kernel/workqueue.c:2415 kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352 write to 0xffffffff861a6260 of 4 bytes by task 7333 on cpu 0: update_defense_level+0xa62/0xb30 net/netfilter/ipvs/ip_vs_ctl.c:205 defense_work_handler+0x3d/0xd0 net/netfilter/ipvs/ip_vs_ctl.c:225 process_one_work+0x3d4/0x890 kernel/workqueue.c:2269 worker_thread+0xa0/0x800 kernel/workqueue.c:2415 kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352 Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 7333 Comm: kworker/0:5 Not tainted 5.4.0-rc3+ #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Workqueue: events defense_work_handler Indeed, old_secure_tcp is currently a static variable, while it needs to be a per netns variable. Fixes: a0840e2e165a ("IPVS: netns, ip_vs_ctl local vars moved to ipvs struct.") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Simon Horman <horms@verge.net.au>
2019-10-24ipvs: don't ignore errors in case refcounting ip_vs module failsDavide Caratti
if the IPVS module is removed while the sync daemon is starting, there is a small gap where try_module_get() might fail getting the refcount inside ip_vs_use_count_inc(). Then, the refcounts of IPVS module are unbalanced, and the subsequent call to stop_sync_thread() causes the following splat: WARNING: CPU: 0 PID: 4013 at kernel/module.c:1146 module_put.part.44+0x15b/0x290 Modules linked in: ip_vs(-) nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 veth ip6table_filter ip6_tables iptable_filter binfmt_misc intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ext4 mbcache jbd2 ghash_clmulni_intel snd_hda_codec_generic ledtrig_audio snd_hda_intel snd_intel_nhlt snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm aesni_intel crypto_simd cryptd glue_helper joydev pcspkr snd_timer virtio_balloon snd soundcore i2c_piix4 nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c ata_generic pata_acpi virtio_net net_failover virtio_blk failover virtio_console qxl drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ata_piix ttm crc32c_intel serio_raw drm virtio_pci libata virtio_ring virtio floppy dm_mirror dm_region_hash dm_log dm_mod [last unloaded: nf_defrag_ipv6] CPU: 0 PID: 4013 Comm: modprobe Tainted: G W 5.4.0-rc1.upstream+ #741 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 RIP: 0010:module_put.part.44+0x15b/0x290 Code: 04 25 28 00 00 00 0f 85 18 01 00 00 48 83 c4 68 5b 5d 41 5c 41 5d 41 5e 41 5f c3 89 44 24 28 83 e8 01 89 c5 0f 89 57 ff ff ff <0f> 0b e9 78 ff ff ff 65 8b 1d 67 83 26 4a 89 db be 08 00 00 00 48 RSP: 0018:ffff888050607c78 EFLAGS: 00010297 RAX: 0000000000000003 RBX: ffffffffc1420590 RCX: ffffffffb5db0ef9 RDX: 0000000000000000 RSI: 0000000000000004 RDI: ffffffffc1420590 RBP: 00000000ffffffff R08: fffffbfff82840b3 R09: fffffbfff82840b3 R10: 0000000000000001 R11: fffffbfff82840b2 R12: 1ffff1100a0c0f90 R13: ffffffffc1420200 R14: ffff88804f533300 R15: ffff88804f533ca0 FS: 00007f8ea9720740(0000) GS:ffff888053800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f3245abe000 CR3: 000000004c28a006 CR4: 00000000001606f0 Call Trace: stop_sync_thread+0x3a3/0x7c0 [ip_vs] ip_vs_sync_net_cleanup+0x13/0x50 [ip_vs] ops_exit_list.isra.5+0x94/0x140 unregister_pernet_operations+0x29d/0x460 unregister_pernet_device+0x26/0x60 ip_vs_cleanup+0x11/0x38 [ip_vs] __x64_sys_delete_module+0x2d5/0x400 do_syscall_64+0xa5/0x4e0 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x7f8ea8bf0db7 Code: 73 01 c3 48 8b 0d b9 80 2c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 89 80 2c 00 f7 d8 64 89 01 48 RSP: 002b:00007ffcd38d2fe8 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0 RAX: ffffffffffffffda RBX: 0000000002436240 RCX: 00007f8ea8bf0db7 RDX: 0000000000000000 RSI: 0000000000000800 RDI: 00000000024362a8 RBP: 0000000000000000 R08: 00007f8ea8eba060 R09: 00007f8ea8c658a0 R10: 00007ffcd38d2a60 R11: 0000000000000206 R12: 0000000000000000 R13: 0000000000000001 R14: 00000000024362a8 R15: 0000000000000000 irq event stamp: 4538 hardirqs last enabled at (4537): [<ffffffffb6193dde>] quarantine_put+0x9e/0x170 hardirqs last disabled at (4538): [<ffffffffb5a0556a>] trace_hardirqs_off_thunk+0x1a/0x20 softirqs last enabled at (4522): [<ffffffffb6f8ebe9>] sk_common_release+0x169/0x2d0 softirqs last disabled at (4520): [<ffffffffb6f8eb3e>] sk_common_release+0xbe/0x2d0 Check the return value of ip_vs_use_count_inc() and let its caller return proper error. Inside do_ip_vs_set_ctl() the module is already refcounted, we don't need refcount/derefcount there. Finally, in register_ip_vs_app() and start_sync_thread(), take the module refcount earlier and ensure it's released in the error path. Change since v1: - better return values in case of failure of ip_vs_use_count_inc(), thanks to Julian Anastasov - no need to increase/decrease the module refcount in ip_vs_set_ctl(), thanks to Julian Anastasov Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2019-10-23xsk: Fix registration of Rx-only socketsMagnus Karlsson
Having Rx-only AF_XDP sockets can potentially lead to a crash in the system by a NULL pointer dereference in xsk_umem_consume_tx(). This function iterates through a list of all sockets tied to a umem and checks if there are any packets to send on the Tx ring. Rx-only sockets do not have a Tx ring, so this will cause a NULL pointer dereference. This will happen if you have registered one or more Rx-only sockets to a umem and the driver is checking the Tx ring even on Rx, or if the XDP_SHARED_UMEM mode is used and there is a mix of Rx-only and other sockets tied to the same umem. Fixed by only putting sockets with a Tx component on the list that xsk_umem_consume_tx() iterates over. Fixes: ac98d8aab61b ("xsk: wire upp Tx zero-copy functions") Reported-by: Kal Cutter Conley <kal.conley@dectris.com> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com> Link: https://lore.kernel.org/bpf/1571645818-16244-1-git-send-email-magnus.karlsson@intel.com
2019-10-23net/flow_dissector: switch to siphashEric Dumazet
UDP IPv6 packets auto flowlabels are using a 32bit secret (static u32 hashrnd in net/core/flow_dissector.c) and apply jhash() over fields known by the receivers. Attackers can easily infer the 32bit secret and use this information to identify a device and/or user, since this 32bit secret is only set at boot time. Really, using jhash() to generate cookies sent on the wire is a serious security concern. Trying to change the rol32(hash, 16) in ip6_make_flowlabel() would be a dead end. Trying to periodically change the secret (like in sch_sfq.c) could change paths taken in the network for long lived flows. Let's switch to siphash, as we did in commit df453700e8d8 ("inet: switch IP ID generator to siphash") Using a cryptographically strong pseudo random function will solve this privacy issue and more generally remove other weak points in the stack. Packet schedulers using skb_get_hash_perturb() benefit from this change. Fixes: b56774163f99 ("ipv6: Enable auto flow labels by default") Fixes: 42240901f7c4 ("ipv6: Implement different admin modes for automatic flow labels") Fixes: 67800f9b1f4e ("ipv6: Call skb_get_hash_flowi6 to get skb->hash in ip6_make_flowlabel") Fixes: cb1ce2ef387b ("ipv6: Implement automatic flow label generation on transmit") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Jonathan Berger <jonathann1@walla.com> Reported-by: Amit Klein <aksecurity@gmail.com> Reported-by: Benny Pinkas <benny@pinkas.net> Cc: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-23compat_ioctl: move SIOCOUTQ out of compat_ioctl.cArnd Bergmann
All users of this call are in socket or tty code, so handling it there means we can avoid the table entry in fs/compat_ioctl.c. Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Eric Dumazet <edumazet@google.com> Cc: netdev@vger.kernel.org Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-10-23compat_ioctl: handle SIOCOUTQNSDArnd Bergmann
Unlike the normal SIOCOUTQ, SIOCOUTQNSD was never handled in compat mode. Add it to the common socket compat handler along with similar ones. Fixes: 2f4e1b397097 ("tcp: ioctl type SIOCOUTQNSD returns amount of data not sent") Cc: Eric Dumazet <edumazet@google.com> Cc: netdev@vger.kernel.org Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-10-23af_unix: add compat_ioctl supportArnd Bergmann
The af_unix protocol family has a custom ioctl command (inexplicibly based on SIOCPROTOPRIVATE), but never had a compat_ioctl handler for 32-bit applications. Since all commands are compatible here, add a trivial wrapper that performs the compat_ptr() conversion for SIOCOUTQ/SIOCINQ. SIOCUNIXFILE does not use the argument, but it doesn't hurt to also use compat_ptr() here. Fixes: ba94f3088b79 ("unix: add ioctl to open a unix socket file with O_PATH") Cc: netdev@vger.kernel.org Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-10-23compat_ioctl: move hci_sock handlers into driverArnd Bergmann
All these ioctl commands are compatible, so we can handle them with a trivial wrapper in hci_sock.c and remove the listing in fs/compat_ioctl.c. A few of the commands pass integer arguments instead of pointers, so for correctness skip the compat_ptr() conversion here. Acked-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-10-23compat_ioctl: move rfcomm handlers into driverArnd Bergmann
All these ioctl commands are compatible, so we can handle them with a trivial wrapper in rfcomm/sock.c and remove the listing in fs/compat_ioctl.c. Acked-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-10-23compat_ioctl: move more drivers to compat_ptr_ioctlArnd Bergmann
The .ioctl and .compat_ioctl file operations have the same prototype so they can both point to the same function, which works great almost all the time when all the commands are compatible. One exception is the s390 architecture, where a compat pointer is only 31 bit wide, and converting it into a 64-bit pointer requires calling compat_ptr(). Most drivers here will never run in s390, but since we now have a generic helper for it, it's easy enough to use it consistently. I double-checked all these drivers to ensure that all ioctl arguments are used as pointers or are ignored, but are not interpreted as integer values. Acked-by: Jason Gunthorpe <jgg@mellanox.com> Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch> Acked-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: David Sterba <dsterba@suse.com> Acked-by: Darren Hart (VMware) <dvhart@infradead.org> Acked-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Acked-by: Bjorn Andersson <bjorn.andersson@linaro.org> Acked-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-10-23netfilter: nf_tables_offload: restore basechain deletionPablo Neira Ayuso
Unbind callbacks on chain deletion. Fixes: 8fc618c52d16 ("netfilter: nf_tables_offload: refactor the nft_flow_offload_chain function") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-10-23netfilter: nf_flow_table: set timeout before insertion into hashesPablo Neira Ayuso
Other garbage collector might remove an entry not fully set up yet. [570953.958293] RIP: 0010:memcmp+0x9/0x50 [...] [570953.958567] flow_offload_hash_cmp+0x1e/0x30 [nf_flow_table] [570953.958585] flow_offload_lookup+0x8c/0x110 [nf_flow_table] [570953.958606] nf_flow_offload_ip_hook+0x135/0xb30 [nf_flow_table] [570953.958624] nf_flow_offload_inet_hook+0x35/0x37 [nf_flow_table_inet] [570953.958646] nf_hook_slow+0x3c/0xb0 [570953.958664] __netif_receive_skb_core+0x90f/0xb10 [570953.958678] ? ip_rcv_finish+0x82/0xa0 [570953.958692] __netif_receive_skb_one_core+0x3b/0x80 [570953.958711] __netif_receive_skb+0x18/0x60 [570953.958727] netif_receive_skb_internal+0x45/0xf0 [570953.958741] napi_gro_receive+0xcd/0xf0 [570953.958764] ixgbe_clean_rx_irq+0x432/0xe00 [ixgbe] [570953.958782] ixgbe_poll+0x27b/0x700 [ixgbe] [570953.958796] net_rx_action+0x284/0x3c0 [570953.958817] __do_softirq+0xcc/0x27c [570953.959464] irq_exit+0xe8/0x100 [570953.960097] do_IRQ+0x59/0xe0 [570953.960734] common_interrupt+0xf/0xf Fixes: 43c8f131184f ("netfilter: nf_flow_table: fix missing error check for rhashtable_insert_fast") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-10-23netfilter: nf_tables: support for multiple devices per netdev hookPablo Neira Ayuso
This patch allows you to register one netdev basechain to multiple devices. This adds a new NFTA_HOOK_DEVS netlink attribute to specify the list of netdevices. Basechains store a list of hooks. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-10-23netfilter: nf_tables_offload: remove rules on unregistered device onlyPablo Neira Ayuso
After unbinding the list of flow_block callbacks, iterate over it to remove the existing rules in the netdevice that has just been unregistered. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-10-23netfilter: nf_tables_offload: add nft_flow_cls_offload_setup()Pablo Neira Ayuso
Add helper function to set up the flow_cls_offload object. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-10-23netfilter: nf_tables_offload: Pass callback list to nft_setup_cb_call()Pablo Neira Ayuso
This allows to reuse nft_setup_cb_call() from the callback unbind path. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-10-23netfilter: nf_tables_offload: add nft_flow_block_chain()Pablo Neira Ayuso
Add nft_flow_block_chain() helper function to reuse this function from netdev event handler. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-10-23netfilter: nf_tables: increase maximum devices number per flowtablePablo Neira Ayuso
Rise the maximum limit of devices per flowtable up to 256. Rename NFT_FLOWTABLE_DEVICE_MAX to NFT_NETDEVICE_MAX in preparation to reuse the netdev hook parser for ingress basechain. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-10-23netfilter: nf_tables: allow netdevice to be used only once per flowtablePablo Neira Ayuso
Allow netdevice only once per flowtable, otherwise hit EEXIST. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-10-23netfilter: nf_tables: dynamically allocate hooks per net_device in flowtablesPablo Neira Ayuso
Use a list of hooks per device instead an array. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-10-23netfilter: nf_flow_table: move priority to struct nf_flowtablePablo Neira Ayuso
Hardware offload needs access to the priority field, store this field in the nf_flowtable object. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-10-22fq_codel: do not include <linux/jhash.h>Eric Dumazet
Since commit 342db221829f ("sched: Call skb_get_hash_perturb in sch_fq_codel") we no longer need anything from this file. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-10-22ipv6: include <net/addrconf.h> for missing declarationsBen Dooks (Codethink)
Include <net/addrconf.h> for the missing declarations of various functions. Fixes the following sparse warnings: net/ipv6/addrconf_core.c:94:5: warning: symbol 'register_inet6addr_notifier' was not declared. Should it be static? net/ipv6/addrconf_core.c:100:5: warning: symbol 'unregister_inet6addr_notifier' was not declared. Should it be static? net/ipv6/addrconf_core.c:106:5: warning: symbol 'inet6addr_notifier_call_chain' was not declared. Should it be static? net/ipv6/addrconf_core.c:112:5: warning: symbol 'register_inet6addr_validator_notifier' was not declared. Should it be static? net/ipv6/addrconf_core.c:118:5: warning: symbol 'unregister_inet6addr_validator_notifier' was not declared. Should it be static? net/ipv6/addrconf_core.c:125:5: warning: symbol 'inet6addr_validator_notifier_call_chain' was not declared. Should it be static? net/ipv6/addrconf_core.c:237:6: warning: symbol 'in6_dev_finish_destroy' was not declared. Should it be static? Signed-off-by: Ben Dooks (Codethink) <ben.dooks@codethink.co.uk> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-10-22net: openvswitch: free vport unless register_netdevice() succeedsHillf Danton
syzbot found the following crash on: HEAD commit: 1e78030e Merge tag 'mmc-v5.3-rc1' of git://git.kernel.org/.. git tree: upstream console output: https://syzkaller.appspot.com/x/log.txt?x=148d3d1a600000 kernel config: https://syzkaller.appspot.com/x/.config?x=30cef20daf3e9977 dashboard link: https://syzkaller.appspot.com/bug?extid=13210896153522fe1ee5 compiler: gcc (GCC) 9.0.0 20181231 (experimental) syz repro: https://syzkaller.appspot.com/x/repro.syz?x=136aa8c4600000 C reproducer: https://syzkaller.appspot.com/x/repro.c?x=109ba792600000 ===================================================================== BUG: memory leak unreferenced object 0xffff8881207e4100 (size 128): comm "syz-executor032", pid 7014, jiffies 4294944027 (age 13.830s) hex dump (first 32 bytes): 00 70 16 18 81 88 ff ff 80 af 8c 22 81 88 ff ff .p.........".... 00 b6 23 17 81 88 ff ff 00 00 00 00 00 00 00 00 ..#............. backtrace: [<000000000eb78212>] kmemleak_alloc_recursive include/linux/kmemleak.h:43 [inline] [<000000000eb78212>] slab_post_alloc_hook mm/slab.h:522 [inline] [<000000000eb78212>] slab_alloc mm/slab.c:3319 [inline] [<000000000eb78212>] kmem_cache_alloc_trace+0x145/0x2c0 mm/slab.c:3548 [<00000000006ea6c6>] kmalloc include/linux/slab.h:552 [inline] [<00000000006ea6c6>] kzalloc include/linux/slab.h:748 [inline] [<00000000006ea6c6>] ovs_vport_alloc+0x37/0xf0 net/openvswitch/vport.c:130 [<00000000f9a04a7d>] internal_dev_create+0x24/0x1d0 net/openvswitch/vport-internal_dev.c:164 [<0000000056ee7c13>] ovs_vport_add+0x81/0x190 net/openvswitch/vport.c:199 [<000000005434efc7>] new_vport+0x19/0x80 net/openvswitch/datapath.c:194 [<00000000b7b253f1>] ovs_dp_cmd_new+0x22f/0x410 net/openvswitch/datapath.c:1614 [<00000000e0988518>] genl_family_rcv_msg+0x2ab/0x5b0 net/netlink/genetlink.c:629 [<00000000d0cc9347>] genl_rcv_msg+0x54/0x9c net/netlink/genetlink.c:654 [<000000006694b647>] netlink_rcv_skb+0x61/0x170 net/netlink/af_netlink.c:2477 [<0000000088381f37>] genl_rcv+0x29/0x40 net/netlink/genetlink.c:665 [<00000000dad42a47>] netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline] [<00000000dad42a47>] netlink_unicast+0x1ec/0x2d0 net/netlink/af_netlink.c:1328 [<0000000067e6b079>] netlink_sendmsg+0x270/0x480 net/netlink/af_netlink.c:1917 [<00000000aab08a47>] sock_sendmsg_nosec net/socket.c:637 [inline] [<00000000aab08a47>] sock_sendmsg+0x54/0x70 net/socket.c:657 [<000000004cb7c11d>] ___sys_sendmsg+0x393/0x3c0 net/socket.c:2311 [<00000000c4901c63>] __sys_sendmsg+0x80/0xf0 net/socket.c:2356 [<00000000c10abb2d>] __do_sys_sendmsg net/socket.c:2365 [inline] [<00000000c10abb2d>] __se_sys_sendmsg net/socket.c:2363 [inline] [<00000000c10abb2d>] __x64_sys_sendmsg+0x23/0x30 net/socket.c:2363 BUG: memory leak unreferenced object 0xffff88811723b600 (size 64): comm "syz-executor032", pid 7014, jiffies 4294944027 (age 13.830s) hex dump (first 32 bytes): 01 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 02 00 00 00 05 35 82 c1 .............5.. backtrace: [<00000000352f46d8>] kmemleak_alloc_recursive include/linux/kmemleak.h:43 [inline] [<00000000352f46d8>] slab_post_alloc_hook mm/slab.h:522 [inline] [<00000000352f46d8>] slab_alloc mm/slab.c:3319 [inline] [<00000000352f46d8>] __do_kmalloc mm/slab.c:3653 [inline] [<00000000352f46d8>] __kmalloc+0x169/0x300 mm/slab.c:3664 [<000000008e48f3d1>] kmalloc include/linux/slab.h:557 [inline] [<000000008e48f3d1>] ovs_vport_set_upcall_portids+0x54/0xd0 net/openvswitch/vport.c:343 [<00000000541e4f4a>] ovs_vport_alloc+0x7f/0xf0 net/openvswitch/vport.c:139 [<00000000f9a04a7d>] internal_dev_create+0x24/0x1d0 net/openvswitch/vport-internal_dev.c:164 [<0000000056ee7c13>] ovs_vport_add+0x81/0x190 net/openvswitch/vport.c:199 [<000000005434efc7>] new_vport+0x19/0x80 net/openvswitch/datapath.c:194 [<00000000b7b253f1>] ovs_dp_cmd_new+0x22f/0x410 net/openvswitch/datapath.c:1614 [<00000000e0988518>] genl_family_rcv_msg+0x2ab/0x5b0 net/netlink/genetlink.c:629 [<00000000d0cc9347>] genl_rcv_msg+0x54/0x9c net/netlink/genetlink.c:654 [<000000006694b647>] netlink_rcv_skb+0x61/0x170 net/netlink/af_netlink.c:2477 [<0000000088381f37>] genl_rcv+0x29/0x40 net/netlink/genetlink.c:665 [<00000000dad42a47>] netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline] [<00000000dad42a47>] netlink_unicast+0x1ec/0x2d0 net/netlink/af_netlink.c:1328 [<0000000067e6b079>] netlink_sendmsg+0x270/0x480 net/netlink/af_netlink.c:1917 [<00000000aab08a47>] sock_sendmsg_nosec net/socket.c:637 [inline] [<00000000aab08a47>] sock_sendmsg+0x54/0x70 net/socket.c:657 [<000000004cb7c11d>] ___sys_sendmsg+0x393/0x3c0 net/socket.c:2311 [<00000000c4901c63>] __sys_sendmsg+0x80/0xf0 net/socket.c:2356 BUG: memory leak unreferenced object 0xffff8881228ca500 (size 128): comm "syz-executor032", pid 7015, jiffies 4294944622 (age 7.880s) hex dump (first 32 bytes): 00 f0 27 18 81 88 ff ff 80 ac 8c 22 81 88 ff ff ..'........".... 40 b7 23 17 81 88 ff ff 00 00 00 00 00 00 00 00 @.#............. backtrace: [<000000000eb78212>] kmemleak_alloc_recursive include/linux/kmemleak.h:43 [inline] [<000000000eb78212>] slab_post_alloc_hook mm/slab.h:522 [inline] [<000000000eb78212>] slab_alloc mm/slab.c:3319 [inline] [<000000000eb78212>] kmem_cache_alloc_trace+0x145/0x2c0 mm/slab.c:3548 [<00000000006ea6c6>] kmalloc include/linux/slab.h:552 [inline] [<00000000006ea6c6>] kzalloc include/linux/slab.h:748 [inline] [<00000000006ea6c6>] ovs_vport_alloc+0x37/0xf0 net/openvswitch/vport.c:130 [<00000000f9a04a7d>] internal_dev_create+0x24/0x1d0 net/openvswitch/vport-internal_dev.c:164 [<0000000056ee7c13>] ovs_vport_add+0x81/0x190 net/openvswitch/vport.c:199 [<000000005434efc7>] new_vport+0x19/0x80 net/openvswitch/datapath.c:194 [<00000000b7b253f1>] ovs_dp_cmd_new+0x22f/0x410 net/openvswitch/datapath.c:1614 [<00000000e0988518>] genl_family_rcv_msg+0x2ab/0x5b0 net/netlink/genetlink.c:629 [<00000000d0cc9347>] genl_rcv_msg+0x54/0x9c net/netlink/genetlink.c:654 [<000000006694b647>] netlink_rcv_skb+0x61/0x170 net/netlink/af_netlink.c:2477 [<0000000088381f37>] genl_rcv+0x29/0x40 net/netlink/genetlink.c:665 [<00000000dad42a47>] netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline] [<00000000dad42a47>] netlink_unicast+0x1ec/0x2d0 net/netlink/af_netlink.c:1328 [<0000000067e6b079>] netlink_sendmsg+0x270/0x480 net/netlink/af_netlink.c:1917 [<00000000aab08a47>] sock_sendmsg_nosec net/socket.c:637 [inline] [<00000000aab08a47>] sock_sendmsg+0x54/0x70 net/socket.c:657 [<000000004cb7c11d>] ___sys_sendmsg+0x393/0x3c0 net/socket.c:2311 [<00000000c4901c63>] __sys_sendmsg+0x80/0xf0 net/socket.c:2356 [<00000000c10abb2d>] __do_sys_sendmsg net/socket.c:2365 [inline] [<00000000c10abb2d>] __se_sys_sendmsg net/socket.c:2363 [inline] [<00000000c10abb2d>] __x64_sys_sendmsg+0x23/0x30 net/socket.c:2363 ===================================================================== The function in net core, register_netdevice(), may fail with vport's destruction callback either invoked or not. After commit 309b66970ee2 ("net: openvswitch: do not free vport if register_netdevice() is failed."), the duty to destroy vport is offloaded from the driver OTOH, which ends up in the memory leak reported. It is fixed by releasing vport unless device is registered successfully. To do that, the callback assignment is defered until device is registered. Reported-by: syzbot+13210896153522fe1ee5@syzkaller.appspotmail.com Fixes: 309b66970ee2 ("net: openvswitch: do not free vport if register_netdevice() is failed.") Cc: Taehee Yoo <ap420073@gmail.com> Cc: Greg Rose <gvrose8192@gmail.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Cc: Ying Xue <ying.xue@windriver.com> Cc: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: Hillf Danton <hdanton@sina.com> Acked-by: Pravin B Shelar <pshelar@ovn.org> [sbrivio: this was sent to dev@openvswitch.org and never made its way to netdev -- resending original patch] Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: Greg Rose <gvrose8192@gmail.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-10-22net: sched: taprio: fix -Wmissing-prototypes warningsYi Wang
We get one warnings when build kernel W=1: net/sched/sch_taprio.c:1155:6: warning: no previous prototype for ‘taprio_offload_config_changed’ [-Wmissing-prototypes] Make the function static to fix this. Fixes: 9c66d1564676 ("taprio: Add support for hardware offloading") Signed-off-by: Yi Wang <wang.yi59@zte.com.cn> Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-10-22net: dsa: remove dsa_switch_alloc helperVivien Didelot
Now that ports are dynamically listed in the fabric, there is no need to provide a special helper to allocate the dsa_switch structure. This will give more flexibility to drivers to embed this structure as they wish in their private structure. Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-10-22net: dsa: allocate ports on touchVivien Didelot
Allocate the struct dsa_port the first time it is accessed with dsa_port_touch, and remove the static dsa_port array from the dsa_switch structure. Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-10-22net: dsa: use ports list to setup default CPU portVivien Didelot
Use the new ports list instead of iterating over switches and their ports when setting up the default CPU port. Unassign it on teardown. Now that we can iterate over multiple CPU ports, remove dst->cpu_dp. At the same time, provide a better error message for CPU-less tree. Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-10-22net: dsa: use ports list to find first CPU portVivien Didelot
Use the new ports list instead of iterating over switches and their ports when looking up the first CPU port in the tree. Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-10-22net: dsa: use ports list to setup multiple master devicesVivien Didelot
Now that we have a potential list of CPU ports, make use of it instead of only configuring the master device of an unique CPU port. Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-10-22net: dsa: use ports list to find a port by nodeVivien Didelot
Use the new ports list instead of iterating over switches and their ports to find a port from a given node. Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-10-22net: dsa: use ports list for routing table setupVivien Didelot
Use the new ports list instead of accessing the dsa_switch array of ports when iterating over DSA ports of a switch to set up the routing table. Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-10-22net: dsa: use ports list to setup switchesVivien Didelot
Use the new ports list instead of iterating over switches and their ports when setting up the switches and their ports. At the same time, provide setup states and messages for ports and switches as it is done for the trees. Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>