summaryrefslogtreecommitdiff
path: root/net/sunrpc/xprtrdma
AgeCommit message (Collapse)Author
2020-04-17svcrdma: Fix leak of svc_rdma_recv_ctxt objectsChuck Lever
Utilize the xpo_release_rqst transport method to ensure that each rqstp's svc_rdma_recv_ctxt object is released even when the server cannot return a Reply for that rqstp. Without this fix, each RPC whose Reply cannot be sent leaks one svc_rdma_recv_ctxt. This is a 2.5KB structure, a 4KB DMA-mapped Receive buffer, and any pages that might be part of the Reply message. The leak is infrequent unless the network fabric is unreliable or Kerberos is in use, as GSS sequence window overruns, which result in connection loss, are more common on fast transports. Fixes: 3a88092ee319 ("svcrdma: Preserve Receive buffer until svc_rdma_sendto") Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-04-17svcrdma: Fix trace point use-after-free raceChuck Lever
I hit this while testing nfsd-5.7 with kernel memory debugging enabled on my server: Mar 30 13:21:45 klimt kernel: BUG: unable to handle page fault for address: ffff8887e6c279a8 Mar 30 13:21:45 klimt kernel: #PF: supervisor read access in kernel mode Mar 30 13:21:45 klimt kernel: #PF: error_code(0x0000) - not-present page Mar 30 13:21:45 klimt kernel: PGD 3601067 P4D 3601067 PUD 87c519067 PMD 87c3e2067 PTE 800ffff8193d8060 Mar 30 13:21:45 klimt kernel: Oops: 0000 [#1] SMP DEBUG_PAGEALLOC PTI Mar 30 13:21:45 klimt kernel: CPU: 2 PID: 1933 Comm: nfsd Not tainted 5.6.0-rc6-00040-g881e87a3c6f9 #1591 Mar 30 13:21:45 klimt kernel: Hardware name: Supermicro Super Server/X10SRL-F, BIOS 1.0c 09/09/2015 Mar 30 13:21:45 klimt kernel: RIP: 0010:svc_rdma_post_chunk_ctxt+0xab/0x284 [rpcrdma] Mar 30 13:21:45 klimt kernel: Code: c1 83 34 02 00 00 29 d0 85 c0 7e 72 48 8b bb a0 02 00 00 48 8d 54 24 08 4c 89 e6 48 8b 07 48 8b 40 20 e8 5a 5c 2b e1 41 89 c6 <8b> 45 20 89 44 24 04 8b 05 02 e9 01 00 85 c0 7e 33 e9 5e 01 00 00 Mar 30 13:21:45 klimt kernel: RSP: 0018:ffffc90000dfbdd8 EFLAGS: 00010286 Mar 30 13:21:45 klimt kernel: RAX: 0000000000000000 RBX: ffff8887db8db400 RCX: 0000000000000030 Mar 30 13:21:45 klimt kernel: RDX: 0000000000000040 RSI: 0000000000000000 RDI: 0000000000000246 Mar 30 13:21:45 klimt kernel: RBP: ffff8887e6c27988 R08: 0000000000000000 R09: 0000000000000004 Mar 30 13:21:45 klimt kernel: R10: ffffc90000dfbdd8 R11: 00c068ef00000000 R12: ffff8887eb4e4a80 Mar 30 13:21:45 klimt kernel: R13: ffff8887db8db634 R14: 0000000000000000 R15: ffff8887fc931000 Mar 30 13:21:45 klimt kernel: FS: 0000000000000000(0000) GS:ffff88885bd00000(0000) knlGS:0000000000000000 Mar 30 13:21:45 klimt kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Mar 30 13:21:45 klimt kernel: CR2: ffff8887e6c279a8 CR3: 000000081b72e002 CR4: 00000000001606e0 Mar 30 13:21:45 klimt kernel: Call Trace: Mar 30 13:21:45 klimt kernel: ? svc_rdma_vec_to_sg+0x7f/0x7f [rpcrdma] Mar 30 13:21:45 klimt kernel: svc_rdma_send_write_chunk+0x59/0xce [rpcrdma] Mar 30 13:21:45 klimt kernel: svc_rdma_sendto+0xf9/0x3ae [rpcrdma] Mar 30 13:21:45 klimt kernel: ? nfsd_destroy+0x51/0x51 [nfsd] Mar 30 13:21:45 klimt kernel: svc_send+0x105/0x1e3 [sunrpc] Mar 30 13:21:45 klimt kernel: nfsd+0xf2/0x149 [nfsd] Mar 30 13:21:45 klimt kernel: kthread+0xf6/0xfb Mar 30 13:21:45 klimt kernel: ? kthread_queue_delayed_work+0x74/0x74 Mar 30 13:21:45 klimt kernel: ret_from_fork+0x3a/0x50 Mar 30 13:21:45 klimt kernel: Modules linked in: ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue ib_umad ib_ipoib mlx4_ib sb_edac x86_pkg_temp_thermal iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel glue_helper crypto_simd cryptd pcspkr rpcrdma i2c_i801 rdma_ucm lpc_ich mfd_core ib_iser rdma_cm iw_cm ib_cm mei_me raid0 libiscsi mei sg scsi_transport_iscsi ioatdma wmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter nfsd nfs_acl lockd auth_rpcgss grace sunrpc ip_tables xfs libcrc32c mlx4_en sd_mod sr_mod cdrom mlx4_core crc32c_intel igb nvme i2c_algo_bit ahci i2c_core libahci nvme_core dca libata t10_pi qedr dm_mirror dm_region_hash dm_log dm_mod dax qede qed crc8 ib_uverbs ib_core Mar 30 13:21:45 klimt kernel: CR2: ffff8887e6c279a8 Mar 30 13:21:45 klimt kernel: ---[ end trace 87971d2ad3429424 ]--- It's absolutely not safe to use resources pointed to by the @send_wr argument of ib_post_send() _after_ that function returns. Those resources are typically freed by the Send completion handler, which can run before ib_post_send() returns. Thus the trace points currently around ib_post_send() in the server's RPC/RDMA transport are a hazard, even when they are disabled. Rearrange them so that they touch the Work Request only _before_ ib_post_send() is invoked. Fixes: bd2abef33394 ("svcrdma: Trace key RDMA API events") Fixes: 4201c7464753 ("svcrdma: Introduce svc_rdma_send_ctxt") Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-04-17SUNRPC: Fix backchannel RPC soft lockupsChuck Lever
Currently, after the forward channel connection goes away, backchannel operations are causing soft lockups on the server because call_transmit_status's SOFTCONN logic ignores ENOTCONN. Such backchannel Calls are aggressively retried until the client reconnects. Backchannel Calls should use RPC_TASK_NOCONNECT rather than RPC_TASK_SOFTCONN. If there is no forward connection, the server is not capable of establishing a connection back to the client, thus that backchannel request should fail before the server attempts to send it. Commit 58255a4e3ce5 ("NFSD: NFSv4 callback client should use RPC_TASK_SOFTCONN") was merged several years before RPC_TASK_NOCONNECT was available. Because setup_callback_client() explicitly sets NOPING, the NFSv4.0 callback connection depends on the first callback RPC to initiate a connection to the client. Thus NFSv4.0 needs to continue to use RPC_TASK_SOFTCONN. Suggested-by: Trond Myklebust <trondmy@hammerspace.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Cc: <stable@vger.kernel.org> # v4.20+
2020-04-07Merge tag 'nfs-for-5.7-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds
Pull NFS client updates from Trond Myklebust: "Highlights include: Stable fixes: - Fix a page leak in nfs_destroy_unlinked_subrequests() - Fix use-after-free issues in nfs_pageio_add_request() - Fix new mount code constant_table array definitions - finish_automount() requires us to hold 2 refs to the mount record Features: - Improve the accuracy of telldir/seekdir by using 64-bit cookies when possible. - Allow one RDMA active connection and several zombie connections to prevent blocking if the remote server is unresponsive. - Limit the size of the NFS access cache by default - Reduce the number of references to credentials that are taken by NFS - pNFS files and flexfiles drivers now support per-layout segment COMMIT lists. - Enable partial-file layout segments in the pNFS/flexfiles driver. - Add support for CB_RECALL_ANY to the pNFS flexfiles layout type - pNFS/flexfiles Report NFS4ERR_DELAY and NFS4ERR_GRACE errors from the DS using the layouterror mechanism. Bugfixes and cleanups: - SUNRPC: Fix krb5p regressions - Don't specify NFS version in "UDP not supported" error - nfsroot: set tcp as the default transport protocol - pnfs: Return valid stateids in nfs_layout_find_inode_by_stateid() - alloc_nfs_open_context() must use the file cred when available - Fix locking when dereferencing the delegation cred - Fix memory leaks in O_DIRECT when nfs_get_lock_context() fails - Various clean ups of the NFS O_DIRECT commit code - Clean up RDMA connect/disconnect - Replace zero-length arrays with C99-style flexible arrays" * tag 'nfs-for-5.7-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (86 commits) NFS: Clean up process of marking inode stale. SUNRPC: Don't start a timer on an already queued rpc task NFS/pnfs: Reference the layout cred in pnfs_prepare_layoutreturn() NFS/pnfs: Fix dereference of layout cred in pnfs_layoutcommit_inode() NFS: Beware when dereferencing the delegation cred NFS: Add a module parameter to set nfs_mountpoint_expiry_timeout NFS: finish_automount() requires us to hold 2 refs to the mount record NFS: Fix a few constant_table array definitions NFS: Try to join page groups before an O_DIRECT retransmission NFS: Refactor nfs_lock_and_join_requests() NFS: Reverse the submission order of requests in __nfs_pageio_add_request() NFS: Clean up nfs_lock_and_join_requests() NFS: Remove the redundant function nfs_pgio_has_mirroring() NFS: Fix memory leaks in nfs_pageio_stop_mirroring() NFS: Fix a request reference leak in nfs_direct_write_clear_reqs() NFS: Fix use-after-free issues in nfs_pageio_add_request() NFS: Fix races nfs_page_group_destroy() vs nfs_destroy_unlinked_subrequests() NFS: Fix a page leak in nfs_destroy_unlinked_subrequests() NFS: Remove unused FLUSH_SYNC support in nfs_initiate_pgio() pNFS/flexfiles: Specify the layout segment range in LAYOUTGET ...
2020-03-27svcrdma: Fix leak of transport addressesChuck Lever
Kernel memory leak detected: unreferenced object 0xffff888849cdf480 (size 8): comm "kworker/u8:3", pid 2086, jiffies 4297898756 (age 4269.856s) hex dump (first 8 bytes): 30 00 cd 49 88 88 ff ff 0..I.... backtrace: [<00000000acfc370b>] __kmalloc_track_caller+0x137/0x183 [<00000000a2724354>] kstrdup+0x2b/0x43 [<0000000082964f84>] xprt_rdma_format_addresses+0x114/0x17d [rpcrdma] [<00000000dfa6ed00>] xprt_setup_rdma_bc+0xc0/0x10c [rpcrdma] [<0000000073051a83>] xprt_create_transport+0x3f/0x1a0 [sunrpc] [<0000000053531a8e>] rpc_create+0x118/0x1cd [sunrpc] [<000000003a51b5f8>] setup_callback_client+0x1a5/0x27d [nfsd] [<000000001bd410af>] nfsd4_process_cb_update.isra.7+0x16c/0x1ac [nfsd] [<000000007f4bbd56>] nfsd4_run_cb_work+0x4c/0xbd [nfsd] [<0000000055c5586b>] process_one_work+0x1b2/0x2fe [<00000000b1e3e8ef>] worker_thread+0x1a6/0x25a [<000000005205fb78>] kthread+0xf6/0xfb [<000000006d2dc057>] ret_from_fork+0x3a/0x50 Introduce a call to xprt_rdma_free_addresses() similar to the way that the TCP backchannel releases a transport's peer address strings. Fixes: 5d252f90a800 ("svcrdma: Add class for RDMA backwards direction transport") Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-27xprtrdma: kmalloc rpcrdma_ep separate from rpcrdma_xprtChuck Lever
Change the rpcrdma_xprt_disconnect() function so that it no longer waits for the DISCONNECTED event. This prevents blocking if the remote is unresponsive. In rpcrdma_xprt_disconnect(), the transport's rpcrdma_ep is detached. Upon return from rpcrdma_xprt_disconnect(), the transport (r_xprt) is ready immediately for a new connection. The RDMA_CM_DEVICE_REMOVAL and RDMA_CM_DISCONNECTED events are now handled almost identically. However, because the lifetimes of rpcrdma_xprt structures and rpcrdma_ep structures are now independent, creating an rpcrdma_ep needs to take a module ref count. The ep now owns most of the hardware resources for a transport. Also, a kref is needed to ensure that rpcrdma_ep sticks around long enough for the cm_event_handler to finish. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-03-27xprtrdma: Extract sockaddr from struct rdma_cm_idChuck Lever
rpcrdma_cm_event_handler() is always passed an @id pointer that is valid. However, in a subsequent patch, we won't be able to extract an r_xprt in every case. So instead of using the r_xprt's presentation address strings, extract them from struct rdma_cm_id. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-03-27xprtrdma: Merge struct rpcrdma_ia into struct rpcrdma_epChuck Lever
I eventually want to allocate rpcrdma_ep separately from struct rpcrdma_xprt so that on occasion there can be more than one ep per xprt. The new struct rpcrdma_ep will contain all the fields currently in rpcrdma_ia and in rpcrdma_ep. This is all the device and CM settings for the connection, in addition to per-connection settings negotiated with the remote. Take this opportunity to rename the existing ep fields from rep_* to re_* to disambiguate these from struct rpcrdma_rep. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-03-27xprtrdma: Disconnect on flushed completionChuck Lever
Completion errors after a disconnect often occur much sooner than a CM_DISCONNECT event. Use this to try to detect connection loss more quickly. Note that other kernel ULPs do take care to disconnect explicitly when a WR is flushed. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-03-27xprtrdma: Remove rpcrdma_ia::ri_flagsChuck Lever
Clean up: The upper layer serializes calls to xprt_rdma_close, so there is no need for an atomic bit operation, saving 8 bytes in rpcrdma_ia. This enables merging rpcrdma_ia_remove directly into the disconnect logic. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-03-27xprtrdma: Invoke rpcrdma_ia_open in the connect workerChuck Lever
Move rdma_cm_id creation into rpcrdma_ep_create() so that it is now responsible for allocating all per-connection hardware resources. With this clean-up, all three arms of the switch statement in rpcrdma_ep_connect are exactly the same now, thus the switch can be removed. Because device removal behaves a little differently than disconnection, there is a little more work to be done before rpcrdma_ep_destroy() can release the connection's rdma_cm_id. So it is not quite symmetrical with rpcrdma_ep_create() yet. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-03-27xprtrdma: Allocate Protection Domain in rpcrdma_ep_create()Chuck Lever
Make a Protection Domain (PD) a per-connection resource rather than a per-transport resource. In other words, when the connection terminates, the PD is destroyed. Thus there is one less HW resource that remains allocated to a transport after a connection is closed. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-03-27xprtrdma: Refactor rpcrdma_ep_connect() and rpcrdma_ep_disconnect()Chuck Lever
Clean up: Simplify the synopses of functions in the connect and disconnect paths in preparation for combining the rpcrdma_ia and struct rpcrdma_ep structures. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-03-27xprtrdma: Clean up the post_send pathChuck Lever
Clean up: Simplify the synopses of functions in the post_send path by combining the struct rpcrdma_ia and struct rpcrdma_ep arguments. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-03-27xprtrdma: Refactor frwr_init_mr()Chuck Lever
Clean up: prepare for combining the rpcrdma_ia and rpcrdma_ep structures. Take the opportunity to rename the function to be consistent with the "subsystem _ object _ verb" naming scheme. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-03-27xprtrdma: Invoke rpcrdma_ep_create() in the connect workerChuck Lever
Refactor rpcrdma_ep_create(), rpcrdma_ep_disconnect(), and rpcrdma_ep_destroy(). rpcrdma_ep_create will be invoked at connect time instead of at transport set-up time. It will be responsible for allocating per- connection resources. In this patch it allocates the CQs and creates a QP. More to come. rpcrdma_ep_destroy() is the inverse functionality that is invoked at disconnect time. It will be responsible for releasing the CQs and QP. These changes should be safe to do because both connect and disconnect is guaranteed to be serialized by the transport send lock. This takes us another step closer to resolving the address and route only at connect time so that connection failover to another device will work correctly. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-03-27xprtrdma: Enhance MR-related trace pointsChuck Lever
Two changes: - Show the number of SG entries that were mapped. This helps debug DMA-related problems. - Record the MR's resource ID instead of its memory address. This groups each MR with its associated rdma-tool output, and reduces needless exposure of memory addresses. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-03-16svcrdma: Avoid DMA mapping small RPC RepliesChuck Lever
On some platforms, DMA mapping part of a page is more costly than copying bytes. Indeed, not involving the I/O MMU can help the RPC/RDMA transport scale better for tiny I/Os across more RDMA devices. This is because interaction with the I/O MMU is eliminated for each of these small I/Os. Without the explicit unmapping, the NIC no longer needs to do a costly internal TLB shoot down for buffers that are just a handful of bytes. Since pull-up is now a more a frequent operation, I've introduced a trace point in the pull-up path. It can be used for debugging or user-space tools that count pull-up frequency. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16svcrdma: Fix double sync of transport header bufferChuck Lever
Performance optimization: Avoid syncing the transport buffer twice when Reply buffer pull-up is necessary. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16svcrdma: Refactor chunk list encodersChuck Lever
Same idea as the receive-side changes I did a while back: use xdr_stream helpers rather than open-coding the XDR chunk list encoders. This builds the Reply transport header from beginning to end without backtracking. As additional clean-ups, fill in documenting comments for the XDR encoders and sprinkle some trace points in the new encoding functions. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16SUNRPC: Add encoders for list item discriminatorsChuck Lever
Clean up. These are taken from the client-side RPC/RDMA transport to a more global header file so they can be used elsewhere. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16svcrdma: Rename svcrdma_encode trace points in send routinesChuck Lever
These trace points are misnamed: trace_svcrdma_encode_wseg trace_svcrdma_encode_write trace_svcrdma_encode_reply trace_svcrdma_encode_rseg trace_svcrdma_encode_read trace_svcrdma_encode_pzr Because they actually trace posting on the Send Queue. Let's rename them so that I can add trace points in the chunk list encoders that actually do trace chunk list encoding events. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16svcrdma: Update synopsis of svc_rdma_send_reply_msg()Chuck Lever
Preparing for subsequent patches, no behavior change expected. Pass the RPC Call's svc_rdma_recv_ctxt deeper into the sendto() path. This enables passing more information about Requester- provided Write and Reply chunks into those lower-level functions. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16svcrdma: Update synopsis of svc_rdma_map_reply_msg()Chuck Lever
Preparing for subsequent patches, no behavior change expected. Pass the RPC Call's svc_rdma_recv_ctxt deeper into the sendto() path. This enables passing more information about Requester- provided Write and Reply chunks into those lower-level functions. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16svcrdma: Update synopsis of svc_rdma_send_reply_chunk()Chuck Lever
Preparing for subsequent patches, no behavior change expected. Pass the RPC Call's svc_rdma_recv_ctxt deeper into the sendto() path. This enables passing more information about Requester- provided Write and Reply chunks into the lower-level send functions. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16svcrdma: De-duplicate code that locates Write and Reply chunksChuck Lever
Cache the locations of the Requester-provided Write list and Reply chunk so that the Send path doesn't need to parse the Call header again. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16svcrdma: Use struct xdr_stream to decode ingress transport headersChuck Lever
The logic that checks incoming network headers has to be scrupulous. De-duplicate: replace open-coded buffer overflow checks with the use of xdr_stream helpers that are used most everywhere else XDR decoding is done. One minor change to the sanity checks: instead of checking the length of individual segments, cap the length of the whole chunk to be sure it can fit in the set of pages available in rq_pages. This should be a better test of whether the server can handle the chunks in each request. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16svcrdma: Remove svcrdma_cm_event() trace pointChuck Lever
Clean up. This trace point is no longer needed because the RDMA/core CMA code has an equivalent trace point that was added by commit ed999f820a6c ("RDMA/cma: Add trace points in RDMA Connection Manager"). Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16SUNRPC: Add xdr_pad_size() helperChuck Lever
Introduce a helper function to compute the XDR pad size of a variable-length XDR object. Clean up: Replace open-coded calculation of XDR pad sizes. I'm sure I haven't found every instance of this calculation. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16svcrdma: Fix double svc_rdma_send_ctxt_put() in an error pathChuck Lever
This error path is almost never executed. Found by code inspection. Fixes: 99722fe4d5a6 ("svcrdma: Persistently allocate and DMA-map Send buffers") Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16nfsd: Fix NFSv4 READ on RDMA when using readvChuck Lever
svcrdma expects that the payload falls precisely into the xdr_buf page vector. This does not seem to be the case for nfsd4_encode_readv(). This code is called only when fops->splice_read is missing or when RQ_SPLICE_OK is clear, so it's not a noticeable problem in many common cases. Add new transport method: ->xpo_read_payload so that when a READ payload does not fit exactly in rq_res's page vector, the XDR encoder can inform the RPC transport exactly where that payload is, without the payload's XDR pad. That way, when a Write chunk is present, the transport knows what byte range in the Reply message is supposed to be matched with the chunk. Note that the Linux NFS server implementation of NFS/RDMA can currently handle only one Write chunk per RPC-over-RDMA message. This simplifies the implementation of this fix. Fixes: b04209806384 ("nfsd4: allow exotic read compounds") Buglink: https://bugzilla.kernel.org/show_bug.cgi?id=198053 Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16svcrdma: Replace zero-length array with flexible-array memberGustavo A. R. Silva
The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kind of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on. Also, notice that, dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero."[1] This issue was found with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] https://github.com/KSPP/linux/issues/21 [3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-02-13xprtrdma: Fix DMA scatter-gather list mapping imbalanceChuck Lever
The @nents value that was passed to ib_dma_map_sg() has to be passed to the matching ib_dma_unmap_sg() call. If ib_dma_map_sg() choses to concatenate sg entries, it will return a different nents value than it was passed. The bug was exposed by recent changes to the AMD IOMMU driver, which enabled sg entry concatenation. Looking all the way back to commit 4143f34e01e9 ("xprtrdma: Port to new memory registration API") and reviewing other kernel ULPs, it's not clear that the frwr_map() logic was ever correct for this case. Reported-by: Andre Tomt <andre@tomt.net> Suggested-by: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Cc: stable@vger.kernel.org Reviewed-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-01-15xprtrdma: DMA map rr_rdma_buf as each rpcrdma_rep is createdChuck Lever
Clean up: This simplifies the logic in rpcrdma_post_recvs. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-01-15xprtrdma: Destroy reps from previous connection instanceChuck Lever
To safely get rid of all rpcrdma_reps from a particular connection instance, xprtrdma has to wait until each of those reps is finished being used. A rep may be backing the rq_rcv_buf of an RPC that has just completed, for example. Since it is safe to invoke rpcrdma_rep_destroy() only in the Receive completion handler, simply mark reps remaining in the rb_all_reps list after the transport is drained. These will then be deleted as rpcrdma_post_recvs pulls them off the rep free list. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-01-15xprtrdma: Destroy rpcrdma_rep when Receive is flushedChuck Lever
This reduces the hardware and memory footprint of an unconnected transport. At some point in the future, transport reconnect will allow resolving the destination IP address through a different device. The current change enables reps for the new connection to be allocated on whichever NUMA node the new device affines to after a reconnect. Note that this does not destroy _all_ the transport's reps... there will be a few that are still part of a running RPC completion. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-01-15xprtrdma: Allocate and map transport header buffers at connect timeChuck Lever
Currently the underlying RDMA device is chosen at transport set-up time. But it will soon be at connect time instead. The maximum size of a transport header is based on device capabilities. Thus transport header buffers have to be allocated _after_ the underlying device has been chosen (via address and route resolution); ie, in the connect worker. Thus, move the allocation of transport header buffers to the connect worker, after the point at which the underlying RDMA device has been chosen. This also means the RDMA device is available to do a DMA mapping of these buffers at connect time, instead of in the hot I/O path. Make that optimization as well. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-01-15xprtrdma: Refactor frwr_is_supportedChuck Lever
Refactor: Perform the "is supported" check in rpcrdma_ep_create() instead of in rpcrdma_ia_open(). frwr_open() is where most of the logic to query device attributes is already located. The current code displays a redundant error message when the device does not support FRWR. As an additional clean-up, this patch removes the extra message. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-01-15xprtrdma: Eliminate per-transport "max pages"Chuck Lever
To support device hotplug and migrating a connection between devices of different capabilities, we have to guarantee that all in-kernel devices can support the same max NFS payload size (1 megabyte). This means that possibly one or two in-tree devices are no longer supported for NFS/RDMA because they cannot support 1MB rsize/wsize. The only one I confirmed was cxgb3, but it has already been removed from the kernel. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-01-15xprtrdma: Refactor initialization of ep->rep_max_requestsChuck Lever
Clean up: there is no need to keep two copies of the same value. Also, in subsequent patches, rpcrdma_ep_create() will be called in the connect worker rather than at set-up time. Minor fix: Initialize the transport's sendctx to the value based on the capabilities of the underlying device, not the maximum setting. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-01-15xprtrdma: Make sendctx queue lifetime the same as connection lifetimeChuck Lever
The size of the sendctx queue depends on the value stored in ia->ri_max_send_sges. This value is determined by querying the underlying device. Eventually, rpcrdma_ia_open() and rpcrdma_ep_create() will be called in the connect worker rather than at transport set-up time. The underlying device will not have been chosen device set-up time. The sendctx queue will thus have to be created after the underlying device has been chosen via address and route resolution; in other words, in the connect worker. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-01-15xprtrdma: Eliminate ri_max_send_sgesChuck Lever
Clean-up. The max_send_sge value also happens to be stored in ep->rep_attr. Let's keep just a single copy. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-01-14xprtrdma: Fix oops in Receive handler after device removalChuck Lever
Since v5.4, a device removal occasionally triggered this oops: Dec 2 17:13:53 manet kernel: BUG: unable to handle page fault for address: 0000000c00000219 Dec 2 17:13:53 manet kernel: #PF: supervisor read access in kernel mode Dec 2 17:13:53 manet kernel: #PF: error_code(0x0000) - not-present page Dec 2 17:13:53 manet kernel: PGD 0 P4D 0 Dec 2 17:13:53 manet kernel: Oops: 0000 [#1] SMP Dec 2 17:13:53 manet kernel: CPU: 2 PID: 468 Comm: kworker/2:1H Tainted: G W 5.4.0-00050-g53717e43af61 #883 Dec 2 17:13:53 manet kernel: Hardware name: Supermicro SYS-6028R-T/X10DRi, BIOS 1.1a 10/16/2015 Dec 2 17:13:53 manet kernel: Workqueue: ib-comp-wq ib_cq_poll_work [ib_core] Dec 2 17:13:53 manet kernel: RIP: 0010:rpcrdma_wc_receive+0x7c/0xf6 [rpcrdma] Dec 2 17:13:53 manet kernel: Code: 6d 8b 43 14 89 c1 89 45 78 48 89 4d 40 8b 43 2c 89 45 14 8b 43 20 89 45 18 48 8b 45 20 8b 53 14 48 8b 30 48 8b 40 10 48 8b 38 <48> 8b 87 18 02 00 00 48 85 c0 75 18 48 8b 05 1e 24 c4 e1 48 85 c0 Dec 2 17:13:53 manet kernel: RSP: 0018:ffffc900035dfe00 EFLAGS: 00010246 Dec 2 17:13:53 manet kernel: RAX: ffff888467290000 RBX: ffff88846c638400 RCX: 0000000000000048 Dec 2 17:13:53 manet kernel: RDX: 0000000000000048 RSI: 00000000f942e000 RDI: 0000000c00000001 Dec 2 17:13:53 manet kernel: RBP: ffff888467611b00 R08: ffff888464e4a3c4 R09: 0000000000000000 Dec 2 17:13:53 manet kernel: R10: ffffc900035dfc88 R11: fefefefefefefeff R12: ffff888865af4428 Dec 2 17:13:53 manet kernel: R13: ffff888466023000 R14: ffff88846c63f000 R15: 0000000000000010 Dec 2 17:13:53 manet kernel: FS: 0000000000000000(0000) GS:ffff88846fa80000(0000) knlGS:0000000000000000 Dec 2 17:13:53 manet kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Dec 2 17:13:53 manet kernel: CR2: 0000000c00000219 CR3: 0000000002009002 CR4: 00000000001606e0 Dec 2 17:13:53 manet kernel: Call Trace: Dec 2 17:13:53 manet kernel: __ib_process_cq+0x5c/0x14e [ib_core] Dec 2 17:13:53 manet kernel: ib_cq_poll_work+0x26/0x70 [ib_core] Dec 2 17:13:53 manet kernel: process_one_work+0x19d/0x2cd Dec 2 17:13:53 manet kernel: ? cancel_delayed_work_sync+0xf/0xf Dec 2 17:13:53 manet kernel: worker_thread+0x1a6/0x25a Dec 2 17:13:53 manet kernel: ? cancel_delayed_work_sync+0xf/0xf Dec 2 17:13:53 manet kernel: kthread+0xf4/0xf9 Dec 2 17:13:53 manet kernel: ? kthread_queue_delayed_work+0x74/0x74 Dec 2 17:13:53 manet kernel: ret_from_fork+0x24/0x30 The proximal cause is that this rpcrdma_rep has a rr_rdmabuf that is still pointing to the old ib_device, which has been freed. The only way that is possible is if this rpcrdma_rep was not destroyed by rpcrdma_ia_remove. Debugging showed that was indeed the case: this rpcrdma_rep was still in use by a completing RPC at the time of the device removal, and thus wasn't on the rep free list. So, it was not found by rpcrdma_reps_destroy(). The fix is to introduce a list of all rpcrdma_reps so that they all can be found when a device is removed. That list is used to perform only regbuf DMA unmapping, replacing that call to rpcrdma_reps_destroy(). Meanwhile, to prevent corruption of this list, I've moved the destruction of temp rpcrdma_rep objects to rpcrdma_post_recvs(). rpcrdma_xprt_drain() ensures that post_recvs (and thus rep_destroy) is not invoked while rpcrdma_reps_unmap is walking rb_all_reps, thus protecting the rb_all_reps list. Fixes: b0b227f071a0 ("xprtrdma: Use an llist to manage free rpcrdma_reps") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-01-14xprtrdma: Fix completion wait during device removalChuck Lever
I've found that on occasion, "rmmod <dev>" will hang while if an NFS is under load. Ensure that ri_remove_done is initialized only just before the transport is woken up to force a close. This avoids the completion possibly getting initialized again while the CM event handler is waiting for a wake-up. Fixes: bebd031866ca ("xprtrdma: Support unplugging an HCA from under an NFS mount") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-01-14xprtrdma: Fix create_qp crash on device unloadChuck Lever
On device re-insertion, the RDMA device driver crashes trying to set up a new QP: Nov 27 16:32:06 manet kernel: BUG: kernel NULL pointer dereference, address: 00000000000001c0 Nov 27 16:32:06 manet kernel: #PF: supervisor write access in kernel mode Nov 27 16:32:06 manet kernel: #PF: error_code(0x0002) - not-present page Nov 27 16:32:06 manet kernel: PGD 0 P4D 0 Nov 27 16:32:06 manet kernel: Oops: 0002 [#1] SMP Nov 27 16:32:06 manet kernel: CPU: 1 PID: 345 Comm: kworker/u28:0 Tainted: G W 5.4.0 #852 Nov 27 16:32:06 manet kernel: Hardware name: Supermicro SYS-6028R-T/X10DRi, BIOS 1.1a 10/16/2015 Nov 27 16:32:06 manet kernel: Workqueue: xprtiod xprt_rdma_connect_worker [rpcrdma] Nov 27 16:32:06 manet kernel: RIP: 0010:atomic_try_cmpxchg+0x2/0x12 Nov 27 16:32:06 manet kernel: Code: ff ff 48 8b 04 24 5a c3 c6 07 00 0f 1f 40 00 c3 31 c0 48 81 ff 08 09 68 81 72 0c 31 c0 48 81 ff 83 0c 68 81 0f 92 c0 c3 8b 06 <f0> 0f b1 17 0f 94 c2 84 d2 75 02 89 06 88 d0 c3 53 ba 01 00 00 00 Nov 27 16:32:06 manet kernel: RSP: 0018:ffffc900035abbf0 EFLAGS: 00010046 Nov 27 16:32:06 manet kernel: RAX: 0000000000000000 RBX: 00000000000001c0 RCX: 0000000000000000 Nov 27 16:32:06 manet kernel: RDX: 0000000000000001 RSI: ffffc900035abbfc RDI: 00000000000001c0 Nov 27 16:32:06 manet kernel: RBP: ffffc900035abde0 R08: 000000000000000e R09: ffffffffffffc000 Nov 27 16:32:06 manet kernel: R10: 0000000000000000 R11: 000000000002e800 R12: ffff88886169d9f8 Nov 27 16:32:06 manet kernel: R13: ffff88886169d9f4 R14: 0000000000000246 R15: 0000000000000000 Nov 27 16:32:06 manet kernel: FS: 0000000000000000(0000) GS:ffff88846fa40000(0000) knlGS:0000000000000000 Nov 27 16:32:06 manet kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Nov 27 16:32:06 manet kernel: CR2: 00000000000001c0 CR3: 0000000002009006 CR4: 00000000001606e0 Nov 27 16:32:06 manet kernel: Call Trace: Nov 27 16:32:06 manet kernel: do_raw_spin_lock+0x2f/0x5a Nov 27 16:32:06 manet kernel: create_qp_common.isra.47+0x856/0xadf [mlx4_ib] Nov 27 16:32:06 manet kernel: ? slab_post_alloc_hook.isra.60+0xa/0x1a Nov 27 16:32:06 manet kernel: ? __kmalloc+0x125/0x139 Nov 27 16:32:06 manet kernel: mlx4_ib_create_qp+0x57f/0x972 [mlx4_ib] The fix is to copy the qp_init_attr struct that was just created by rpcrdma_ep_create() instead of using the one from the previous connection instance. Fixes: 98ef77d1aaa7 ("xprtrdma: Send Queue size grows after a reconnect") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-12-07Merge tag 'nfsd-5.5' of git://linux-nfs.org/~bfields/linuxLinus Torvalds
Pull nfsd updates from Bruce Fields: "This is a relatively quiet cycle for nfsd, mainly various bugfixes. Possibly most interesting is Trond's fixes for some callback races that were due to my incomplete understanding of rpc client shutdown. Unfortunately at the last minute I've started noticing a new intermittent failure to send callbacks. As the logic seems basically correct, I'm leaving Trond's patches in for now, and hope to find a fix in the next week so I don't have to revert those patches" * tag 'nfsd-5.5' of git://linux-nfs.org/~bfields/linux: (24 commits) nfsd: depend on CRYPTO_MD5 for legacy client tracking NFSD fixing possible null pointer derefering in copy offload nfsd: check for EBUSY from vfs_rmdir/vfs_unink. nfsd: Ensure CLONE persists data and metadata changes to the target file SUNRPC: Fix backchannel latency metrics nfsd: restore NFSv3 ACL support nfsd: v4 support requires CRYPTO_SHA256 nfsd: Fix cld_net->cn_tfm initialization lockd: remove __KERNEL__ ifdefs sunrpc: remove __KERNEL__ ifdefs race in exportfs_decode_fh() nfsd: Drop LIST_HEAD where the variable it declares is never used. nfsd: document callback_wq serialization of callback code nfsd: mark cb path down on unknown errors nfsd: Fix races between nfsd4_cb_release() and nfsd4_shutdown_callback() nfsd: minor 4.1 callback cleanup SUNRPC: Fix svcauth_gss_proxy_init() SUNRPC: Trace gssproxy upcall results sunrpc: fix crash when cache_head become valid before update nfsd: remove private bin2hex implementation ...
2019-11-21SUNRPC: Fix backchannel latency metricsChuck Lever
I noticed that for callback requests, the reported backlog latency is always zero, and the rtt value is crazy big. The problem was that rqst->rq_xtime is never set for backchannel requests. Fixes: 78215759e20d ("SUNRPC: Make RTT measurement more ... ") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-11-18Merge tag 'nfs-rdma-for-5.5-1' of ↵Trond Myklebust
git://git.linux-nfs.org/projects/anna/linux-nfs NFSoRDMA Client Updates for Linux 5.5 New Features: - New tracepoints for congestion control and Local Invalidate WRs Bugfixes and Cleanups: - Eliminate log noise in call_reserveresult - Fix unstable connections after a reconnect - Clean up some code duplication - Close race between waking a sender and posting a receive - Fix MR list corruption, and clean up MR usage - Remove unused rpcrdma_sendctx fields - Try to avoid DMA mapping pages if it is too costly - Wake pending tasks if connection fails - Replace some dprintk()s with tracepoints
2019-10-30SUNRPC: The RDMA back channel mustn't disappear while requests are outstandingTrond Myklebust
If there are RDMA back channel requests being processed by the server threads, then we should hold a reference to the transport to ensure it doesn't get freed from underneath us. Reported-by: Neil Brown <neilb@suse.de> Fixes: 63cae47005af ("xprtrdma: Handle incoming backward direction RPC calls") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-10-24xprtrdma: Replace dprintk in xprt_rdma_set_portChuck Lever
Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>