summaryrefslogtreecommitdiff
path: root/net/sunrpc/xprtrdma
AgeCommit message (Collapse)Author
2018-05-11svcrdma: Introduce svc_rdma_recv_ctxtChuck Lever
svc_rdma_op_ctxt's are pre-allocated and maintained on a per-xprt free list. This eliminates the overhead of calling kmalloc / kfree, both of which grab a globally shared lock that disables interrupts. To reduce contention further, separate the use of these objects in the Receive and Send paths in svcrdma. Subsequent patches will take advantage of this separation by allocating real resources which are then cached in these objects. The allocations are freed when the transport is torn down. I've renamed the structure so that static type checking can be used to ensure that uses of op_ctxt and recv_ctxt are not confused. As an additional clean up, structure fields are renamed to conform with kernel coding conventions. As a final clean up, helpers related to recv_ctxt are moved closer to the functions that use them. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-05-11svcrdma: Trace key RDMA API eventsChuck Lever
This includes: * Posting on the Send and Receive queues * Send, Receive, Read, and Write completion * Connect upcalls * QP errors Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-05-11svcrdma: Trace key RPC/RDMA protocol eventsChuck Lever
This includes: * Transport accept and tear-down * Decisions about using Write and Reply chunks * Each RDMA segment that is handled * Whenever an RDMA_ERR is sent As a clean-up, I've standardized the order of the includes, and removed some now redundant dprintk call sites. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-05-11xprtrdma: Prepare RPC/RDMA includes for server-side trace pointsChuck Lever
Clean up: Move #include <trace/events/rpcrdma.h> into source files, similar to how it is done with trace/events/sunrpc.h. Server-side trace points will be part of the rpcrdma subsystem, just like the client-side trace points. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-05-11svcrdma: Use passed-in net namespace when creating RDMA listenerChuck Lever
Ensure each RDMA listener and its children transports are created in the same net namespace as the user that started the NFS service. This is similar to how listener sockets are created in svc_create_socket, required for enabling support for containers. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-05-11svcrdma: Add proper SPDX tags for NetApp-contributed sourceChuck Lever
Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-05-07xprtrdma: Make rpcrdma_sendctx_put_locked() a static functionChuck Lever
Clean up: The only call site is in the same file as the function's definition. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-05-07xprtrdma: Remove rpcrdma_buffer_get_rep_locked()Chuck Lever
Clean up: There is only one remaining call site for this helper. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-05-07xprtrdma: Remove rpcrdma_buffer_get_req_locked()Chuck Lever
Clean up. There is only one call-site for this helper, and it can be simplified by using list_first_entry_or_null(). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-05-07xprtrdma: Remove rpcrdma_ep_{post_recv, post_extra_recv}Chuck Lever
Clean up: These functions are no longer used. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-05-07xprtrdma: Move Receive posting to Receive handlerChuck Lever
Receive completion and Reply handling are done by a BOUND workqueue, meaning they run on only one CPU. Posting receives is currently done in the send_request path, which on large systems is typically done on a different CPU than the one handling Receive completions. This results in movement of Receive-related cachelines between the sending and receiving CPUs. More importantly, it means that currently Receives are posted while the transport's write lock is held, which is unnecessary and costly. Finally, allocation of Receive buffers is performed on-demand in the Receive completion handler. This helps guarantee that they are allocated on the same NUMA node as the CPU that handles Receive completions. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-05-07xprtrdma: Clean up Receive trace pointsChuck Lever
For clarity, report the posting and completion of Receive CQEs. Also, the wc->byte_len field contains garbage if wc->status is non-zero, and the vendor error field contains garbage if wc->status is zero. For readability, don't save those fields in those cases. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-05-07xprtrdma: Make rpc_rqst part of rpcrdma_reqChuck Lever
This simplifies allocation of the generic RPC slot and xprtrdma specific per-RPC resources. It also makes xprtrdma more like the socket-based transports: ->buf_alloc and ->buf_free are now responsible only for send and receive buffers. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-05-07xprtrdma: Introduce ->alloc_slot call-out for xprtrdmaChuck Lever
rpcrdma_buffer_get acquires an rpcrdma_req and rep for each RPC. Currently this is done in the call_allocate action, and sometimes it can fail if there are many outstanding RPCs. When call_allocate fails, the RPC task is put on the delayq. It is awoken a few milliseconds later, but there's no guarantee it will get a buffer at that time. The RPC task can be repeatedly put back to sleep or even starved. The call_allocate action should rarely fail. The delayq mechanism is not meant to deal with transport congestion. In the current sunrpc stack, there is a friendlier way to deal with this situation. These objects are actually tantamount to an RPC slot (rpc_rqst) and there is a separate FSM action, distinct from call_allocate, for allocating slot resources. This is the call_reserve action. When allocation fails during this action, the RPC is placed on the transport's backlog queue. The backlog mechanism provides a stronger guarantee that when the RPC is awoken, a buffer will be available for it; and backlogged RPCs are awoken one-at-a-time. To make slot resource allocation occur in the call_reserve action, create special ->alloc_slot and ->free_slot call-outs for xprtrdma. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-05-07SUNRPC: Add a ->free_slot transport calloutChuck Lever
Refactor: xprtrdma needs to have better control over when RPCs are awoken from the backlog queue, so replace xprt_free_slot with a transport op callout. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-05-07xprtrdma: Fix max_send_wr computationChuck Lever
For FRWR, the computation of max_send_wr is split between frwr_op_open and rpcrdma_ep_create, which makes it difficult to tell that the max_send_wr result is currently incorrect if frwr_op_open has to reduce the credit limit to accommodate a small max_qp_wr. This is a problem now that extra WRs are needed for backchannel operations and a drain CQE. So, refactor the computation so that it is all done in ->ro_open, and fix the FRWR version of this computation so that it accommodates HCAs with small max_qp_wr correctly. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-05-07xprtrdma: Create transport's CM ID in the correct network namespaceChuck Lever
Set up RPC/RDMA transport in mount.nfs's network namespace. This passes the correct namespace information to the RDMA core, similar to how RPC sockets are created (see xs_create_sock). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-05-07xprtrdma: Try to fail quickly if proto=rdmaChuck Lever
rdma_resolve_addr(3) says: > This call is used to map a given destination IP address to a > usable RDMA address. The IP to RDMA address mapping is done > using the local routing tables, or via ARP. If this can't be done, there's no local device that can be used to establish an RDMA-capable network path to the remote. In this case, the RDMA CM very quickly posts an RDMA_CM_EVENT_ADDR_ERROR upcall. Currently rpcrdma_conn_upcall() converts RDMA_CM_EVENT_ADDR_ERROR to EHOSTUNREACH. mount.nfs seems to want to retry EHOSTUNREACH forever, thinking that this is a temporary situation. This makes mount.nfs appear to hang if I try to mount with proto=rdma through, say, a conventional Ethernet device. If the admin has specified proto=rdma along with a server IP address that requires a network path that does not support RDMA, instead let's fail with a permanent error. -EPROTONOSUPPORT is returned when NFSv4 or one of its minor versions is not supported. -EPROTO is not (currently) retried by mount.nfs. There are potentially other similar cases where -EPROTO is an appropriate return code. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Olga Kornievskaia <kolga@netapp.com> Tested-by: Anna Schumaker <Anna.Schumaker@netapp.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-05-07xprtrdma: Add proper SPDX tags for NetApp-contributed sourceChuck Lever
Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-05-01xprtrdma: Fix list corruption / DMAR errors during MR recoveryChuck Lever
The ro_release_mr methods check whether mr->mr_list is empty. Therefore, be sure to always use list_del_init when removing an MR linked into a list using that field. Otherwise, when recovering from transport failures or device removal, list corruption can result, or MRs can get mapped or unmapped an odd number of times, resulting in IOMMU-related failures. In general this fix is appropriate back to v4.8. However, code changes since then make it impossible to apply this patch directly to stable kernels. The fix would have to be applied by hand or reworked for kernels earlier than v4.16. Backport guidance -- there are several cases: - When creating an MR, initialize mr_list so that using list_empty on an as-yet-unused MR is safe. - When an MR is being handled by the remote invalidation path, ensure that mr_list is reinitialized when it is removed from rl_registered. - When an MR is being handled by rpcrdma_destroy_mrs, it is removed from mr_all, but it may still be on an rl_registered list. In that case, the MR needs to be removed from that list before being released. - Other cases are covered by using list_del_init in rpcrdma_mr_pop. Fixes: 9d6b04097882 ('xprtrdma: Place registered MWs on a ... ') Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-12Merge tag 'nfs-for-4.17-1' of git://git.linux-nfs.org/projects/anna/linux-nfsLinus Torvalds
Pull NFS client updates from Anna Schumaker: "Stable bugfixes: - xprtrdma: Fix corner cases when handling device removal # v4.12+ - xprtrdma: Fix latency regression on NUMA NFS/RDMA clients # v4.15+ Features: - New sunrpc tracepoint for RPC pings - Finer grained NFSv4 attribute checking - Don't unnecessarily return NFS v4 delegations Other bugfixes and cleanups: - Several other small NFSoRDMA cleanups - Improvements to the sunrpc RTT measurements - A few sunrpc tracepoint cleanups - Various fixes for NFS v4 lock notifications - Various sunrpc and NFS v4 XDR encoding cleanups - Switch to the ida_simple API - Fix NFSv4.1 exclusive create - Forget acl cache after setattr operation - Don't advance the nfs_entry readdir cookie if xdr decoding fails" * tag 'nfs-for-4.17-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (47 commits) NFS: advance nfs_entry cookie only after decoding completes successfully NFSv3/acl: forget acl cache after setattr NFSv4.1: Fix exclusive create NFSv4: Declare the size up to date after it was set. nfs: Use ida_simple API NFSv4: Fix the nfs_inode_set_delegation() arguments NFSv4: Clean up CB_GETATTR encoding NFSv4: Don't ask for attributes when ACCESS is protected by a delegation NFSv4: Add a helper to encode/decode struct timespec NFSv4: Clean up encode_attrs NFSv4; Clean up XDR encoding of type bitmap4 NFSv4: Allow GFP_NOIO sleeps in decode_attr_owner/decode_attr_group SUNRPC: Add a helper for encoding opaque data inline SUNRPC: Add helpers for decoding opaque and string types NFSv4: Ignore change attribute invalidations if we hold a delegation NFS: More fine grained attribute tracking NFS: Don't force unnecessary cache invalidation in nfs_update_inode() NFS: Don't redirty the attribute cache in nfs_wcc_update_inode() NFS: Don't force a revalidation of all attributes if change is missing NFS: Convert NFS_INO_INVALID flags to unsigned long ...
2018-04-10xprtrdma: Fix corner cases when handling device removalChuck Lever
Michal Kalderon has found some corner cases around device unload with active NFS mounts that I didn't have the imagination to test when xprtrdma device removal was added last year. - The ULP device removal handler is responsible for deallocating the PD. That wasn't clear to me initially, and my own testing suggested it was not necessary, but that is incorrect. - The transport destruction path can no longer assume that there is a valid ID. - When destroying a transport, ensure that ib_free_cq() is not invoked on a CQ that was already released. Reported-by: Michal Kalderon <Michal.Kalderon@cavium.com> Fixes: bebd031866ca ("xprtrdma: Support unplugging an HCA from ...") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Cc: stable@vger.kernel.org # v4.12+ Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10SUNRPC: Make RTT measurement more precise (Send)Chuck Lever
Some RPC transports have more overhead in their send_request callouts than others. For example, for RPC-over-RDMA: - Marshaling an RPC often has to DMA map the RPC arguments - Registration methods perform memory registration as part of marshaling To capture just server and network latencies more precisely: when sending a Call, capture the rq_xtime timestamp _after_ the transport header has been marshaled. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10xprtrdma: Move creation of rl_rdmabuf to rpcrdma_create_reqChuck Lever
Refactor: Both rpcrdma_create_req call sites have to allocate the buffer where the transport header is built, so just move that allocation into rpcrdma_create_req. This buffer is a fixed size. There's no needed information available in call_allocate that is not also available when the transport is created. The original purpose for allocating these buffers on demand was to reduce the possibility that an allocation failure during transport creation will hork the mount operation during low memory scenarios. Some relief for this rare possibility is coming up in the next few patches. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10xprtrdma: Chain Send to FastReg WRsChuck Lever
With FRWR, the client transport can perform memory registration and post a Send with just a single ib_post_send. This reduces contention between the send_request path and the Send Completion handlers, and reduces the overhead of registering a chunk that has multiple segments. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10xprtrdma: "Support" call-only RPCsChuck Lever
RPC-over-RDMA version 1 credit accounting relies on there being a response message for every RPC Call. This means that RPC procedures that have no reply will disrupt credit accounting, just in the same way as a retransmit would (since it is sent because no reply has arrived). Deal with the "no reply" case the same way. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10xprtrdma: Reduce number of MRs created by rpcrdma_mrs_createChuck Lever
Create fewer MRs on average. Many workloads don't need as many as 32 MRs, and the transport can now quickly restock the MR free list. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10xprtrdma: ->send_request returns -EAGAIN when there are no free MRsChuck Lever
Currently, when the MR free list is exhausted during marshaling, the RPC/RDMA transport places the RPC task on the delayq, which forces a wait for HZ >> 2 before the marshal and send is retried. With this change, the transport now places such an RPC task on the pending queue, and wakes it just as soon as more MRs have been created. Creating more MRs typically takes less than a millisecond, and this waking mechanism is less deadlock-prone. Moreover, the waiting RPC task is holding the transport's write lock, which blocks the transport from sending RPCs. Therefore faster recovery from MR exhaustion is desirable. This is the same mechanism that the TCP transport utilizes when handling write buffer space exhaustion. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10xprtrdma: Remove xprt-specific connect cookieChuck Lever
Clean up: The generic rq_connect_cookie is sufficient to detect RPC Call retransmission. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10xprtrdma: Remove arbitrary limit on initiator depthChuck Lever
Clean up: We need to check only that the value does not exceed the range of the u8 field it's going into. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10xprtrdma: Fix latency regression on NUMA NFS/RDMA clientsChuck Lever
With v4.15, on one of my NFS/RDMA clients I measured a nearly doubling in the latency of small read and write system calls. There was no change in server round trip time. The extra latency appears in the whole RPC execution path. "git bisect" settled on commit ccede7598588 ("xprtrdma: Spread reply processing over more CPUs") . After some experimentation, I found that leaving the WQ bound and allowing the scheduler to pick the dispatch CPU seems to eliminate the long latencies, and it does not introduce any new regressions. The fix is implemented by reverting only the part of commit ccede7598588 ("xprtrdma: Spread reply processing over more CPUs") that dispatches RPC replies specifically on the CPU where the matching RPC call was made. Interestingly, saving the CPU number and later queuing reply processing there was effective _only_ for a NFS READ and WRITE request. On my NUMA client, in-kernel RPC reply processing for asynchronous RPCs was dispatched on the same CPU where the RPC call was made, as expected. However synchronous RPCs seem to get their reply dispatched on some other CPU than where the call was placed, every time. Fixes: ccede7598588 ("xprtrdma: Spread reply processing over ... ") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Cc: stable@vger.kernel.org # v4.15+ Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-03sunrpc: Save remote presentation address in svc_xprt for trace eventsChuck Lever
TP_printk defines a format string that is passed to user space for converting raw trace event records to something human-readable. My user space's printf (Oracle Linux 7), however, does not have a %pI format specifier. The result is that what is supposed to be an IP address in the output of "trace-cmd report" is just a string that says the field couldn't be displayed. To fix this, adopt the same approach as the client: maintain a pre- formated presentation address for occasions when %pI is not available. The location of the trace_svc_send trace point is adjusted so that rqst->rq_xprt is not NULL when the trace event is recorded. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-04-03svc: Simplify ->xpo_secure_portChuck Lever
Clean up: Instead of returning a value that is used to set or clear a bit, just make ->xpo_secure_port mangle that bit, and return void. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-03-20svcrdma: Clean up rdma_build_arg_xdrChuck Lever
Clean up: The value of the byte_count parameter is already passed to rdma_build_arg_xdr as part of the svc_rdma_op_ctxt structure. Further, without the parameter called "byte_count" there is no need to have the abbreviated "bc" automatic variable. "bc" can now be called something more intuitive. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-03-20svcrdma: Consult max_qp_init_rd_atom when accepting connectionsChuck Lever
The target needs to return the lesser of the client's Inbound RDMA Read Queue Depth (IRD), provided in the connection parameters, and the local device's Outbound RDMA Read Queue Depth (ORD). The latter limit is max_qp_init_rd_atom, not max_qp_rd_atom. The svcrdma_ord value caps the ORD value for iWARP transports, which do not exchange ORD/IRD values at connection time. Since no other Linux kernel RDMA-enabled storage target sees fit to provide this cap, I'm removing it here too. initiator_depth is a u8, so ensure the computed ORD value does not overflow that field. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-03-20svcrdma: Use pr_err to report Receive errorsChuck Lever
Clean up: Other completion handlers use pr_err, not pr_warn. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-02-09Merge tag 'nfs-for-4.16-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds
Pull more NFS client updates from Trond Myklebust: "A few bugfixes and some small sunrpc latency/performance improvements before the merge window closes: Stable fixes: - fix an incorrect calculation of the RDMA send scatter gather element limit - fix an Oops when attempting to free resources after RDMA device removal Bugfixes: - SUNRPC: Ensure we always release the TCP socket in a timely fashion when the connection is shut down. - SUNRPC: Don't call __UDPX_INC_STATS() from a preemptible context Latency/Performance: - SUNRPC: Queue latency sensitive socket tasks to the less contended xprtiod queue - SUNRPC: Make the xprtiod workqueue unbounded. - SUNRPC: Make the rpciod workqueue unbounded" * tag 'nfs-for-4.16-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: SUNRPC: Don't call __UDPX_INC_STATS() from a preemptible context fix parallelism for rpc tasks Make the xprtiod workqueue unbounded. SUNRPC: Queue latency-sensitive socket tasks to xprtiod SUNRPC: Ensure we always close the socket after a connection shuts down xprtrdma: Fix BUG after a device removal xprtrdma: Fix calculation of ri_max_send_sges
2018-02-08Merge tag 'nfsd-4.16' of git://linux-nfs.org/~bfields/linuxLinus Torvalds
Pull nfsd update from Bruce Fields: "A fairly small update this time around. Some cleanup, RDMA fixes, overlayfs fixes, and a fix for an NFSv4 state bug. The bigger deal for nfsd this time around was Jeff Layton's already-merged i_version patches" * tag 'nfsd-4.16' of git://linux-nfs.org/~bfields/linux: svcrdma: Fix Read chunk round-up NFSD: hide unused svcxdr_dupstr() nfsd: store stat times in fill_pre_wcc() instead of inode times nfsd: encode stat->mtime for getattr instead of inode->i_mtime nfsd: return RESOURCE not GARBAGE_ARGS on too many ops nfsd4: don't set lock stateid's sc_type to CLOSED nfsd: Detect unhashed stids in nfsd4_verify_open_stid() sunrpc: remove dead code in svc_sock_setbufsize svcrdma: Post Receives in the Receive completion handler nfsd4: permit layoutget of executable-only files lockd: convert nlm_rqst.a_count from atomic_t to refcount_t lockd: convert nlm_lockowner.count from atomic_t to refcount_t lockd: convert nsm_handle.sm_count from atomic_t to refcount_t
2018-02-08svcrdma: Fix Read chunk round-upChuck Lever
A single NFSv4 WRITE compound can often have three operations: PUTFH, WRITE, then GETATTR. When the WRITE payload is sent in a Read chunk, the client places the GETATTR in the inline part of the RPC/RDMA message, just after the WRITE operation (sans payload). The position value in the Read chunk enables the receiver to insert the Read chunk at the correct place in the received XDR stream; that is between the WRITE and GETATTR. According to RFC 8166, an NFS/RDMA client does not have to add XDR round-up to the Read chunk that carries the WRITE payload. The receiver adds XDR round-up padding if it is absent and the receiver's XDR decoder requires it to be present. Commit 193bcb7b3719 ("svcrdma: Populate tail iovec when receiving") attempted to add support for receiving such a compound so that just the WRITE payload appears in rq_arg's page list, and the trailing GETATTR is placed in rq_arg's tail iovec. (TCP just strings the whole compound into the head iovec and page list, without regard to the alignment of the WRITE payload). The server transport logic also had to accommodate the optional XDR round-up of the Read chunk, which it did simply by lengthening the tail iovec when round-up was needed. This approach is adequate for the NFSv2 and NFSv3 WRITE decoders. Unfortunately it is not sufficient for nfsd4_decode_write. When the Read chunk length is a couple of bytes less than PAGE_SIZE, the computation at the end of nfsd4_decode_write allows argp->pagelen to go negative, which breaks the logic in read_buf that looks for the tail iovec. The result is that a WRITE operation whose payload length is just less than a multiple of a page succeeds, but the subsequent GETATTR in the same compound fails with NFS4ERR_OP_ILLEGAL because the XDR decoder can't find it. Clients ignore the error, but they must update their attribute cache via a separate round trip. As nfsd4_decode_write appears to expect the payload itself to always have appropriate XDR round-up, have svc_rdma_build_normal_read_chunk add the Read chunk XDR round-up to the page_len rather than lengthening the tail iovec. Reported-by: Olga Kornievskaia <kolga@netapp.com> Fixes: 193bcb7b3719 ("svcrdma: Populate tail iovec when receiving") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-02-02xprtrdma: Fix BUG after a device removalChuck Lever
Michal Kalderon reports a BUG that occurs just after device removal: [ 169.112490] rpcrdma: removing device qedr0 for 192.168.110.146:20049 [ 169.143909] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010 [ 169.181837] IP: rpcrdma_dma_unmap_regbuf+0xa/0x60 [rpcrdma] The RPC/RDMA client transport attempts to allocate some resources on demand. Registered buffers are one such resource. These are allocated (or re-allocated) by xprt_rdma_allocate to hold RPC Call and Reply messages. A hardware resource is associated with each of these buffers, as they can be used for a Send or Receive Work Request. If a device is removed from under an NFS/RDMA mount, the transport layer is responsible for releasing all hardware resources before the device can be finally unplugged. A BUG results when the NFS mount hasn't yet seen much activity: the transport tries to release resources that haven't yet been allocated. rpcrdma_free_regbuf() already checks for this case, so just move that check to cover the DEVICE_REMOVAL case as well. Reported-by: Michal Kalderon <Michal.Kalderon@cavium.com> Fixes: bebd031866ca ("xprtrdma: Support unplugging an HCA ...") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Michal Kalderon <Michal.Kalderon@cavium.com> Cc: stable@vger.kernel.org # v4.12+ Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-02-02xprtrdma: Fix calculation of ri_max_send_sgesChuck Lever
Commit 16f906d66cd7 ("xprtrdma: Reduce required number of send SGEs") introduced the rpcrdma_ia::ri_max_send_sges field. This fixes a problem where xprtrdma would not work if the device's max_sge capability was small (low single digits). At least RPCRDMA_MIN_SEND_SGES are needed for the inline parts of each RPC. ri_max_send_sges is set to this value: ia->ri_max_send_sges = max_sge - RPCRDMA_MIN_SEND_SGES; Then when marshaling each RPC, rpcrdma_args_inline uses that value to determine whether the device has enough Send SGEs to convey an NFS WRITE payload inline, or whether instead a Read chunk is required. More recently, commit ae72950abf99 ("xprtrdma: Add data structure to manage RDMA Send arguments") used the ri_max_send_sges value to calculate the size of an array, but that commit erroneously assumed ri_max_send_sges contains a value similar to the device's max_sge, and not one that was reduced by the minimum SGE count. This assumption results in the calculated size of the sendctx's Send SGE array to be too small. When the array is used to marshal an RPC, the code can write Send SGEs into the following sendctx element in that array, corrupting it. When the device's max_sge is large, this issue is entirely harmless; but it results in an oops in the provider's post_send method, if dev.attrs.max_sge is small. So let's straighten this out: ri_max_send_sges will now contain a value with the same meaning as dev.attrs.max_sge, which makes the code easier to understand, and enables rpcrdma_sendctx_create to calculate the size of the SGE array correctly. Reported-by: Michal Kalderon <Michal.Kalderon@cavium.com> Fixes: 16f906d66cd7 ("xprtrdma: Reduce required number of send SGEs") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Michal Kalderon <Michal.Kalderon@cavium.com> Cc: stable@vger.kernel.org # v4.10+ Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-01-23SUNRPC: Trace xprt_timer eventsChuck Lever
Track RPC timeouts: report the XID and the server address to match the content of network capture. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-01-23xprtrdma: Correct some documenting commentsChuck Lever
Fix kernel-doc warnings in net/sunrpc/xprtrdma/ . net/sunrpc/xprtrdma/verbs.c:1575: warning: No description found for parameter 'count' net/sunrpc/xprtrdma/verbs.c:1575: warning: Excess function parameter 'min_reqs' description in 'rpcrdma_ep_post_extra_recv' net/sunrpc/xprtrdma/backchannel.c:288: warning: No description found for parameter 'r_xprt' net/sunrpc/xprtrdma/backchannel.c:288: warning: Excess function parameter 'xprt' description in 'rpcrdma_bc_receive_call' Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-01-23xprtrdma: Fix "bytes registered" accountingChuck Lever
The contents of seg->mr_len changed when ->ro_map stopped returning the full chunk length in the first segment. Count the full length of each Write chunk, not the length of the first segment (which now can only be as large as a page). Fixes: 9d6b04097882 ("xprtrdma: Place registered MWs on a ... ") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-01-23xprtrdma: Instrument allocation/release of rpcrdma_req/rep objectsChuck Lever
Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-01-23xprtrdma: Add trace points to instrument QP and CQ access upcallsChuck Lever
Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-01-23xprtrdma: Add trace points in the client-side backchannel code pathsChuck Lever
Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-01-23xprtrdma: Add trace points for connect eventsChuck Lever
Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-01-23xprtrdma: Add trace points to instrument MR allocation and recoveryChuck Lever
Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-01-23xprtrdma: Add trace points to instrument memory invalidationChuck Lever
Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>