summaryrefslogtreecommitdiff
path: root/net/sunrpc/xprtrdma/verbs.c
AgeCommit message (Collapse)Author
2019-04-11xprtrdma: Fix helper that drains the transportChuck Lever
We want to drain only the RQ first. Otherwise the transport can deadlock on ->close if there are outstanding Send completions. Fixes: 6d2d0ee27c7a ("xprtrdma: Replace rpcrdma_receive_wq ... ") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Cc: stable@vger.kernel.org # v5.0+ Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-02-25Merge tag 'nfs-rdma-for-5.1-1' of ↵Trond Myklebust
git://git.linux-nfs.org/projects/anna/linux-nfs NFSoRDMA client updates for 5.1 New features: - Convert rpc auth layer to use xdr_streams - Config option to disable insecure enctypes - Reduce size of RPC receive buffers Bugfixes and cleanups: - Fix sparse warnings - Check inline size before providing a write chunk - Reduce the receive doorbell rate - Various tracepoint improvements [Trond: Fix up merge conflicts] Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-02-13xprtrdma: Reduce the doorbell rate (Receive)Chuck Lever
Post RECV WRs in batches to reduce the hardware doorbell rate per transport. This helps the RPC-over-RDMA client scale better in number of transports. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-02-12xprtrdma: Make sure Send CQ is allocated on an existing compvecNicolas Morey-Chaisemartin
Make sure the device has at least 2 completion vectors before allocating to compvec#1 Fixes: a4699f5647f3 (xprtrdma: Put Send CQ in IB_POLL_WORKQUEUE mode) Signed-off-by: Nicolas Morey-Chaisemartin <nmoreychaisemartin@suse.com> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-01-08xprtrdma: Double free in rpcrdma_sendctxs_create()Dan Carpenter
The clean up is handled by the caller, rpcrdma_buffer_create(), so this call to rpcrdma_sendctxs_destroy() leads to a double free. Fixes: ae72950abf99 ("xprtrdma: Add data structure to manage RDMA Send arguments") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-01-08xprtrdma: Fix error code in rpcrdma_buffer_create()Dan Carpenter
This should return -ENOMEM if __alloc_workqueue_key() fails, but it returns success. Fixes: 6d2d0ee27c7a ("xprtrdma: Replace rpcrdma_receive_wq with a per-xprt workqueue") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-01-02xprtrdma: Add documenting comment for rpcrdma_buffer_destroyChuck Lever
Make a note of the function's dependency on an earlier ib_drain_qp. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-01-02xprtrdma: Replace outdated comment for rpcrdma_ep_postChuck Lever
Since commit 7c8d9e7c8863 ("xprtrdma: Move Receive posting to Receive handler"), rpcrdma_ep_post is no longer responsible for posting Receive buffers. Update the documenting comment to reflect this change. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-01-02xprtrdma: Trace mapping, alloc, and dereg failuresChuck Lever
These are rare, but can be helpful at tracking down DMAR and other problems. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-01-02xprtrdma: Cull dprintk() call sitesChuck Lever
Clean up: Remove dprintk() call sites that report rare or impossible errors. Leave a few that display high-value low noise status information. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-01-02xprtrdma: Simplify locking that protects the rl_allreqs listChuck Lever
Clean up: There's little chance of contention between the use of rb_lock and rb_reqslock, so merge the two. This avoids having to take both in some (possibly future) cases. Transport tear-down is already serialized, thus there is no need for locking at all when destroying rpcrdma_reqs. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-01-02xprtrdma: Remove rpcrdma_memreg_opsChuck Lever
Clean up: Now that there is only FRWR, there is no need for a memory registration switch. The indirect calls to the memreg operations can be replaced with faster direct calls. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-01-02xprtrdma: Remove support for FMR memory registrationChuck Lever
FMR is not supported on most recent RDMA devices. It is also less secure than FRWR because an FMR memory registration can expose adjacent bytes to remote reading or writing. As discussed during the RDMA BoF at LPC 2018, it is time to remove support for FMR in the NFS/RDMA client stack. Note that NFS/RDMA server-side uses either local memory registration or FRWR. FMR is not used. There are a few Infiniband/RoCE devices in the kernel tree that do not appear to support MEM_MGT_EXTENSIONS (FRWR), and therefore will not support client-side NFS/RDMA after this patch. These are: - mthca - qib - hns (RoCE) Users of these devices can use NFS/TCP on IPoIB instead. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-01-02xprtrdma: Don't wake pending tasks until disconnect is doneChuck Lever
Transport disconnect processing does a "wake pending tasks" at various points. Suppose an RPC Reply is being processed. The RPC task that Reply goes with is waiting on the pending queue. If a disconnect wake-up happens before reply processing is done, that reply, even if it is good, is thrown away, and the RPC has to be sent again. This window apparently does not exist for socket transports because there is a lock held while a reply is being received which prevents the wake-up call until after reply processing is done. To resolve this, all RPC replies being processed on an RPC-over-RDMA transport have to complete before pending tasks are awoken due to a transport disconnect. Callers that already hold the transport write lock may invoke ->ops->close directly. Others use a generic helper that schedules a close when the write lock can be taken safely. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-01-02xprtrdma: No qp_event disconnectChuck Lever
After thinking about this more, and auditing other kernel ULP imple- mentations, I believe that a DISCONNECT cm_event will occur after a fatal QP event. If that's the case, there's no need for an explicit disconnect in the QP event handler. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-01-02xprtrdma: Replace rpcrdma_receive_wq with a per-xprt workqueueChuck Lever
To address a connection-close ordering problem, we need the ability to drain the RPC completions running on rpcrdma_receive_wq for just one transport. Give each transport its own RPC completion workqueue, and drain that workqueue when disconnecting the transport. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-01-02xprtrdma: Refactor Receive accountingChuck Lever
Clean up: Divide the work cleanly: - rpcrdma_wc_receive is responsible only for RDMA Receives - rpcrdma_reply_handler is responsible only for RPC Replies - the posted send and receive counts both belong in rpcrdma_ep Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-01-02xprtrdma: Yet another double DMA-unmapChuck Lever
While chasing yet another set of DMAR fault reports, I noticed that the frwr recycler conflates whether or not an MR has been DMA unmapped with frwr->fr_state. Actually the two have only an indirect relationship. It's in fact impossible to guess reliably whether the MR has been DMA unmapped based on its fr_state field, especially as the surrounding code and its assumptions have changed over time. A better approach is to track the DMA mapping status explicitly so that the recycler is less brittle to unexpected situations, and attempts to DMA-unmap a second time are prevented. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Cc: stable@vger.kernel.org # v4.20 Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-10-03xprtrdma: Report when there were zero posted ReceivesChuck Lever
To show that a caller did attempt to allocate and post more Receive buffers, the trace point in rpcrdma_post_recvs() should report when rpcrdma_post_recvs() was invoked but no new Receive buffers were posted. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-10-03xprtrdma: Move rb_flags initializationChuck Lever
Clean up: rb_flags might be used for other things besides RPCRDMA_BUF_F_EMPTY_SCQ, so initialize it in a generic spot instead of in a send-completion-queue-related helper. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-10-03xprtrdma: Remove memory address of "ep" from an error messageChuck Lever
Clean up: Replace the hashed memory address of the target rpcrdma_ep with the server's IP address and port. The server address is more useful in an administrative error message. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-10-03xprtrdma: Rename rpcrdma_qp_async_error_upcallChuck Lever
Clean up: Use a function name that is consistent with the RDMA core API and with other consumers. Because this is a function that is invoked from outside the rpcrdma.ko module, add an appropriate documenting comment. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-10-03xprtrdma: Simplify RPC wake-ups on connectChuck Lever
Currently, when a connection is established, rpcrdma_conn_upcall invokes rpcrdma_conn_func and then wake_up_all(&ep->rep_connect_wait). The former wakes waiting RPCs, but the connect worker is not done yet, and that leads to races, double wakes, and difficulty understanding how this logic is supposed to work. Instead, collect all the "connection established" logic in the connect worker (xprt_rdma_connect_worker). A disconnect worker is retained to handle provider upcalls safely. Fixes: 254f91e2fa1f ("xprtrdma: RPC/RDMA must invoke ... ") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-10-03xprtrdma: Re-organize the switch() in rpcrdma_conn_upcallChuck Lever
Clean up: Eliminate the FALLTHROUGH into the default arm to make the switch easier to understand. Also, as long as I'm here, do not display the memory address of the target rpcrdma_ep. A hashed memory address is of marginal use here. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-10-02xprtrdma: Eliminate "connstate" variable from rpcrdma_conn_upcall()Chuck Lever
Clean up. Since commit 173b8f49b3af ("xprtrdma: Demote "connect" log messages") there has been no need to initialize connstat to zero. In fact, in this code path there's now no reason not to set rep_connected directly. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-10-02xprtrdma: Conventional variable names in rpcrdma_conn_upcallChuck Lever
Clean up: The convention throughout other parts of xprtrdma is to name variables of type struct rpcrdma_xprt "r_xprt", not "xprt". This convention enables the use of the name "xprt" for a "struct rpc_xprt" type variable, as in other parts of the RPC client. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-10-02xprtrdma: Rename rpcrdma_conn_upcallChuck Lever
Clean up: Use a function name that is consistent with the RDMA core API and with other consumers. Because this is a function that is invoked from outside the rpcrdma.ko module, add an appropriate documenting comment. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-10-02xprtrdma: Name MR trace events consistentlyChuck Lever
Clean up the names of trace events related to MRs so that it's easy to enable these with a glob. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-10-02xprtrdma: Explicitly resetting MRs is no longer necessaryChuck Lever
When a memory operation fails, the MR's driver state might not match its hardware state. The only reliable recourse is to dereg the MR. This is done in ->ro_recover_mr, which then attempts to allocate a fresh MR to replace the released MR. Since commit e2ac236c0b651 ("xprtrdma: Allocate MRs on demand"), xprtrdma dynamically allocates MRs. It can add more MRs whenever they are needed. That makes it possible to simply release an MR when a memory operation fails, instead of "recovering" it. It will automatically be replaced by the on-demand MR allocator. This commit is a little larger than I wanted, but it replaces ->ro_recover_mr, rb_recovery_lock, rb_recovery_worker, and the rb_stale_mrs list with a generic work queue. Since MRs are no longer orphaned, the mrs_orphaned metric is no longer used. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-10-02xprtrdma: Create more MRs at a timeChuck Lever
Some devices require more than 3 MRs to build a single 1MB I/O. Ensure that rpcrdma_mrs_create() will add enough MRs to build that I/O. In a subsequent patch I'm changing the MR recovery logic to just toss out the MRs. In that case it's possible for ->send_request to loop acquiring some MRs, not getting enough, getting called again, recycling the previous MRs, then not getting enough, lather rinse repeat. Thus first we need to ensure enough MRs are created to prevent that loop. I'm "reusing" ia->ri_max_segs. All of its accessors seem to want the maximum number of data segments plus two, so I'm going to bake that into the initial calculation. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-08-23Merge tag 'nfs-for-4.19-1' of git://git.linux-nfs.org/projects/anna/linux-nfsLinus Torvalds
Pull NFS client updates from Anna Schumaker: "These patches include adding async support for the v4.2 COPY operation. I think Bruce is planning to send the server patches for the next release, but I figured we could get the client side out of the way now since it's been in my tree for a while. This shouldn't cause any problems, since the server will still respond with synchronous copies even if the client requests async. Features: - Add support for asynchronous server-side COPY operations Stable bufixes: - Fix an off-by-one in bl_map_stripe() (v3.17+) - NFSv4 client live hangs after live data migration recovery (v4.9+) - xprtrdma: Fix disconnect regression (v4.18+) - Fix locking in pnfs_generic_recover_commit_reqs (v4.14+) - Fix a sleep in atomic context in nfs4_callback_sequence() (v4.9+) Other bugfixes and cleanups: - Optimizations and fixes involving NFS v4.1 / pNFS layout handling - Optimize lseek(fd, SEEK_CUR, 0) on directories to avoid locking - Immediately reschedule writeback when the server replies with an error - Fix excessive attribute revalidation in nfs_execute_ok() - Add error checking to nfs_idmap_prepare_message() - Use new vm_fault_t return type - Return a delegation when reclaiming one that the server has recalled - Referrals should inherit proto setting from parents - Make rpc_auth_create_args a const - Improvements to rpc_iostats tracking - Fix a potential reference leak when there is an error processing a callback - Fix rmdir / mkdir / rename nlink accounting - Fix updating inode change attribute - Fix error handling in nfsn4_sp4_select_mode() - Use an appropriate work queue for direct-write completion - Don't busy wait if NFSv4 session draining is interrupted" * tag 'nfs-for-4.19-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (54 commits) pNFS: Remove unwanted optimisation of layoutget pNFS/flexfiles: ff_layout_pg_init_read should exit on error pNFS: Treat RECALLCONFLICT like DELAY... pNFS: When updating the stateid in layoutreturn, also update the recall range NFSv4: Fix a sleep in atomic context in nfs4_callback_sequence() NFSv4: Fix locking in pnfs_generic_recover_commit_reqs NFSv4: Fix a typo in nfs4_init_channel_attrs() NFSv4: Don't busy wait if NFSv4 session draining is interrupted NFS recover from destination server reboot for copies NFS add a simple sync nfs4_proc_commit after async COPY NFS handle COPY ERR_OFFLOAD_NO_REQS NFS send OFFLOAD_CANCEL when COPY killed NFS export nfs4_async_handle_error NFS handle COPY reply CB_OFFLOAD call race NFS add support for asynchronous COPY NFS COPY xdr handle async reply NFS OFFLOAD_CANCEL xdr NFS CB_OFFLOAD xdr NFS: Use an appropriate work queue for direct-write completion NFSv4: Fix error handling in nfs4_sp4_select_mode() ...
2018-08-08xprtrdma: Fix disconnect regressionChuck Lever
I found that injecting disconnects with v4.18-rc resulted in random failures of the multi-threaded git regression test. The root cause appears to be that, after a reconnect, the RPC/RDMA transport is waking pending RPCs before the transport has posted enough Receive buffers to receive the Replies. If a Reply arrives before enough Receive buffers are posted, the connection is dropped. A few connection drops happen in quick succession as the client and server struggle to regain credit synchronization. This regression was introduced with commit 7c8d9e7c8863 ("xprtrdma: Move Receive posting to Receive handler"). The client is supposed to post a single Receive when a connection is established because it's not supposed to send more than one RPC Call before it gets a fresh credit grant in the first RPC Reply [RFC 8166, Section 3.3.3]. Unfortunately there appears to be a longstanding bug in the Linux client's credit accounting mechanism. On connect, it simply dumps all pending RPC Calls onto the new connection. It's possible it has done this ever since the RPC/RDMA transport was added to the kernel ten years ago. Servers have so far been tolerant of this bad behavior. Currently no server implementation ever changes its credit grant over reconnects, and servers always repost enough Receives before connections are fully established. The Linux client implementation used to post a Receive before each of these Calls. This has covered up the flooding send behavior. I could try to correct this old bug so that the client sends exactly one RPC Call and waits for a Reply. Since we are so close to the next merge window, I'm going to instead provide a simple patch to post enough Receives before a reconnect completes (based on the number of credits granted to the previous connection). The spurious disconnects will be gone, but the client will still send multiple RPC Calls immediately after a reconnect. Addressing the latter problem will wait for a merge window because a) I expect it to be a large change requiring lots of testing, and b) obviously the Linux client has interoperated successfully since day zero while still being broken. Fixes: 7c8d9e7c8863 ("xprtrdma: Move Receive posting to ... ") Cc: stable@vger.kernel.org # v4.18+ Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-07-30RDMA, core and ULPs: Declare ib_post_send() and ib_post_recv() arguments constBart Van Assche
Since neither ib_post_send() nor ib_post_recv() modify the data structure their second argument points at, declare that argument const. This change makes it necessary to declare the 'bad_wr' argument const too and also to modify all ULPs that call ib_post_send(), ib_post_recv() or ib_post_srq_recv(). This patch does not change any functionality but makes it possible for the compiler to verify whether the ib_post_(send|recv|srq_recv) really do not modify the posted work request. To make this possible, only one cast had to be introduce that casts away constness, namely in rpcrdma_post_recvs(). The only way I can think of to avoid that cast is to introduce an additional loop in that function or to change the data type of bad_wr from struct ib_recv_wr ** into int (an index that refers to an element in the work request list). However, both approaches would require even more extensive changes than this patch. Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-06-18IB/core: add max_send_sge and max_recv_sge attributesSteve Wise
This patch replaces the ib_device_attr.max_sge with max_send_sge and max_recv_sge. It allows ulps to take advantage of devices that have very different send and recv sge depths. For example cxgb4 has a max_recv_sge of 4, yet a max_send_sge of 16. Splitting out these attributes allows much more efficient use of the SQ for cxgb4 with ulps that use the RDMA_RW API. Consider a large RDMA WRITE that has 16 scattergather entries. With max_sge of 4, the ulp would send 4 WRITE WRs, but with max_sge of 16, it can be done with 1 WRITE WR. Acked-by: Sagi Grimberg <sagi@grimberg.me> Acked-by: Christoph Hellwig <hch@lst.de> Acked-by: Selvin Xavier <selvin.xavier@broadcom.com> Acked-by: Shiraz Saleem <shiraz.saleem@intel.com> Acked-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-06-12Merge tag 'nfs-for-4.18-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds
Pull NFS client updates from Trond Myklebust: "Highlights include: Stable fixes: - Fix a 1-byte stack overflow in nfs_idmap_read_and_verify_message - Fix a hang due to incorrect error returns in rpcrdma_convert_iovs() - Revert an incorrect change to the NFSv4.1 callback channel - Fix a bug in the NFSv4.1 sequence error handling Features and optimisations: - Support for piggybacking a LAYOUTGET operation to the OPEN compound - RDMA performance enhancements to deal with transport congestion - Add proper SPDX tags for NetApp-contributed RDMA source - Do not request delegated file attributes (size+change) from the server - Optimise away a GETATTR in the lookup revalidate code when doing NFSv4 OPEN - Optimise away unnecessary lookups for rename targets - Misc performance improvements when freeing NFSv4 delegations Bugfixes and cleanups: - Try to fail quickly if proto=rdma - Clean up RDMA receive trace points - Fix sillyrename to return the delegation when appropriate - Misc attribute revalidation fixes - Immediately clear the pNFS layout on a file when the server returns ESTALE - Return NFS4ERR_DELAY when delegation/layout recalls fail due to igrab() - Fix the client behaviour on NFS4ERR_SEQ_FALSE_RETRY" * tag 'nfs-for-4.18-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (80 commits) skip LAYOUTRETURN if layout is invalid NFSv4.1: Fix the client behaviour on NFS4ERR_SEQ_FALSE_RETRY NFSv4: Fix a typo in nfs41_sequence_process NFSv4: Revert commit 5f83d86cf531d ("NFSv4.x: Fix wraparound issues..") NFSv4: Return NFS4ERR_DELAY when a layout recall fails due to igrab() NFSv4: Return NFS4ERR_DELAY when a delegation recall fails due to igrab() NFSv4.0: Remove transport protocol name from non-UCS client ID NFSv4.0: Remove cl_ipaddr from non-UCS client ID NFSv4: Fix a compiler warning when CONFIG_NFS_V4_1 is undefined NFS: Filter cache invalidation when holding a delegation NFS: Ignore NFS_INO_REVAL_FORCED in nfs_check_inode_attributes() NFS: Improve caching while holding a delegation NFS: Fix attribute revalidation NFS: fix up nfs_setattr_update_inode NFSv4: Ensure the inode is clean when we set a delegation NFSv4: Ignore NFS_INO_REVAL_FORCED in nfs4_proc_access NFSv4: Don't ask for delegated attributes when adding a hard link NFSv4: Don't ask for delegated attributes when revalidating the inode NFS: Pass the inode down to the getattr() callback NFSv4: Don't request size+change attribute if they are delegated to us ...
2018-06-12Merge tag 'nfsd-4.18' of git://linux-nfs.org/~bfields/linuxLinus Torvalds
Pull nfsd updates from Bruce Fields: "A relatively quiet cycle for nfsd. The largest piece is an RDMA update from Chuck Lever with new trace points, miscellaneous cleanups, and streamlining of the send and receive paths. Other than that, some miscellaneous bugfixes" * tag 'nfsd-4.18' of git://linux-nfs.org/~bfields/linux: (26 commits) nfsd: fix error handling in nfs4_set_delegation() nfsd: fix potential use-after-free in nfsd4_decode_getdeviceinfo Fix 16-byte memory leak in gssp_accept_sec_context_upcall svcrdma: Fix incorrect return value/type in svc_rdma_post_recvs svcrdma: Remove unused svc_rdma_op_ctxt svcrdma: Persistently allocate and DMA-map Send buffers svcrdma: Simplify svc_rdma_send() svcrdma: Remove post_send_wr svcrdma: Don't overrun the SGE array in svc_rdma_send_ctxt svcrdma: Introduce svc_rdma_send_ctxt svcrdma: Clean up Send SGE accounting svcrdma: Refactor svc_rdma_dma_map_buf svcrdma: Allocate recv_ctxt's on CPU handling Receives svcrdma: Persistently allocate and DMA-map Receive buffers svcrdma: Preserve Receive buffer until svc_rdma_sendto svcrdma: Simplify svc_rdma_recv_ctxt_put svcrdma: Remove sc_rq_depth svcrdma: Introduce svc_rdma_recv_ctxt svcrdma: Trace key RDMA API events svcrdma: Trace key RPC/RDMA protocol events ...
2018-06-04Merge tag 'nfs-rdma-for-4.18-1' of ↵Trond Myklebust
git://git.linux-nfs.org/projects/anna/linux-nfs NFS-over-RDMA client updates for Linux 4.18 Stable patches: - xprtrdma: Return -ENOBUFS when no pages are available New features: - Add ->alloc_slot() and ->free_slot() functions Bugfixes and cleanups: - Add missing SPDX tags to some files - Try to fail mount quickly if client has no RDMA devices - Create transport IDs in the correct network namespace - Fix max_send_wr computation - Clean up receive tracepoints - Refactor receive handling - Remove unused functions
2018-06-01xprtrdma: Wait on empty sendctx queueChuck Lever
Currently, when the sendctx queue is exhausted during marshaling, the RPC/RDMA transport places the RPC task on the delayq, which forces a wait for HZ >> 2 before the marshal and send is retried. With this change, the transport now places such an RPC task on the pending queue, and wakes it just as soon as more sendctxs become available. This typically takes less than a millisecond, and the write_space waking mechanism is less deadlock-prone. Moreover, the waiting RPC task is holding the transport's write lock, which blocks the transport from sending RPCs. Therefore faster recovery from sendctx queue exhaustion is desirable. Cf. commit 5804891455d5 ("xprtrdma: ->send_request returns -EAGAIN when there are no free MRs"). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-05-11xprtrdma: Prepare RPC/RDMA includes for server-side trace pointsChuck Lever
Clean up: Move #include <trace/events/rpcrdma.h> into source files, similar to how it is done with trace/events/sunrpc.h. Server-side trace points will be part of the rpcrdma subsystem, just like the client-side trace points. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-05-07xprtrdma: Make rpcrdma_sendctx_put_locked() a static functionChuck Lever
Clean up: The only call site is in the same file as the function's definition. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-05-07xprtrdma: Remove rpcrdma_buffer_get_rep_locked()Chuck Lever
Clean up: There is only one remaining call site for this helper. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-05-07xprtrdma: Remove rpcrdma_buffer_get_req_locked()Chuck Lever
Clean up. There is only one call-site for this helper, and it can be simplified by using list_first_entry_or_null(). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-05-07xprtrdma: Remove rpcrdma_ep_{post_recv, post_extra_recv}Chuck Lever
Clean up: These functions are no longer used. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-05-07xprtrdma: Move Receive posting to Receive handlerChuck Lever
Receive completion and Reply handling are done by a BOUND workqueue, meaning they run on only one CPU. Posting receives is currently done in the send_request path, which on large systems is typically done on a different CPU than the one handling Receive completions. This results in movement of Receive-related cachelines between the sending and receiving CPUs. More importantly, it means that currently Receives are posted while the transport's write lock is held, which is unnecessary and costly. Finally, allocation of Receive buffers is performed on-demand in the Receive completion handler. This helps guarantee that they are allocated on the same NUMA node as the CPU that handles Receive completions. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-05-07xprtrdma: Clean up Receive trace pointsChuck Lever
For clarity, report the posting and completion of Receive CQEs. Also, the wc->byte_len field contains garbage if wc->status is non-zero, and the vendor error field contains garbage if wc->status is zero. For readability, don't save those fields in those cases. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-05-07xprtrdma: Fix max_send_wr computationChuck Lever
For FRWR, the computation of max_send_wr is split between frwr_op_open and rpcrdma_ep_create, which makes it difficult to tell that the max_send_wr result is currently incorrect if frwr_op_open has to reduce the credit limit to accommodate a small max_qp_wr. This is a problem now that extra WRs are needed for backchannel operations and a drain CQE. So, refactor the computation so that it is all done in ->ro_open, and fix the FRWR version of this computation so that it accommodates HCAs with small max_qp_wr correctly. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-05-07xprtrdma: Create transport's CM ID in the correct network namespaceChuck Lever
Set up RPC/RDMA transport in mount.nfs's network namespace. This passes the correct namespace information to the RDMA core, similar to how RPC sockets are created (see xs_create_sock). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-05-07xprtrdma: Try to fail quickly if proto=rdmaChuck Lever
rdma_resolve_addr(3) says: > This call is used to map a given destination IP address to a > usable RDMA address. The IP to RDMA address mapping is done > using the local routing tables, or via ARP. If this can't be done, there's no local device that can be used to establish an RDMA-capable network path to the remote. In this case, the RDMA CM very quickly posts an RDMA_CM_EVENT_ADDR_ERROR upcall. Currently rpcrdma_conn_upcall() converts RDMA_CM_EVENT_ADDR_ERROR to EHOSTUNREACH. mount.nfs seems to want to retry EHOSTUNREACH forever, thinking that this is a temporary situation. This makes mount.nfs appear to hang if I try to mount with proto=rdma through, say, a conventional Ethernet device. If the admin has specified proto=rdma along with a server IP address that requires a network path that does not support RDMA, instead let's fail with a permanent error. -EPROTONOSUPPORT is returned when NFSv4 or one of its minor versions is not supported. -EPROTO is not (currently) retried by mount.nfs. There are potentially other similar cases where -EPROTO is an appropriate return code. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Olga Kornievskaia <kolga@netapp.com> Tested-by: Anna Schumaker <Anna.Schumaker@netapp.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-05-07xprtrdma: Add proper SPDX tags for NetApp-contributed sourceChuck Lever
Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-05-01xprtrdma: Fix list corruption / DMAR errors during MR recoveryChuck Lever
The ro_release_mr methods check whether mr->mr_list is empty. Therefore, be sure to always use list_del_init when removing an MR linked into a list using that field. Otherwise, when recovering from transport failures or device removal, list corruption can result, or MRs can get mapped or unmapped an odd number of times, resulting in IOMMU-related failures. In general this fix is appropriate back to v4.8. However, code changes since then make it impossible to apply this patch directly to stable kernels. The fix would have to be applied by hand or reworked for kernels earlier than v4.16. Backport guidance -- there are several cases: - When creating an MR, initialize mr_list so that using list_empty on an as-yet-unused MR is safe. - When an MR is being handled by the remote invalidation path, ensure that mr_list is reinitialized when it is removed from rl_registered. - When an MR is being handled by rpcrdma_destroy_mrs, it is removed from mr_all, but it may still be on an rl_registered list. In that case, the MR needs to be removed from that list before being released. - Other cases are covered by using list_del_init in rpcrdma_mr_pop. Fixes: 9d6b04097882 ('xprtrdma: Place registered MWs on a ... ') Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>