path: root/net/sunrpc/xprtrdma
Age | Commit message | Author
2015-11-02 | svcrdma: Add backward direction service for RPC/RDMA transport | Chuck Lever
On NFSv4.1 mount points, the Linux NFS client uses this transport endpoint to receive backward direction calls and route replies back to the NFSv4.1 server. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Acked-by: "J. Bruce Fields" <bfields@fieldses.org> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Tested-By: Devesh Sharma <devesh.sharma@avagotech.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-11-02 | xprtrdma: Handle incoming backward direction RPC calls | Chuck Lever
Introduce a code path in the rpcrdma_reply_handler() to catch incoming backward direction RPC calls and route them to the ULP's backchannel server. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Tested-By: Devesh Sharma <devesh.sharma@avagotech.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-11-02 | xprtrdma: Add support for sending backward direction RPC replies | Chuck Lever
Backward direction RPC replies are sent via the client transport's send_request method, the same way forward direction RPC calls are sent. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Tested-By: Devesh Sharma <devesh.sharma@avagotech.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-11-02 | xprtrdma: Pre-allocate Work Requests for backchannel | Chuck Lever
Pre-allocate extra send and receive Work Requests needed to handle backchannel receives and sends. The transport doesn't know how many extra WRs to pre-allocate until the xprt_setup_backchannel() call, but that's long after the WRs are allocated during forechannel setup. So, use a fixed value for now. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Tested-By: Devesh Sharma <devesh.sharma@avagotech.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-11-02 | xprtrdma: Pre-allocate backward rpc_rqst and send/receive buffers | Chuck Lever
xprtrdma's backward direction send and receive buffers are the same size as the forechannel's inline threshold, and must be pre-registered. The consumer has no control over which receive buffer the adapter chooses to catch an incoming backwards-direction call. Any receive buffer can be used for either a forward reply or a backward call. Thus both types of RPC message must be the same size. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Tested-By: Devesh Sharma <devesh.sharma@avagotech.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-11-02 | xprtrdma: Saving IRQs no longer needed for rb_lock | Chuck Lever
Now that RPC replies are processed in a workqueue, there's no need to disable IRQs when managing send and receive buffers. This saves noticeable overhead per RPC. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Tested-By: Devesh Sharma <devesh.sharma@avagotech.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-11-02 | xprtrdma: Remove reply tasklet | Chuck Lever
Clean up: The reply tasklet is no longer used. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Tested-By: Devesh Sharma <devesh.sharma@avagotech.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-11-02 | xprtrdma: Use workqueue to process RPC/RDMA replies | Chuck Lever
The reply tasklet is fast, but it's single threaded. After reply traffic saturates a single CPU, there's no more reply processing capacity. Replace the tasklet with a workqueue to spread reply handling across all CPUs. This also moves RPC/RDMA reply handling out of the soft IRQ context and into a context that allows sleeps. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Tested-By: Devesh Sharma <devesh.sharma@avagotech.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-11-02 | xprtrdma: Replace send and receive arrays | Chuck Lever
The rb_send_bufs and rb_recv_bufs arrays are used to implement a pair of stacks for keeping track of free rpcrdma_req and rpcrdma_rep structs. Replace those arrays with free lists. To allow more than 512 RPCs in-flight at once, each of these arrays would be larger than a page (assuming 8-byte addresses and 4KB pages). Allowing up to 64K in-flight RPCs (as TCP now does), each buffer array would have to be 128 pages. That's an order-7 allocation. (Not that we're going there.) A list is easier to expand dynamically. Instead of allocating a larger array of pointers and copying the existing pointers to the new array, simply append more buffers to each list. This also makes it simpler to manage receive buffers that might catch backwards-direction calls, or to post receive buffers in bulk to amortize the overhead of ib_post_recv. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Reviewed-by: Devesh Sharma <devesh.sharma@avagotech.com> Tested-By: Devesh Sharma <devesh.sharma@avagotech.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
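The trade-off described in this commit message can be sketched in userspace C. This is only an illustration, not the kernel code: struct buf, struct buf_pool and every function name below are hypothetical. It shows a free list that grows by appending buffers, plus the page arithmetic for a flat pointer array (512 entries of 8 bytes fill one 4KB page; 64K entries need 128 pages, which is 2^7 pages by the usual power-of-two definition of allocation order).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for an rpcrdma_req or rpcrdma_rep buffer. */
struct buf {
    struct buf *next;   /* link used only while the buffer sits on the free list */
    int id;
};

/* Hypothetical pool: an empty free list is a NULL head. */
struct buf_pool {
    struct buf *free;
    size_t nfree;       /* buffers currently on the free list */
    size_t ntotal;      /* buffers ever allocated for this pool */
};

/* Grow the pool by 'count' buffers. Nothing is reallocated or copied. */
int pool_grow(struct buf_pool *pool, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        struct buf *b = malloc(sizeof(*b));
        if (!b)
            return -1;
        b->id = (int)pool->ntotal++;
        b->next = pool->free;
        pool->free = b;
        pool->nfree++;
    }
    return 0;
}

/* Take one buffer off the free list, or NULL if none is available. */
struct buf *pool_get(struct buf_pool *pool)
{
    struct buf *b = pool->free;
    if (b) {
        pool->free = b->next;
        b->next = NULL;
        pool->nfree--;
    }
    return b;
}

/* Return a buffer to the free list. */
void pool_put(struct buf_pool *pool, struct buf *b)
{
    assert(b != NULL);
    b->next = pool->free;
    pool->free = b;
    pool->nfree++;
}

/* Pages needed for a flat array of 'entries' pointers. */
size_t array_pages(size_t entries, size_t ptr_size, size_t page_size)
{
    assert(page_size > 0);
    return (entries * ptr_size + page_size - 1) / page_size;
}

/* Smallest order n such that 2^n pages cover 'pages'. */
unsigned int alloc_order(size_t pages)
{
    unsigned int order = 0;
    while (((size_t)1 << order) < pages)
        order++;
    return order;
}
```

A caller would use pool_grow() once at setup and again whenever more credits are needed, which is the "simply append more buffers" step the message describes; locking is omitted from this sketch.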
2015-11-02 | xprtrdma: Refactor reply handler error handling | Chuck Lever
Clean up: The error cases in rpcrdma_reply_handler() almost never execute. Ensure the compiler places them out of the hot path. No behavior change expected. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Reviewed-by: Devesh Sharma <devesh.sharma@avagotech.com> Tested-By: Devesh Sharma <devesh.sharma@avagotech.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-11-02 | xprtrdma: Prevent loss of completion signals | Chuck Lever
Commit 8301a2c047cc ("xprtrdma: Limit work done by completion handler") was supposed to prevent xprtrdma's upcall handlers from starving other softIRQ work by letting them return to the provider before all CQEs have been polled. The logic assumes the provider will call the upcall handler again immediately if the CQ is re-armed while there are still queued CQEs. This assumption is invalid. The IBTA spec says that after a CQ is armed, the hardware must interrupt only when a new CQE is inserted. xprtrdma can't rely on the provider calling again, even though some providers do. Therefore, leaving CQEs on queue makes sense only when there is another mechanism that ensures all remaining CQEs are consumed in a timely fashion. xprtrdma does not have such a mechanism. If a CQE remains queued, the transport can wait forever to send the next RPC. Finally, move the wcs array back onto the stack to ensure that the poll array is always local to the CPU where the completion upcall is running. Fixes: 8301a2c047cc ("xprtrdma: Limit work done by completion ...") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Reviewed-by: Devesh Sharma <devesh.sharma@avagotech.com> Tested-By: Devesh Sharma <devesh.sharma@avagotech.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-11-02 | xprtrdma: Re-arm after missed events | Chuck Lever
ib_req_notify_cq(IB_CQ_REPORT_MISSED_EVENTS) returns a positive value if WCs were added to a CQ after the last completion upcall but before the CQ has been re-armed. Commit 7f23f6f6e388 ("xprtrmda: Reduce lock contention in completion handlers") assumed that when ib_req_notify_cq() returned a positive RC, the CQ had also been successfully re-armed, making it safe to return control to the provider without losing any completion signals. That is an invalid assumption. Change both completion handlers to continue polling while ib_req_notify_cq() returns a positive value. Fixes: 7f23f6f6e388 ("xprtrmda: Reduce lock contention in ...") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Reviewed-by: Devesh Sharma <devesh.sharma@avagotech.com> Tested-By: Devesh Sharma <devesh.sharma@avagotech.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
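The polling discipline described in this commit and the previous one (drain the CQ completely, re-arm, and keep polling while the re-arm reports missed events) can be sketched with a mock completion queue. This is a userspace illustration under stated assumptions: struct mock_cq, mock_poll(), mock_req_notify() and drain_cq() are hypothetical names, and mock_req_notify() only imitates the positive return value that the message attributes to ib_req_notify_cq() with IB_CQ_REPORT_MISSED_EVENTS.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mock of a completion queue; not the kernel's struct ib_cq. */
struct mock_cq {
    int queued;         /* CQEs waiting to be polled */
    int late_arrivals;  /* CQEs that land after the last poll but before the re-arm */
    bool armed;
};

/* Consume up to 'budget' completions; returns how many were consumed. */
int mock_poll(struct mock_cq *cq, int budget)
{
    assert(budget > 0);
    int n = cq->queued < budget ? cq->queued : budget;
    cq->queued -= n;
    return n;
}

/* Arm the CQ. Returns a positive value if completions are already
 * queued, meaning no new interrupt will announce them. */
int mock_req_notify(struct mock_cq *cq)
{
    cq->queued += cq->late_arrivals;   /* simulate the race window */
    cq->late_arrivals = 0;
    cq->armed = true;
    return cq->queued > 0 ? 1 : 0;
}

/* Completion upcall sketch: never return while CQEs may still be queued. */
int drain_cq(struct mock_cq *cq)
{
    int total = 0;
    int n;

    do {
        /* Poll until the queue is empty rather than stopping at a fixed budget. */
        while ((n = mock_poll(cq, 16)) > 0)
            total += n;
        /* A positive return means events were missed: poll again. */
    } while (mock_req_notify(cq) > 0);

    return total;
}
```

With 40 queued completions and 3 late arrivals, drain_cq() consumes all 43 before returning, so nothing is left stranded on an armed but silent queue.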
2015-11-02 | xprtrdma: Enable swap-on-NFS/RDMA | Chuck Lever
After adding a swapfile on an NFS/RDMA mount and removing the normal swap partition, I was able to push the NFS client well into swap without any issue. I forgot to swapoff the NFS file before rebooting. This pinned the NFS mount and the IB core and provider, causing shutdown to hang. I think this is expected and safe behavior. Probably shutdown scripts should "swapoff -a" before unmounting any filesystems. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Tested-By: Devesh Sharma <devesh.sharma@avagotech.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-11-02 | xprtrdma: don't log warnings for flushed completions | Steve Wise
Unsignaled send WRs can get flushed as part of normal unmount, so don't log them as warnings. Signed-off-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-10-28 | svcrdma: Port to new memory registration API | Sagi Grimberg
Instead of maintaining a fastreg page list, keep an sg table and convert an array of pages to a sg list. Then call ib_map_mr_sg and construct ib_reg_wr. Signed-off-by: Sagi Grimberg <sagig@mellanox.com> Acked-by: Christoph Hellwig <hch@lst.de> Tested-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Selvin Xavier <selvin.xavier@avagotech.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-10-28 | xprtrdma: Port to new memory registration API | Sagi Grimberg
Instead of maintaining a fastreg page list, keep an sg table and convert an array of pages to a sg list. Then call ib_map_mr_sg and construct ib_reg_wr. Signed-off-by: Sagi Grimberg <sagig@mellanox.com> Acked-by: Christoph Hellwig <hch@lst.de> Tested-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Selvin Xavier <selvin.xavier@avagotech.com> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-10-28 | Merge branch 'wr-cleanup' into k.o/for-4.4 | Doug Ledford
2015-10-28 | IB/cma: Add support for network namespaces | Guy Shapiro
Add support for network namespaces in the ib_cma module. This is accomplished by: 1. Adding network namespace parameter for rdma_create_id. This parameter is used to populate the network namespace field in rdma_id_private. rdma_create_id keeps a reference on the network namespace. 2. Using the network namespace from the rdma_id instead of init_net inside of ib_cma, when listening on an ID and when looking for an ID for an incoming request. 3. Decrementing the reference count for the appropriate network namespace when calling rdma_destroy_id. In order to preserve the current behavior init_net is passed when calling from other modules. Signed-off-by: Guy Shapiro <guysh@mellanox.com> Signed-off-by: Haggai Eran <haggaie@mellanox.com> Signed-off-by: Yotam Kenneth <yotamke@mellanox.com> Signed-off-by: Shachar Raindel <raindel@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-10-15 | Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma | Linus Torvalds
Pull rdma updates from Doug Ledford: "We have four batched up patches for the current rc kernel. Two of them are small fixes that are obvious. One of them is larger than I would like for a late stage rc pull, but we found an issue in the namespace lookup code related to RoCE and this works around the issue for now (we allow a lookup with a namespace to succeed on RoCE since RoCE namespaces aren't implemented yet). This will go away in 4.4 when we put in support for namespaces in RoCE devices. The last one is large in terms of lines, but is all legal and no functional changes. Cisco needed to update their files to be more specific about their license. They had intended the files to be dual licensed as GPL/BSD all along, and specified that in their module license tag, but their file headers were not up to par. They contacted all of the contributors to get agreement and then submitted a patch to update the license headers in the files. Summary: - Work around connection namespace lookup bug related to RoCE - Change usnic license to Dual GPL/BSD (was intended to be that way all along, but wasn't clear, permission from contributors was chased down) - Fix an issue between NFSoRDMA and mlx5 that could cause an oops - Fix leak of sendonly multicast groups" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: IB/ipoib: For sendonly join free the multicast group on leave IB/cma: Accept connection without a valid netdev on RoCE xprtrdma: Don't require LOCAL_DMA_LKEY support for fastreg usnic: add missing clauses to BSD license
2015-10-13 | Merge tag 'nfsd-4.3-2' of git://linux-nfs.org/~bfields/linux | Linus Torvalds
Pull nfsd fixes from Bruce Fields: "Two nfsd fixes, one for an RDMA crash, one for a pnfs/block protocol bug" * tag 'nfsd-4.3-2' of git://linux-nfs.org/~bfields/linux: svcrdma: Fix NFS server crash triggered by 1MB NFS WRITE nfsd/blocklayout: accept any minlength
2015-10-12 | svcrdma: Fix NFS server crash triggered by 1MB NFS WRITE | Chuck Lever
Now that the NFS server advertises a maximum payload size of 1MB for RPC/RDMA again, it crashes in svc_process_common() when an NFS client sends a 1MB NFS WRITE on an NFS/RDMA mount. The server has set up a 259 element array of struct page pointers in rq_pages[] for each incoming request. The last element of the array is NULL. When an incoming request has been completely received, rdma_read_complete() attempts to set the starting page of the incoming page vector:

    rqstp->rq_arg.pages = &rqstp->rq_pages[head->hdr_count];

and the page to use for the reply:

    rqstp->rq_respages = &rqstp->rq_arg.pages[page_no];

But the value of page_no has already accounted for head->hdr_count. Thus rq_respages now points past the end of the incoming pages. For NFS WRITE operations smaller than the maximum, this is harmless. But when the NFS WRITE operation is as large as the server's max payload size, rq_respages now points at the last entry in rq_pages, which is NULL.
Fixes: cc9a903d915c ('svcrdma: Change maximum server payload . . .') BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=270 Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Sagi Grimberg <sagig@dev.mellanox.co.il> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Reviewed-by: Shirley Ma <shirley.ma@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
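The double counting described above can be reproduced with a small userspace model. This is a sketch under assumptions, not the kernel code or necessarily the exact fix that was merged: struct fake_rqst, NPAGES and the function names are hypothetical, and the array is shortened to 8 entries instead of 259. It contrasts indexing from the already-offset argument pages with indexing from the base of the page array.

```c
#include <assert.h>
#include <stddef.h>

#define NPAGES 8   /* illustrative only; the commit message describes 259 entries */

/* Hypothetical, stripped-down stand-in for the request fields named above. */
struct fake_rqst {
    void *rq_pages[NPAGES];   /* the last entry stays NULL, as in the description */
    void **arg_pages;         /* plays the role of rq_arg.pages */
    void **rq_respages;
};

/* Buggy form: page_no already includes hdr_count, yet it is applied on
 * top of arg_pages, which itself starts at rq_pages[hdr_count]. */
void set_respages_buggy(struct fake_rqst *r, size_t hdr_count, size_t page_no)
{
    assert(hdr_count + page_no < NPAGES);
    r->arg_pages = &r->rq_pages[hdr_count];
    r->rq_respages = &r->arg_pages[page_no];
}

/* One way to avoid the double count: index from the base of rq_pages[]. */
void set_respages_from_base(struct fake_rqst *r, size_t hdr_count, size_t page_no)
{
    assert(hdr_count <= page_no && page_no < NPAGES);
    r->arg_pages = &r->rq_pages[hdr_count];
    r->rq_respages = &r->rq_pages[page_no];
}

/* Position of rq_respages within rq_pages[]. */
size_t respages_index(struct fake_rqst *r)
{
    return (size_t)(r->rq_respages - r->rq_pages);
}
```

With hdr_count = 1 and page_no = 6 in an 8-entry array, the buggy form lands on index 7, the final NULL entry, while indexing from the base lands on index 6.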
2015-10-09 | Merge tag 'nfsd-4.3-1' of git://linux-nfs.org/~bfields/linux | Linus Torvalds
Pull nfsd bugfix from Bruce Fields: "Just one RDMA bugfix" * tag 'nfsd-4.3-1' of git://linux-nfs.org/~bfields/linux: svcrdma: handle rdma read with a non-zero initial page offset
2015-10-08 | IB: split struct ib_send_wr | Christoph Hellwig
This patch splits up struct ib_send_wr so that all non-trivial verbs use their own structure which embeds struct ib_send_wr. This dramatically shrinks the size of a WR for most common operations:

    sizeof(struct ib_send_wr) (old): 96
    sizeof(struct ib_send_wr): 48
    sizeof(struct ib_rdma_wr): 64
    sizeof(struct ib_atomic_wr): 96
    sizeof(struct ib_ud_wr): 88
    sizeof(struct ib_fast_reg_wr): 88
    sizeof(struct ib_bind_mw_wr): 96
    sizeof(struct ib_sig_handover_wr): 80

And with Sagi's pending MR rework the fast registration WR will also be down to a reasonable size:

    sizeof(struct ib_fastreg_wr): 64

Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> [srp, srpt] Reviewed-by: Chuck Lever <chuck.lever@oracle.com> [sunrpc] Tested-by: Haggai Eran <haggaie@mellanox.com> Tested-by: Sagi Grimberg <sagig@mellanox.com> Tested-by: Steve Wise <swise@opengridcomputing.com>
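The embedding pattern this message describes, where a verb-specific structure wraps a small common work request, can be sketched in userspace C. The struct layouts, field names, opcode value and to_rdma_wr() helper below are hypothetical and far smaller than the real verbs structures; only the technique (embed the base struct, recover the outer struct with container_of) is the point.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical common part shared by every work request. */
struct send_wr {
    struct send_wr *next;
    int opcode;
};

/* Hypothetical RDMA-specific request: embeds the common part and adds
 * only the fields that RDMA operations need. */
struct rdma_wr {
    struct send_wr wr;
    uint64_t remote_addr;
    uint32_t rkey;
};

#define OP_RDMA_WRITE 1   /* illustrative opcode value */

/* Userspace equivalent of the kernel's container_of() idiom. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Recover the enclosing rdma_wr from a pointer to its embedded send_wr. */
struct rdma_wr *to_rdma_wr(struct send_wr *wr)
{
    assert(wr != NULL && wr->opcode == OP_RDMA_WRITE);
    return container_of(wr, struct rdma_wr, wr);
}
```

Code that walks a chain of requests sees only struct send_wr pointers; a handler for a given opcode converts back to the larger type, so simple sends never pay for fields they do not use.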
2015-10-06 | xprtrdma: Don't require LOCAL_DMA_LKEY support for fastreg | Sagi Grimberg
There is no need to require LOCAL_DMA_LKEY support as the PD allocation makes sure that there is a local_dma_lkey. Also correctly set a return value in error path. This caused a NULL pointer dereference in mlx5 which removed the support for LOCAL_DMA_LKEY. Fixes: bb6c96d72879 ("xprtrdma: Replace global lkey with lkey local to PD") Signed-off-by: Sagi Grimberg <sagig@mellanox.com> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Acked-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-10-02 | Merge tag 'nfs-rdma-for-4.3-2' of git://git.linux-nfs.org/projects/anna/nfs-rdma | Trond Myklebust
NFS: NFSoRDMA bugfix Fixes a use-after-free bug. Signed-off-by: Anna Schumaker <Anna.Schumaker@netapp.com>
2015-09-29 | svcrdma: handle rdma read with a non-zero initial page offset | Steve Wise
The server rdma_read_chunk_lcl() and rdma_read_chunk_frmr() functions were not taking into account the initial page_offset when determining the rdma read length. This resulted in a read whose starting address and length exceeded the base/bounds of the frmr. The server gets an async error from the rdma device and kills the connection, and the client then reconnects and resends. This repeats indefinitely, and the application hangs. Most workloads don't tickle this bug apparently, but one test hit it every time: building the linux kernel on a 16 core node with 'make -j 16 O=/mnt/0' where /mnt/0 is a ramdisk mounted via NFSRDMA. This bug seems to only be tripped with devices having small fastreg page list depths. I didn't see it with mlx4, for instance. Fixes: 0bf4828983df ('svcrdma: refactor marshalling logic') Signed-off-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Chuck Lever <chuck.lever@oracle.com> Cc: stable@vger.kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com>
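Why the initial page offset matters can be shown with plain arithmetic. The sketch below is not the server's read-chunk code; pages_spanned() and pages_ignoring_offset() are hypothetical helpers that only count how many pages a transfer touches when it starts part-way into the first page, compared with a count that ignores the offset.

```c
#include <assert.h>
#include <stddef.h>

/* Pages touched by 'len' bytes that start 'page_offset' bytes into the first page. */
size_t pages_spanned(size_t page_offset, size_t len, size_t page_size)
{
    assert(page_size > 0 && page_offset < page_size);
    return (page_offset + len + page_size - 1) / page_size;
}

/* The same count when the starting offset is ignored. */
size_t pages_ignoring_offset(size_t len, size_t page_size)
{
    assert(page_size > 0);
    return (len + page_size - 1) / page_size;
}
```

For an 8192-byte read with 4096-byte pages, a start offset of 100 bytes spans three pages, while ignoring the offset yields two. A registration sized from the smaller figure cannot cover the whole transfer, which is consistent with the bounds error the message reports.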
2015-09-28 | xprtrdma: disconnect and flush cqs before freeing buffers | Steve Wise
Otherwise a FRMR completion can cause a touch-after-free crash. In xprt_rdma_destroy(), call rpcrdma_buffer_destroy() only after calling rpcrdma_ep_destroy(). In rpcrdma_ep_destroy(), disconnect the cm_id first which should flush the qp, then drain the cqs, then destroy the qp, and finally destroy the cqs. Signed-off-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-09-25 | xprtrdma: Replace global lkey with lkey local to PD | Chuck Lever
The core API has changed so that devices that do not have a global DMA lkey automatically create an mr, per-PD, and make that lkey available. The global DMA lkey interface is going away in favor of the per-PD DMA lkey. The per-PD DMA lkey is always available. Convert xprtrdma to use the device's per-PD DMA lkey for regbufs, no matter which memory registration scheme is in use. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Sagi Grimberg <sagig@mellanox.com> Cc: linux-nfs <linux-nfs@vger.kernel.org> Acked-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-09-09 | Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma | Linus Torvalds
Pull infiniband/rdma updates from Doug Ledford: "This is a fairly sizeable set of changes. I've put them through a decent amount of testing prior to sending the pull request due to that. There are still a few fixups that I know are coming, but I wanted to go ahead and get the big, sizable chunk into your hands sooner rather than waiting for those last few fixups. Of note is the fact that this creates what is intended to be a temporary area in the drivers/staging tree specifically for some cleanups and additions that are coming for the RDMA stack. We deprecated two drivers (ipath and amso1100) and are waiting to hear back if we can deprecate another one (ehca). We also put Intel's new hfi1 driver into this area because it needs to be refactored and a transfer library created out of the factored out code, and then it and the qib driver and the soft-roce driver should all be modified to use that library. I expect drivers/staging/rdma to be around for three or four kernel releases and then to go away as all of the work is completed and final deletions of deprecated drivers are done.
Summary of changes for 4.3: - Create drivers/staging/rdma - Move amso1100 driver to staging/rdma and schedule for deletion - Move ipath driver to staging/rdma and schedule for deletion - Add hfi1 driver to staging/rdma and set TODO for move to regular tree - Initial support for namespaces to be used on RDMA devices - Add RoCE GID table handling to the RDMA core caching code - Infrastructure to support handling of devices with differing read and write scatter gather capabilities - Various iSER updates - Kill off unsafe usage of global mr registrations - Update SRP driver - Misc mlx4 driver updates - Support for the mr_alloc verb - Support for a netlink interface between kernel and user space cache daemon to speed path record queries and route resolution - Initial support for safe hot removal of verbs devices" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (136 commits) IB/ipoib: Suppress warning for send only join failures IB/ipoib: Clean up send-only multicast joins IB/srp: Fix possible protection fault IB/core: Move SM class defines from ib_mad.h to ib_smi.h IB/core: Remove unnecessary defines from ib_mad.h IB/hfi1: Add PSM2 user space header to header_install IB/hfi1: Add CSRs for CONFIG_SDMA_VERBOSITY mlx5: Fix incorrect wc pkey_index assignment for GSI messages IB/mlx5: avoid destroying a NULL mr in reg_user_mr error flow IB/uverbs: reject invalid or unknown opcodes IB/cxgb4: Fix if statement in pick_local_ip6adddrs IB/sa: Fix rdma netlink message flags IB/ucma: HW Device hot-removal support IB/mlx4_ib: Disassociate support IB/uverbs: Enable device removal when there are active user space applications IB/uverbs: Explicitly pass ib_dev to uverbs commands IB/uverbs: Fix race between ib_uverbs_open and remove_one IB/uverbs: Fix reference counting usage of event files IB/core: Make ib_dealloc_pd return void IB/srp: Create an insecure all physical rkey only if needed ...
2015-09-07 | Merge tag 'nfs-for-4.3-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs | Linus Torvalds
Pull NFS client updates from Trond Myklebust: "Highlights include: Stable patches: - Fix atomicity of pNFS commit list updates - Fix NFSv4 handling of open(O_CREAT|O_EXCL|O_RDONLY) - nfs_set_pgio_error sometimes misses errors - Fix a thinko in xs_connect() - Fix borkage in _same_data_server_addrs_locked() - Fix a NULL pointer dereference of migration recovery ops for v4.2 client - Don't let the ctime override attribute barriers. - Revert "NFSv4: Remove incorrect check in can_open_delegated()" - Ensure flexfiles pNFS driver updates the inode after write finishes - flexfiles must not pollute the attribute cache with attributes from the DS - Fix a protocol error in layoutreturn - Fix a protocol issue with NFSv4.1 CLOSE stateids Bugfixes + cleanups - pNFS blocks bugfixes from Christoph - Various cleanups from Anna - More fixes for delegation corner cases - Don't fsync twice for O_SYNC/IS_SYNC files - Fix pNFS and flexfiles layoutstats bugs - pnfs/flexfiles: avoid duplicate tracking of mirror data - pnfs: Fix layoutget/layoutreturn/return-on-close serialisation issues - pnfs/flexfiles: error handling retries a layoutget before fallback to MDS Features: - Full support for the OPEN NFS4_CREATE_EXCLUSIVE4_1 mode from Kinglong - More RDMA client transport improvements from Chuck - Removal of the deprecated ib_reg_phys_mr() and ib_rereg_phys_mr() verbs from the SUNRPC, Lustre and core infiniband tree.
- Optimise away the close-to-open getattr if there is no cached data" * tag 'nfs-for-4.3-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (108 commits) NFSv4: Respect the server imposed limit on how many changes we may cache NFSv4: Express delegation limit in units of pages Revert "NFS: Make close(2) asynchronous when closing NFS O_DIRECT files" NFS: Optimise away the close-to-open getattr if there is no cached data NFSv4.1/flexfiles: Clean up ff_layout_write_done_cb/ff_layout_commit_done_cb NFSv4.1/flexfiles: Mark the layout for return in ff_layout_io_track_ds_error() nfs: Remove unneeded checking of the return value from scnprintf nfs: Fix truncated client owner id without proto type NFSv4.1/flexfiles: Mark layout for return if the mirrors are invalid NFSv4.1/flexfiles: RW layouts are valid only if all mirrors are valid NFSv4.1/flexfiles: Fix incorrect usage of pnfs_generic_mark_devid_invalid() NFSv4.1/flexfiles: Fix freeing of mirrors NFSv4.1/pNFS: Don't request a minimal read layout beyond the end of file NFSv4.1/pnfs: Handle LAYOUTGET return values correctly NFSv4.1/pnfs: Don't ask for a read layout for an empty file. NFSv4.1: Fix a protocol issue with CLOSE stateids NFSv4.1/flexfiles: Don't mark the entire deviceid as bad for file errors SUNRPC: Prevent SYN+SYNACK+RST storms SUNRPC: xs_reset_transport must mark the connection as disconnected NFSv4.1/pnfs: Ensure layoutreturn reserves space for the opaque payload ...
2015-08-30 | IB/core: Make ib_dealloc_pd return void | Jason Gunthorpe
The majority of callers never check the return value, and even if they did, they can't do anything about a failure. All possible failure cases represent a bug in the caller, so just WARN_ON inside the function instead. This fixes a few random errors: net/rd/iw.c infinite loops while it fails. (racing with EBUSY?) This also lays the ground work to get rid of error return from the drivers. Most drivers do not error, the few that do are broken since it cannot be handled. Since uverbs can legitimately make use of EBUSY, open code the check. Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-08-30 | svcrdma: limit FRMR page list lengths to device max | Steve Wise
Svcrdma was incorrectly allocating fastreg MRs and page lists using RPCSVC_MAXPAGES, which can exceed the device capabilities. So limit the depth to the minimum of RPCSVC_MAXPAGES and xprt->sc_frmr_pg_list_len. Signed-off-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-08-30 | xprtrdma, svcrdma: Convert to ib_alloc_mr | Sagi Grimberg
Signed-off-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-08-28 | svcrdma: Use max_sge_rd for destination read depths | Steve Wise
Signed-off-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-08-10 | svcrdma: Change maximum server payload back to RPCSVC_MAXPAYLOAD | Chuck Lever
Both commit 0380a3f375 ("svcrdma: Add a separate "max data segs" macro for svcrdma") and commit 7e5be28827bf ("svcrdma: advertise the correct max payload") are incorrect. This commit reverts both changes, restoring the server's maximum payload size to 1MB. Commit 7e5be28827bf based the server's maximum payload on the _client's_ RPCRDMA_MAX_DATA_SEGS value. That was wrong. Commit 0380a3f375 tried to fix this so that the client maximum payload size could be raised without affecting the server, but managed to confuse matters more on the server side. More importantly, limiting the advertised maximum payload size was meant to be a workaround, not the actual fix. We need to revisit https://bugzilla.linux-nfs.org/show_bug.cgi?id=270 A Linux client on a platform with 64KB pages can overrun and crash an x86_64 NFS/RDMA server when the r/wsize is 1MB. An x86/64 Linux client seems to work fine using 1MB reads and writes when the Linux server's maximum payload size is restored to 1MB. BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=270 Fixes: 0380a3f375 ("svcrdma: Add a separate "max data segs" macro") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-08-05 | xprtrdma: take HCA driver refcount at client | Devesh Sharma
This is a rework of the following patch sent almost a year back: http://www.mail-archive.com/linux-rdma%40vger.kernel.org/msg20730.html If someone tries to rmmod the vendor driver while a mount is active, the command remains stuck forever waiting for the destruction of all rdma-cm-ids. In the worst case, the client can crash during shutdown with active mounts. The existing code assumes that ia->ri_id->device cannot change during the lifetime of a transport. xprtrdma does not support the DEVICE_REMOVAL event either. Lifting that assumption and adding support for the DEVICE_REMOVAL event is a long chain of work, and is planned. The community decided that preventing the hang right now is more important than waiting for architectural changes. Thus, this patch introduces a temporary workaround that acquires the HCA driver module reference count during the mount of an nfs-rdma mount point. Signed-off-by: Devesh Sharma <devesh.sharma@avagotech.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Sagi Grimberg <sagig@dev.mellanox.co.il> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-08-05 | xprtrdma: Count RDMA_NOMSG type calls | Chuck Lever
RDMA_NOMSG type calls are less efficient than RDMA_MSG. Count NOMSG calls so administrators can tell if they happen to be used more than expected. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@avagotech.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-08-05 | xprtrdma: Clean up xprt_rdma_print_stats() | Chuck Lever
checkpatch.pl complained about the seq_printf() format string split across lines and the use of %Lu. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@avagotech.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-08-05 | xprtrdma: Fix large NFS SYMLINK calls | Chuck Lever
Repair how rpcrdma_marshal_req() chooses which RDMA message type to use for large non-WRITE operations so that it picks RDMA_NOMSG in the correct situations, and sets up the marshaling logic to SEND only the RPC/RDMA header. Large NFSv2 SYMLINK requests now use RDMA_NOMSG calls. The Linux NFS server XDR decoder for NFSv2 SYMLINK does not handle having the pathname argument arrive in a separate buffer. The decoder could be fixed, but this is simpler and RDMA_NOMSG can be used in a variety of other situations. Ensure that the Linux client continues to use "RDMA_MSG + read list" when sending large NFSv3 SYMLINK requests, which is more efficient than using RDMA_NOMSG. Large NFSv4 CREATE(NF4LNK) requests are changed to use "RDMA_MSG + read list" just like NFSv3 (see Section 5 of RFC 5667). Before, these did not work at all. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@avagotech.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-08-05 | xprtrdma: Fix XDR tail buffer marshalling | Chuck Lever
Currently xprtrdma appends an extra chunk element to the RPC/RDMA read chunk list of each NFSv4 WRITE compound. The extra element contains the final GETATTR operation in the compound. The result is an extra RDMA READ operation to transfer a very short piece of each NFS WRITE compound (typically 16 bytes). This is inefficient. It is also incorrect. The client is sending the trailing GETATTR at the same Position as the preceding WRITE data payload. Whether or not RFC 5667 allows the GETATTR to appear in a read chunk, RFC 5666 requires that these two separate RPC arguments appear at two distinct Positions. It can also be argued that the GETATTR operation is not bulk data, and therefore RFC 5667 forbids its appearance in a read chunk at all. Although RFC 5667 is not precise about when using a read list with NFSv4 COMPOUND is allowed, the intent is that only data arguments not touched by NFS (ie, read and write payloads) are to be sent using RDMA READ or WRITE. The NFS client constructs GETATTR arguments itself, and therefore is required to send the trailing GETATTR operation as additional inline content, not as a data payload. NB: This change is not backwards compatible. Some older servers do not accept inline content following the read list. The Linux NFS server should handle this content correctly as of commit a97c331f9aa9 ("svcrdma: Handle additional inline content"). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@avagotech.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-08-05xprtrdma: Don't provide a reply chunk when expecting a short replyChuck Lever
Currently Linux always offers a reply chunk, even when the reply can be sent inline (i.e., is smaller than 1KB). On the client, registering a memory region can be expensive. A server may choose not to use the reply chunk, wasting the cost of the registration. This is a change only for RPC replies smaller than 1KB which the server constructs in the RPC reply send buffer. Because the elements of the reply must be XDR encoded, a copy-free data transfer has no benefit in this case. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Tested-by: Devesh Sharma <devesh.sharma@avagotech.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-08-05xprtrdma: Always provide a write list when sending NFS READChuck Lever
The client has been setting up a reply chunk for NFS READs that are smaller than the inline threshold. This is not efficient: both the server and client CPUs have to copy the reply's data payload into and out of the memory region that is then transferred via RDMA. Using the write list, the data payload is moved by the device and no extra data copying is necessary. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Devesh Sharma <devesh.sharma@avagotech.com> Reviewed-By: Sagi Grimberg <sagig@mellanox.com> Tested-by: Devesh Sharma <devesh.sharma@avagotech.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
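Taken together with the preceding commit on short replies, the reply-side choice can be sketched as follows. This is a simplified model rather than the transport's actual code: struct expected_reply, plan_reply(), and the enum values are hypothetical names, and treating a reply that exactly fills the inline threshold as inline is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* How the client asks the server to return results (hypothetical names). */
enum reply_placement {
	REPLY_INLINE,      /* no chunk offered; reply arrives in the receive buffer */
	REPLY_WRITE_LIST,  /* data payload placed by the device via the write list */
	REPLY_CHUNK        /* whole reply returned through a reply chunk */
};

/* Hypothetical, simplified view of the reply the client expects. */
struct expected_reply {
	size_t max_len;       /* largest reply the client must be ready for */
	bool is_read_payload; /* reply carries an NFS READ data payload */
};

static enum reply_placement plan_reply(const struct expected_reply *reply,
				       size_t inline_threshold)
{
	assert(reply != NULL);

	/* NFS READ: always offer a write list, even for small reads,
	 * so neither CPU has to copy the payload.
	 */
	if (reply->is_read_payload)
		return REPLY_WRITE_LIST;

	/* Short non-READ reply: skip the memory registration entirely. */
	if (reply->max_len <= inline_threshold)
		return REPLY_INLINE;

	/* Large non-READ reply: offer a reply chunk. */
	return REPLY_CHUNK;
}
```

The READ check comes first on purpose: a small READ would otherwise fall into the inline case, whereas the commit message says the client should always provide a write list for NFS READ.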
2015-08-05xprtrdma: Account for RPC/RDMA header size when deciding to inlineChuck Lever
When the size of the RPC message is near the inline threshold (1KB), the client would allow messages to be sent that were a few bytes too large. When marshaling RPC/RDMA requests, ensure the combined size of RPC/RDMA header and RPC header does not exceed the inline threshold. Endpoints typically reject RPC/RDMA messages that exceed the size of their receive buffers. The two server implementations I test with (Linux and Solaris) use receive buffers that are larger than the client’s inline threshold. Thus so far this has been benign, observed only by code inspection. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Devesh Sharma <devesh.sharma@avagotech.com> Tested-by: Devesh Sharma <devesh.sharma@avagotech.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
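A minimal sketch of the size check this commit describes, assuming the RPC/RDMA header and the inline RPC content share one send buffer bounded by the inline threshold. The function name fits_inline() and its parameters are hypothetical and do not come from the kernel source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Return true when the RPC/RDMA header plus the RPC content can be
 * sent inline without exceeding the inline threshold.
 */
static bool fits_inline(size_t rpcrdma_hdr_len, size_t rpc_len,
			size_t inline_threshold)
{
	assert(inline_threshold > 0);

	/* The header alone already overflows the threshold. */
	if (rpcrdma_hdr_len > inline_threshold)
		return false;

	/* Compare against the room left after the header; written as a
	 * subtraction so the sum cannot wrap around.
	 */
	return rpc_len <= inline_threshold - rpcrdma_hdr_len;
}
```

With a 1024-byte threshold and a 28-byte RPC/RDMA header (four fixed 32-bit fields plus three empty chunk lists), a 1000-byte RPC message passes a check that ignores the header but fails this one, which is the "few bytes too large" case the commit message mentions.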
2015-08-05xprtrdma: Remove logic that constructs RDMA_MSGP type callsChuck Lever
RDMA_MSGP type calls insert a zero pad in the middle of the RPC message to align the RPC request's data payload to the server's alignment preferences. A server can then "page flip" the payload into place to avoid a data copy in certain circumstances. However:
1. The client has to have a priori knowledge of the server's preferred alignment
2. Requests eligible for RDMA_MSGP are requests that are small enough to have been sent inline, and convey a data payload at the _end_ of the RPC message
Today 1. is done with a sysctl, and is a global setting that is copied during mount. Linux does not support CCP to query the server's preferences (RFC 5666, Section 6). A small-ish NFSv3 WRITE might use RDMA_MSGP, but no NFSv4 compound fits bullet 2. Thus the Linux client currently leaves RDMA_MSGP disabled. The Linux server handles RDMA_MSGP, but does not use any special page flipping, so it confers no benefit. Clean up the marshaling code by removing the logic that constructs RDMA_MSGP type calls. This also reduces the maximum send iovec size from four to just two elements. /proc/sys/sunrpc/rdma_inline_write_padding is a kernel API, and thus is left in place. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@avagotech.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-08-05xprtrdma: Clean up rpcrdma_ia_open()Chuck Lever
Untangle the end of rpcrdma_ia_open() by moving DMA MR set-up, which is different for each registration method, to the .ro_open functions. This is refactoring only. No behavior change is expected. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@avagotech.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-08-05xprtrdma: Remove last ib_reg_phys_mr() call siteChuck Lever
All HCA providers have an ib_get_dma_mr() verb. Thus rpcrdma_ia_open() will either grab the device's local_dma_key if one is available, or it will call ib_get_dma_mr(). If ib_get_dma_mr() fails, rpcrdma_ia_open() fails and no transport is created. Therefore execution never reaches the ib_reg_phys_mr() call site in rpcrdma_register_internal(), so it can be removed. The remaining logic in rpcrdma_{de}register_internal() is folded into rpcrdma_{alloc,free}_regbuf(). This is clean up only. No behavior change is expected. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Devesh Sharma <devesh.sharma@avagotech.com> Reviewed-By: Sagi Grimberg <sagig@mellanox.com> Tested-by: Devesh Sharma <devesh.sharma@avagotech.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-08-05xprtrdma: Don't fall back to PHYSICAL memory registrationChuck Lever
PHYSICAL memory registration uses a single rkey for all of the client's memory, thus is insecure. It is still useful in some cases for testing. Retain the ability to select PHYSICAL memory registration capability via /proc/sys/sunrpc/rdma_memreg_strategy, but don't fall back to it if the HCA does not support FRWR or FMR. This means amso1100 no longer works out of the box with NFS/RDMA. When using amso1100 HCAs, set the memreg_strategy sysctl to 6 before performing NFS/RDMA mounts. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@avagotech.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-08-05xprtrdma: Raise maximum payload size to one megabyteChuck Lever
The point of larger rsize and wsize is to reduce the per-byte cost of memory registration and deregistration. Modern HCAs can typically handle a megabyte or more with a single registration operation. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Devesh Sharma <devesh.sharma@avagotech.com> Reviewed-By: Sagi Grimberg <sagig@mellanox.com> Tested-by: Devesh Sharma <devesh.sharma@avagotech.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-08-05xprtrdma: Make xprt_setup_rdma() agnostic to family of server addressChuck Lever
In particular, recognize when an IPv6 connection is bound. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@avagotech.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-07-20svcrdma: Remove svc_rdma_fastreg()Chuck Lever
Commit 0bf4828983df ("svcrdma: refactor marshalling logic") removed the last call site for svc_rdma_fastreg(). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>