summaryrefslogtreecommitdiff
path: root/net/sunrpc
AgeCommit message (Collapse)Author
2016-03-14xprtrdma: Clean up dprintk format string containing a newlineChuck Lever
Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-03-14xprtrdma: Clean up physical_op_map()Chuck Lever
physical_op_unmap{_sync} don't use mr_nsegs, so don't bother to set it in physical_op_map. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-03-01svcrdma: Use new CQ API for RPC-over-RDMA server send CQsChuck Lever
Calling ib_poll_cq() to sort through WCs during a completion is a common pattern amongst RDMA consumers. Since commit 14d3a3b2498e ("IB: add a proper completion queue abstraction"), WC sorting can be handled by the IB core. By converting to this new API, svcrdma is made a better neighbor to other RDMA consumers, as it allows the core to schedule the delivery of completions more fairly amongst all active consumers. This new API also aims each completion at a function that is specific to the WR's opcode. Thus the ctxt->wr_op field and the switch in process_context is replaced by a set of methods that handle each completion type. Because each ib_cqe carries a pointer to a completion method, the core can now post operations on a consumer's QP, and handle the completions itself. The server's rdma_stat_sq_poll and rdma_stat_sq_prod metrics are no longer updated. As a clean up, the cq_event_handler, the dto_tasklet, and all associated locking is removed, as they are no longer referenced or used. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-03-01svcrdma: Use new CQ API for RPC-over-RDMA server receive CQsChuck Lever
Calling ib_poll_cq() to sort through WCs during a completion is a common pattern amongst RDMA consumers. Since commit 14d3a3b2498e ("IB: add a proper completion queue abstraction"), WC sorting can be handled by the IB core. By converting to this new API, svcrdma is made a better neighbor to other RDMA consumers, as it allows the core to schedule the delivery of completions more fairly amongst all active consumers. Because each ib_cqe carries a pointer to a completion method, the core can now post operations on a consumer's QP, and handle the completions itself. svcrdma receive completions no longer use the dto_tasklet. Each polled Receive WC is now handled individually in soft IRQ context. The server transport's rdma_stat_rq_poll and rdma_stat_rq_prod metrics are no longer updated. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-03-01svcrdma: Remove close_out exit pathChuck Lever
Clean up: close_out is reached only when ctxt == NULL and XPT_CLOSE is already set. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Devesh Sharma <devesh.sharma@broadcom.com> Tested-by: Devesh Sharma <devesh.sharma@broadcom.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-03-01svcrdma: Hook up the logic to return ERR_CHUNKChuck Lever
RFC 5666 Section 4.2 states: > When the peer detects an RPC-over-RDMA header version that it does > not support (currently this document defines only version 1), it > replies with an error code of ERR_VERS, and provides the low and > high inclusive version numbers it does, in fact, support. And: > When other decoding errors are detected in the header or chunks, > either an RPC decode error MAY be returned or the RPC/RDMA error > code ERR_CHUNK MUST be returned. The Linux NFS server does throw ERR_VERS when a client sends it a request whose rdma_version is not "one." But it does not return ERR_CHUNK when a header decoding error occurs. It just drops the request. To improve protocol extensibility, it should reject invalid values in the rdma_proc field instead of treating them all like RDMA_MSG. Otherwise clients can't detect when the server doesn't support new rdma_proc values. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Devesh Sharma <devesh.sharma@broadcom.com> Tested-by: Devesh Sharma <devesh.sharma@broadcom.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-03-01svcrdma: Use correct XID in error repliesChuck Lever
When constructing an error reply, svc_rdma_xdr_encode_error() needs to view the client's request message so it can get the failing request's XID. svc_rdma_xdr_decode_req() is supposed to return a pointer to the client's request header. But if it fails to decode the client's message (and thus an error reply is needed) it does not return the pointer. The server then sends a bogus XID in the error reply. Instead, unconditionally generate the pointer to the client's header in svc_rdma_recvfrom(), and pass that pointer to both functions. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Devesh Sharma <devesh.sharma@broadcom.com> Tested-by: Devesh Sharma <devesh.sharma@broadcom.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-03-01svcrdma: Make RDMA_ERROR messages workChuck Lever
Fix several issues with svc_rdma_send_error(): - Post a receive buffer to replace the one that was consumed by the incoming request - Posting a send should use DMA_TO_DEVICE, not DMA_FROM_DEVICE - No need to put_page _and_ free pages in svc_rdma_put_context - Make sure the sge is set up completely in case the error path goes through svc_rdma_unmap_dma() - Replace the use of ENOSYS, which has a reserved meaning Related fixes in svc_rdma_recvfrom(): - Don't leak the ctxt associated with the incoming request - Don't close the connection after sending an error reply - Let svc_rdma_send_error() figure out the right header error code As a last clean up, move svc_rdma_send_error() to svc_rdma_sendto.c with other similar functions. There is some common logic in these functions that could someday be combined to reduce code duplication. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Devesh Sharma <devesh.sharma@broadcom.com> Tested-by: Devesh Sharma <devesh.sharma@broadcom.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-03-01svcrdma: svc_rdma_post_recv() should close connection on errorChuck Lever
Clean up: Most svc_rdma_post_recv() call sites close the transport connection when a receive cannot be posted. Wrap that in a common helper. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Devesh Sharma <devesh.sharma@broadcom.com> Tested-by: Devesh Sharma <devesh.sharma@broadcom.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-03-01svcrdma: Close connection when a send error occursChuck Lever
Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-03-01nfsd: Lower NFSv4.1 callback message size limitChuck Lever
The maximum size of a backchannel message on RPC-over-RDMA depends on the connection's inline threshold. Today that threshold is typically 1024 bytes, making the maximum message size 996 bytes. The Linux server's CREATE_SESSION operation checks that the size of callback Calls can be as large as 1044 bytes, to accommodate RPCSEC_GSS. Thus CREATE_SESSION fails if a client advertises the true message size maximum of 996 bytes. But the server's backchannel currently does not support RPCSEC_GSS. The actual maximum size it needs is much smaller. It is safe to reduce the limit to enable NFSv4.1 on RDMA backchannel operation. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-03-01svcrdma: Do not send Write chunk XDR pad with inline contentChuck Lever
The NFS server's XDR encoders adds an XDR pad for content in the xdr_buf page list at the beginning of the xdr_buf's tail buffer. On RDMA transports, Write chunks are sent separately and without an XDR pad. If a Write chunk is being sent, strip off the pad in the tail buffer so that inline content following the Write chunk remains XDR-aligned when it is sent to the client. BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=294 Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-03-01svcrdma: Do not write xdr_buf::tail in a Write chunkChuck Lever
When the Linux NFS server writes an odd-length data item into a Write chunk, it finishes with XDR pad bytes. If the data item is smaller than the Write chunk, the pad bytes are written at the end of the data item, but still inside the chunk (ie, in the application's buffer). Since this is direct data placement, that exposes the pad bytes. XDR pad bytes are inserted in order to preserve the XDR alignment of the next XDR data item in an XDR stream. But Write chunks do not appear in the payload XDR stream, and only one data item is allowed in each chunk. Thus XDR padding is not needed in a Write chunk. With NFSv4, the Linux NFS server places the results of any operations that follow an NFSv4 READ or READLINK in the xdr_buf's tail. Those results also should never be sent as a part of a Write chunk. The current logic in send_write_chunks() appears to assume that the xdr_buf's tail contains only pad bytes (ie, NFSv3). The server should write only the contents of the xdr_buf's page list in a Write chunk. If there's more than an XDR pad in the tail, that needs to go inline or in the Reply chunk. BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=294 Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-03-01svcrdma: Find client-provided write and reply chunks once per replyChuck Lever
The client provides the location of Write chunks into which the server writes bulk payload. The client provides these when the Upper Layer Protocol wants direct data placement and the Binding allows it. (For NFS, this is READ and READLINK operations). The client also provides the location of a Reply chunk into which the server writes the non-bulk part of an RPC reply. The client provides this chunk whenever it believes the reply can be larger than its receive buffers. The server then uses the presence of these chunks to determine how it will form its reply message. svc_rdma_sendto() was looking for Write and Reply chunks multiple times for every reply message. It would be more efficient to do it just once. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-02-25Merge tag 'nfsd-4.5-1' of git://linux-nfs.org/~bfields/linuxLinus Torvalds
Pull nfsd bugfix from Bruce Fields: "One fix for a bug that could cause a NULL write past the end of a buffer in case of unusually long writes to some system interfaces used by mountd and other nfs support utilities" * tag 'nfsd-4.5-1' of git://linux-nfs.org/~bfields/linux: sunrpc/cache: fix off-by-one in qword_get()
2016-02-23sunrpc/cache: fix off-by-one in qword_get()Stefan Hajnoczi
The qword_get() function NUL-terminates its output buffer. If the input string is in hex format \xXXXX... and the same length as the output buffer, there is an off-by-one: int qword_get(char **bpp, char *dest, int bufsize) { ... while (len < bufsize) { ... *dest++ = (h << 4) | l; len++; } ... *dest = '\0'; return len; } This patch ensures the NUL terminator doesn't fall outside the output buffer. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-02-22Merge branch 'multipath'Trond Myklebust
* multipath: NFS add callback_ops to nfs4_proc_bind_conn_to_session_callback pnfs/NFSv4.1: Add multipath capabilities to pNFS flexfiles servers over NFSv3 SUNRPC: Allow addition of new transports to a struct rpc_clnt NFSv4.1: nfs4_proc_bind_conn_to_session must iterate over all connections SUNRPC: Make NFS swap work with multipath SUNRPC: Add a helper to apply a function to all the rpc_clnt's transports SUNRPC: Allow caller to specify the transport to use SUNRPC: Use the multipath iterator to assign a transport to each task SUNRPC: Make rpc_clnt store the multipath iterators SUNRPC: Add a structure to track multiple transports SUNRPC: Make freeing of struct xprt rcu-safe SUNRPC: Uninline xprt_get(); It isn't performance critical. SUNRPC: Reorder rpc_task to put waitqueue related info in same cachelines SUNRPC: Remove unused function rpc_task_reset_client
2016-02-18Merge tag 'nfs-rdma-4.5-1' of git://git.linux-nfs.org/projects/anna/nfs-rdmaTrond Myklebust
NFS: NFSoRDMA Client Bugfix This patch fixes a bug where NFS v4.1 callbacks were returning RPC_GARBAGE_ARGS to the server. Signed-off-by: Anna Schumaker <Anna@OcarinaProject.net>
2016-02-17auth_gss: fix panic in gss_pipe_downcall() in fips modeScott Mayhew
On Mon, 15 Feb 2016, Trond Myklebust wrote: > Hi Scott, > > On Mon, Feb 15, 2016 at 2:28 PM, Scott Mayhew <smayhew@redhat.com> wrote: > > md5 is disabled in fips mode, and attempting to import a gss context > > using md5 while in fips mode will result in crypto_alg_mod_lookup() > > returning -ENOENT, which will make its way back up to > > gss_pipe_downcall(), where the BUG() is triggered. Handling the -ENOENT > > allows for a more graceful failure. > > > > Signed-off-by: Scott Mayhew <smayhew@redhat.com> > > --- > > net/sunrpc/auth_gss/auth_gss.c | 3 +++ > > 1 file changed, 3 insertions(+) > > > > diff --git a/net/sunrpc/auth_gss/auth_gss.c b/net/sunrpc/auth_gss/auth_gss.c > > index 799e65b..c30fc3b 100644 > > --- a/net/sunrpc/auth_gss/auth_gss.c > > +++ b/net/sunrpc/auth_gss/auth_gss.c > > @@ -737,6 +737,9 @@ gss_pipe_downcall(struct file *filp, const char __user *src, size_t mlen) > > case -ENOSYS: > > gss_msg->msg.errno = -EAGAIN; > > break; > > + case -ENOENT: > > + gss_msg->msg.errno = -EPROTONOSUPPORT; > > + break; > > default: > > printk(KERN_CRIT "%s: bad return from " > > "gss_fill_context: %zd\n", __func__, err); > > -- > > 2.4.3 > > > > Well debugged, but I unfortunately do have to ask if this patch is > sufficient? In addition to -ENOENT, and -ENOMEM, it looks to me as if > crypto_alg_mod_lookup() can also fail with -EINTR, -ETIMEDOUT, and > -EAGAIN. Don't we also want to handle those? You're right, I was focusing on the panic that I could easily reproduce. I'm still not sure how I could trigger those other conditions. > > In fact, peering into the rats nest that is > gss_import_sec_context_kerberos(), it looks as if that is just a tiny > subset of all the errors that we might run into. Perhaps the right > thing to do here is to get rid of the BUG() (but keep the above > printk) and just return a generic error? That sounds fine to me -- updated patch attached. -Scott >From d54c6b64a107a90a38cab97577de05f9a4625052 Mon Sep 17 00:00:00 2001 From: Scott Mayhew <smayhew@redhat.com> Date: Mon, 15 Feb 2016 15:12:19 -0500 Subject: [PATCH] auth_gss: remove the BUG() from gss_pipe_downcall() Instead return a generic error via gss_msg->msg.errno. None of the errors returned by gss_fill_context() should necessarily trigger a kernel panic. Signed-off-by: Scott Mayhew <smayhew@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-02-17xprtrdma: rpcrdma_bc_receive_call() should init rq_private_buf.lenChuck Lever
Some NFSv4.1 OPEN requests were hanging waiting for the NFS server to finish recalling delegations. Turns out that each NFSv4.1 CB request on RDMA gets a GARBAGE_ARGS reply from the Linux client. Commit 756b9b37cfb2e3dc added a line in bc_svc_process that overwrites the incoming rq_rcv_buf's length with the value in rq_private_buf.len. But rpcrdma_bc_receive_call() does not invoke xprt_complete_bc_request(), thus rq_private_buf.len is not initialized. svc_process_common() is invoked with a zero-length RPC message, and fails. Fixes: 756b9b37cfb2e3dc ('SUNRPC: Fix callback channel') Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-02-05SUNRPC: Allow addition of new transports to a struct rpc_clntTrond Myklebust
Add a function to allow creation and addition of a new transport to an existing rpc_clnt Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-02-05SUNRPC: Make NFS swap work with multipathTrond Myklebust
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-02-05SUNRPC: Add a helper to apply a function to all the rpc_clnt's transportsTrond Myklebust
Add a helper for tasks that require us to apply a function to all the transports in an rpc_clnt. An example of a usecase would be BIND_CONN_TO_SESSION, where we want to send one RPC call down each transport. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-02-05SUNRPC: Allow caller to specify the transport to useTrond Myklebust
This is needed in order to allow the NFSv4.1 backchannel and BIND_CONN_TO_SESSION function to work. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-02-05SUNRPC: Use the multipath iterator to assign a transport to each taskTrond Myklebust
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-02-05SUNRPC: Make rpc_clnt store the multipath iteratorsTrond Myklebust
This is a pre-patch for the RPC multipath code. It sets up the storage in struct rpc_clnt for the multipath code. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-02-05SUNRPC: Add a structure to track multiple transportsTrond Myklebust
In order to support multipathing/trunking we will need the ability to track multiple transports. This patch sets up a basic structure for doing so. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-01-31SUNRPC: Make freeing of struct xprt rcu-safeTrond Myklebust
Have it call kfree_rcu() to ensure that we can use it on rcu-protected lists. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-01-31SUNRPC: Uninline xprt_get(); It isn't performance critical.Trond Myklebust
Also allow callers to pass NULL arguments to xprt_get() and xprt_put(). Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-01-31SUNRPC: Remove unused function rpc_task_reset_clientTrond Myklebust
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-01-27sunrpc: Use skcipher and ahash/shashHerbert Xu
This patch replaces uses of blkcipher with skcipher and the long obsolete hash interface with either shash (for non-SG users) and ahash. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2016-01-23Merge tag 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma Pull rdma updates from Doug Ledford: "Initial roundup of 4.5 merge window patches - Remove usage of ib_query_device and instead store attributes in ib_device struct - Move iopoll out of block and into lib, rename to irqpoll, and use in several places in the rdma stack as our new completion queue polling library mechanism. Update the other block drivers that already used iopoll to use the new mechanism too. - Replace the per-entry GID table locks with a single GID table lock - IPoIB multicast cleanup - Cleanups to the IB MR facility - Add support for 64bit extended IB counters - Fix for netlink oops while parsing RDMA nl messages - RoCEv2 support for the core IB code - mlx4 RoCEv2 support - mlx5 RoCEv2 support - Cross Channel support for mlx5 - Timestamp support for mlx5 - Atomic support for mlx5 - Raw QP support for mlx5 - MAINTAINERS update for mlx4/mlx5 - Misc ocrdma, qib, nes, usNIC, cxgb3, cxgb4, mlx4, mlx5 updates - Add support for remote invalidate to the iSER driver (pushed through the RDMA tree due to dependencies, acknowledged by nab) - Update to NFSoRDMA (pushed through the RDMA tree due to dependencies, acknowledged by Bruce)" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (169 commits) IB/mlx5: Unify CQ create flags check IB/mlx5: Expose Raw Packet QP to user space consumers {IB, net}/mlx5: Move the modify QP operation table to mlx5_ib IB/mlx5: Support setting Ethernet priority for Raw Packet QPs IB/mlx5: Add Raw Packet QP query functionality IB/mlx5: Add create and destroy functionality for Raw Packet QP IB/mlx5: Refactor mlx5_ib_qp to accommodate other QP types IB/mlx5: Allocate a Transport Domain for each ucontext net/mlx5_core: Warn on unsupported events of QP/RQ/SQ net/mlx5_core: Add RQ and SQ event handling net/mlx5_core: Export transport objects IB/mlx5: Expose CQE version to user-space IB/mlx5: Add CQE version 1 support to user QPs and SRQs IB/mlx5: Fix data validation in mlx5_ib_alloc_ucontext IB/sa: Fix netlink local service GFP crash IB/srpt: Remove redundant wc array IB/qib: Improve ipoib UD performance IB/mlx4: Advertise RoCE v2 support IB/mlx4: Create and use another QP1 for RoCEv2 IB/mlx4: Enable send of RoCE QP1 packets with IP/UDP headers ...
2016-01-22wrappers for ->i_mutex accessAl Viro
parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested}, inode_foo(inode) being mutex_foo(&inode->i_mutex). Please, use those for access to ->i_mutex; over the coming cycle ->i_mutex will become rwsem, with ->lookup() done with it held only shared. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-01-19svc_rdma: use local_dma_lkeyChristoph Hellwig
We now alwasy have a per-PD local_dma_lkey available. Make use of that fact in svc_rdma and stop registering our own MR. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Acked-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-01-19svcrdma: Add class for RDMA backwards direction transportChuck Lever
To support the server-side of an NFSv4.1 backchannel on RDMA connections, add a transport class that enables backward direction messages on an existing forward channel connection. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Acked-by: Bruce Fields <bfields@fieldses.org> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-01-19svcrdma: Define maximum number of backchannel requestsChuck Lever
Extra resources for handling backchannel requests have to be pre-allocated when a transport instance is created. Set up additional fields in svcxprt_rdma to track these resources. The max_requests fields are elements of the RPC-over-RDMA protocol, so they should be u32. To ensure that unsigned arithmetic is used everywhere, some other fields in the svcxprt_rdma struct are updated. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Acked-by: Bruce Fields <bfields@fieldses.org> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-01-19svcrdma: Make map_xdr non-staticChuck Lever
Pre-requisite to use map_xdr in the backchannel code. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Acked-by: Bruce Fields <bfields@fieldses.org> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-01-19svcrdma: Remove last two __GFP_NOFAIL call sitesChuck Lever
Clean up. These functions can otherwise fail, so check for page allocation failures too. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Acked-by: Bruce Fields <bfields@fieldses.org> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-01-19svcrdma: Add gfp flags to svc_rdma_post_recv()Chuck Lever
svc_rdma_post_recv() allocates pages for receive buffers on-demand. It uses GFP_KERNEL so the allocator tries hard, and may sleep. But I'm about to add a call to svc_rdma_post_recv() from a function that may not sleep. Since all svc_rdma_post_recv() call sites can tolerate its failure, allow it to fail if the page allocator returns nothing. Longer term, receive buffers, being a finite resource per-connection, should be pre-allocated and re-used. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Acked-by: Bruce Fields <bfields@fieldses.org> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-01-19svcrdma: Remove unused req_map and ctxt kmem_cachesChuck Lever
Clean up. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Acked-by: Bruce Fields <bfields@fieldses.org> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-01-19svcrdma: Improve allocation of struct svc_rdma_req_mapChuck Lever
To ensure this allocation cannot fail and will not sleep, pre-allocate the req_map structures per-connection. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Acked-by: Bruce Fields <bfields@fieldses.org> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-01-19svcrdma: Improve allocation of struct svc_rdma_op_ctxtChuck Lever
When the maximum payload size of NFS READ and WRITE was increased by commit cc9a903d915c ("svcrdma: Change maximum server payload back to RPCSVC_MAXPAYLOAD"), the size of struct svc_rdma_op_ctxt increased to over 6KB (on x86_64). That makes allocating one of these from a kmem_cache more likely to fail in situations when system memory is exhausted. Since I'm about to add a caller where this allocation must always work _and_ it cannot sleep, pre-allocate ctxts for each connection. Another motivation for this change is that NFSv4.x servers are required by specification not to drop NFS requests. Pre-allocating memory resources reduces the likelihood of a drop. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Acked-by: Bruce Fields <bfields@fieldses.org> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-01-19svcrdma: Clean up process_context()Chuck Lever
Be sure the completed ctxt is put in every path. The xprt enqueue can take a while, so put the completed ctxt back in circulation _before_ enqueuing the xprt. Remove/disable debugging. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Acked-by: Bruce Fields <bfields@fieldses.org> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-01-19svcrdma: Clean up rdma_create_xprt()Chuck Lever
kzalloc is used here, so setting the atomic fields to zero is unnecessary. sc_ord is set again in handle_connect_req. The other fields are re-initialized in svc_rdma_accept(). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Acked-by: Bruce Fields <bfields@fieldses.org> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-01-15Merge tag 'nfsd-4.5' of git://linux-nfs.org/~bfields/linuxLinus Torvalds
Pull nfsd updates from Bruce Fields: "Smaller bugfixes and cleanup, including a fix for a failures of kerberized NFSv4.1 mounts, and Scott Mayhew's work addressing ACK storms that can affect some high-availability NFS setups" * tag 'nfsd-4.5' of git://linux-nfs.org/~bfields/linux: nfsd: add new io class tracepoint nfsd: give up on CB_LAYOUTRECALLs after two lease periods nfsd: Fix nfsd leaks sunrpc module references lockd: constify nlmsvc_binding structure lockd: use to_delayed_work nfsd: use to_delayed_work Revert "svcrdma: Do not send XDR roundup bytes for a write chunk" lockd: Register callbacks on the inetaddr_chain and inet6addr_chain nfsd: Register callbacks on the inetaddr_chain and inet6addr_chain sunrpc: Add a function to close temporary transports immediately nfsd: don't base cl_cb_status on stale information nfsd4: fix gss-proxy 4.1 mounts for some AD principals nfsd: fix unlikely NULL deref in mach_creds_match nfsd: minor consolidation of mach_cred handling code nfsd: helper for dup of possibly NULL string svcrpc: move some initialization to common code nfsd: fix a warning message nfsd: constify nfsd4_callback_ops structure nfsd: recover: constify nfsd4_client_tracking_ops structures svcrdma: Do not send XDR roundup bytes for a write chunk
2016-01-15Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge first patch-bomb from Andrew Morton: - A few hotfixes which missed 4.4 becasue I was asleep. cc'ed to -stable - A few misc fixes - OCFS2 updates - Part of MM. Including pretty large changes to page-flags handling and to thp management which have been buffered up for 2-3 cycles now. I have a lot of MM material this time. [ It turns out the THP part wasn't quite ready, so that got dropped from this series - Linus ] * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (117 commits) zsmalloc: reorganize struct size_class to pack 4 bytes hole mm/zbud.c: use list_last_entry() instead of list_tail_entry() zram/zcomp: do not zero out zcomp private pages zram: pass gfp from zcomp frontend to backend zram: try vmalloc() after kmalloc() zram/zcomp: use GFP_NOIO to allocate streams mm: add tracepoint for scanning pages drivers/base/memory.c: fix kernel warning during memory hotplug on ppc64 mm/page_isolation: use macro to judge the alignment mm: fix noisy sparse warning in LIBCFS_ALLOC_PRE() mm: rework virtual memory accounting include/linux/memblock.h: fix ordering of 'flags' argument in comments mm: move lru_to_page to mm_inline.h Documentation/filesystems: describe the shared memory usage/accounting memory-hotplug: don't BUG() in register_memory_resource() hugetlb: make mm and fs code explicitly non-modular mm/swapfile.c: use list_for_each_entry_safe in free_swap_count_continuations mm: /proc/pid/clear_refs: no need to clear VM_SOFTDIRTY in clear_soft_dirty_pmd() mm: make sure isolate_lru_page() is never called for tail page vmstat: make vmstat_updater deferrable again and shut down on idle ...
2016-01-14kmemcg: account certain kmem allocations to memcgVladimir Davydov
Mark those kmem allocations that are known to be easily triggered from userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to memcg. For the list, see below: - threadinfo - task_struct - task_delay_info - pid - cred - mm_struct - vm_area_struct and vm_region (nommu) - anon_vma and anon_vma_chain - signal_struct - sighand_struct - fs_struct - files_struct - fdtable and fdtable->full_fds_bits - dentry and external_name - inode for all filesystems. This is the most tedious part, because most filesystems overwrite the alloc_inode method. The list is far from complete, so feel free to add more objects. Nevertheless, it should be close to "account everything" approach and keep most workloads within bounds. Malevolent users will be able to breach the limit, but this was possible even with the former "account everything" approach (simply because it did not account everything in fact). [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tejun Heo <tj@kernel.org> Cc: Greg Thelen <gthelen@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-07Merge branch 'bugfixes'Trond Myklebust
* bugfixes: SUNRPC: Fixup socket wait for memory SUNRPC: Fix a missing break in rpc_anyaddr() pNFS/flexfiles: Fix an Oopsable typo in ff_mirror_match_fh() NFS: Fix attribute cache revalidation NFS: Ensure we revalidate attributes before using execute_ok() NFS: Flush reclaim writes using FLUSH_COND_STABLE NFS: Background flush should not be low priority NFSv4.1/pnfs: Fixup an lo->plh_block_lgets imbalance in layoutreturn NFSv4: Don't perform cached access checks before we've OPENed the file NFS: Allow the combination pNFS and labeled NFS NFS42: handle layoutstats stateid error nfs: Fix race in __update_open_stateid() nfs: fix missing assignment in nfs4_sequence_done tracepoint
2016-01-07Revert "svcrdma: Do not send XDR roundup bytes for a write chunk"J. Bruce Fields
This reverts commit 6f18dc893981e4daab29221d6a9771f3ce2dd8c5. Just as one example, it appears this code could do the wrong thing in the case of a two-byte NFS READ that crosses a page boundary. Chuck says: "In that case, nfsd would pass down an xdr_buf that has one byte in a page, one byte in another page, and a two-byte XDR pad. The logic introduced by this optimization would be fooled, and neither the second byte nor the XDR pad would be written to the client." Cc: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-01-06SUNRPC: Fixup socket wait for memoryTrond Myklebust
We're seeing hangs in the NFS client code, with loops of the form: RPC: 30317 xmit incomplete (267368 left of 524448) RPC: 30317 call_status (status -11) RPC: 30317 call_transmit (status 0) RPC: 30317 xprt_prepare_transmit RPC: 30317 xprt_transmit(524448) RPC: xs_tcp_send_request(267368) = -11 RPC: 30317 xmit incomplete (267368 left of 524448) RPC: 30317 call_status (status -11) RPC: 30317 call_transmit (status 0) RPC: 30317 xprt_prepare_transmit RPC: 30317 xprt_transmit(524448) Turns out commit ceb5d58b2170 ("net: fix sock_wake_async() rcu protection") moved SOCKWQ_ASYNC_NOSPACE out of sock->flags and into sk->sk_wq->flags, however it never tried to fix up the code in net/sunrpc. The new idiom is to use the flags in the RCU protected struct socket_wq. While we're at it, clear out the now redundant places where we set/clear SOCKWQ_ASYNC_NOSPACE and SOCK_NOSPACE. In principle, sk_stream_wait_memory() is supposed to set these for us, so we only need to clear them in the particular case of our ->write_space() callback. Fixes: ceb5d58b2170 ("net: fix sock_wake_async() rcu protection") Cc: Eric Dumazet <edumazet@google.com> Cc: stable@vger.kernel.org # 4.4 Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>