summaryrefslogtreecommitdiff
path: root/net/sunrpc/xprtrdma
AgeCommit message (Collapse)Author
2019-10-24xprtrdma: Wake tasks after connect worker failsChuck Lever
Pending tasks are currently never awoken when the connect worker fails. The reason is that XPRT_CONNECTED is always clear after a failure return of rpcrdma_ep_connect, thus the xprt_test_and_clear_connected() check in xprt_rdma_connect_worker() always fails. - xprt_rdma_close always clears XPRT_CONNECTED. - rpcrdma_ep_connect always clears XPRT_CONNECTED. After reviewing the TCP connect worker, it appears that there's no need for extra test_and_set paranoia in xprt_rdma_connect_worker. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-10-24xprtrdma: Pull up sometimesChuck Lever
On some platforms, DMA mapping part of a page is more costly than copying bytes. Restore the pull-up code and use that when we think it's going to be faster. The heuristic for now is to pull-up when the size of the RPC message body fits in the buffer underlying the head iovec. Indeed, not involving the I/O MMU can help the RPC/RDMA transport scale better for tiny I/Os across more RDMA devices. This is because interaction with the I/O MMU is eliminated, as is handling a Send completion, for each of these small I/Os. Without the explicit unmapping, the NIC no longer needs to do a costly internal TLB shoot down for buffers that are just a handful of bytes. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-10-24xprtrdma: Refactor rpcrdma_prepare_msg_sges()Chuck Lever
Refactor: Replace spaghetti with code that makes it plain what needs to be done for each rtype. This makes it easier to add features and optimizations. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-10-24xprtrdma: Move the rpcrdma_sendctx::sc_wr fieldChuck Lever
Clean up: This field is not needed in the Send completion handler, so it can be moved to struct rpcrdma_req to reduce the size of struct rpcrdma_sendctx, and to reduce the amount of memory that is sloshed between the sending process and the Send completion process. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-10-24xprtrdma: Remove rpcrdma_sendctx::sc_deviceChuck Lever
Micro-optimization: Save eight bytes in a frequently allocated structure. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-10-24xprtrdma: Remove rpcrdma_sendctx::sc_xprtChuck Lever
Micro-optimization: Save eight bytes in a frequently allocated structure. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-10-24xprtrdma: Ensure ri_id is stable during MR recyclingChuck Lever
ia->ri_id is replaced during a reconnect. The connect_worker runs with the transport send lock held to prevent ri_id from being dereferenced by the send_request path during this process. Currently, however, there is no guarantee that ia->ri_id is stable in the MR recycling worker, which operates in the background and is not serialized with the connect_worker in any way. But now that Local_Inv completions are being done in process context, we can handle the recycling operation there instead of deferring the recycling work to another process. Because the disconnect path drains all work before allowing tear down to proceed, it is guaranteed that Local Invalidations complete only while the ri_id pointer is stable. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-10-24xprtrdma: Manage MRs in context of a single connectionChuck Lever
MRs are now allocated on demand so we can safely throw them away on disconnect. This way an idle transport can disconnect and it won't pin hardware MR resources. Two additional changes: - Now that all MRs are destroyed on disconnect, there's no need to check during header marshaling if a req has MRs to recycle. Each req is sent only once per connection, and now rl_registered is guaranteed to be empty when rpcrdma_marshal_req is invoked. - Because MRs are now destroyed in a WQ_MEM_RECLAIM context, they also must be allocated in a WQ_MEM_RECLAIM context. This reduces the likelihood that device driver memory allocation will trigger memory reclaim during NFS writeback. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-10-24xprtrdma: Fix MR list handlingChuck Lever
Close some holes introduced by commit 6dc6ec9e04c4 ("xprtrdma: Cache free MRs in each rpcrdma_req") that could result in list corruption. In addition, the result that is tabulated in @count is no longer used, so @count is removed. Fixes: 6dc6ec9e04c4 ("xprtrdma: Cache free MRs in each rpcrdma_req") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-10-24xprtrdma: Close window between waking RPC senders and posting ReceivesChuck Lever
A recent clean up attempted to separate Receive handling and RPC Reply processing, in the name of clean layering. Unfortunately, we can't do this because the Receive Queue has to be refilled _after_ the most recent credit update from the responder is parsed from the transport header, but _before_ we wake up the next RPC sender. That is right in the middle of rpcrdma_reply_handler(). Usually this isn't a problem because current responder implementations don't vary their credit grant. The one exception is when a connection is established: the grant goes from one to a much larger number on the first Receive. The requester MUST post enough Receives right then so that any outstanding requests can be sent without risking RNR and connection loss. Fixes: 6ceea36890a0 ("xprtrdma: Refactor Receive accounting") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-10-24xprtrdma: Initialize rb_credits in one placeChuck Lever
Clean up/code de-duplication. Nit: RPC_CWNDSHIFT is incorrect as the initial value for xprt->cwnd. This mistake does not appear to have operational consequences, since the cwnd value is replaced with a valid value upon the first Receive completion. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-10-24xprtrdma: Connection becomes unstable after a reconnectChuck Lever
This is because xprt_request_get_cong() is allowing more than one RPC Call to be transmitted before the first Receive on the new connection. The first Receive fills the Receive Queue based on the server's credit grant. Before that Receive, there is only a single Receive WR posted because the client doesn't know the server's credit grant. Solution is to clear rq_cong on all outstanding rpc_rqsts when the the cwnd is reset. This is because an RPC/RDMA credit is good for one connection instance only. Fixes: 75891f502f5f ("SUNRPC: Support for congestion control ... ") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-10-24xprtrdma: Add unique trace points for posting Local Invalidate WRsChuck Lever
When adding frwr_unmap_async way back when, I re-used the existing trace_xprtrdma_post_send() trace point to record the return code of ib_post_send. Unfortunately there are some cases where re-using that trace point causes a crash. Instead, construct a trace point specific to posting Local Invalidate WRs that will always be safe to use in that context, and will act as a trace log eye-catcher for Local Invalidation. Fixes: 847568942f93 ("xprtrdma: Remove fr_state") Fixes: d8099feda483 ("xprtrdma: Reduce context switching due ... ") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Bill Baker <bill.baker@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-10-08svcrdma: Improve DMA mapping trace pointsChuck Lever
Capture the total size of Sends, the size of DMA map and the matching DMA unmap to ensure operation is correct. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-09-27Merge tag 'nfsd-5.4' of git://linux-nfs.org/~bfields/linuxLinus Torvalds
Pull nfsd updates from Bruce Fields: "Highlights: - Add a new knfsd file cache, so that we don't have to open and close on each (NFSv2/v3) READ or WRITE. This can speed up read and write in some cases. It also replaces our readahead cache. - Prevent silent data loss on write errors, by treating write errors like server reboots for the purposes of write caching, thus forcing clients to resend their writes. - Tweak the code that allocates sessions to be more forgiving, so that NFSv4.1 mounts are less likely to hang when a server already has a lot of clients. - Eliminate an arbitrary limit on NFSv4 ACL sizes; they should now be limited only by the backend filesystem and the maximum RPC size. - Allow the server to enforce use of the correct kerberos credentials when a client reclaims state after a reboot. And some miscellaneous smaller bugfixes and cleanup" * tag 'nfsd-5.4' of git://linux-nfs.org/~bfields/linux: (34 commits) sunrpc: clean up indentation issue nfsd: fix nfs read eof detection nfsd: Make nfsd_reset_boot_verifier_locked static nfsd: degraded slot-count more gracefully as allocation nears exhaustion. nfsd: handle drc over-allocation gracefully. nfsd: add support for upcall version 2 nfsd: add a "GetVersion" upcall for nfsdcld nfsd: Reset the boot verifier on all write I/O errors nfsd: Don't garbage collect files that might contain write errors nfsd: Support the server resetting the boot verifier nfsd: nfsd_file cache entries should be per net namespace nfsd: eliminate an unnecessary acl size limit Deprecate nfsd fault injection nfsd: remove duplicated include from filecache.c nfsd: Fix the documentation for svcxdr_tmpalloc() nfsd: Fix up some unused variable warnings nfsd: close cached files prior to a REMOVE or RENAME that would replace target nfsd: rip out the raparms cache nfsd: have nfsd_test_lock use the nfsd_file cache nfsd: hook up nfs4_preprocess_stateid_op to the nfsd_file cache ...
2019-09-26Merge tag 'nfs-for-5.4-1' of git://git.linux-nfs.org/projects/anna/linux-nfsLinus Torvalds
Pull NFS client updates from Anna Schumaker: "Stable bugfixes: - Dequeue the request from the receive queue while we're re-encoding # v4.20+ - Fix buffer handling of GSS MIC without slack # 5.1 Features: - Increase xprtrdma maximum transport header and slot table sizes - Add support for nfs4_call_sync() calls using a custom rpc_task_struct - Optimize the default readahead size - Enable pNFS filelayout LAYOUTGET on OPEN Other bugfixes and cleanups: - Fix possible null-pointer dereferences and memory leaks - Various NFS over RDMA cleanups - Various NFS over RDMA comment updates - Don't receive TCP data into a reset request buffer - Don't try to parse incomplete RPC messages - Fix congestion window race with disconnect - Clean up pNFS return-on-close error handling - Fixes for NFS4ERR_OLD_STATEID handling" * tag 'nfs-for-5.4-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (53 commits) pNFS/filelayout: enable LAYOUTGET on OPEN NFS: Optimise the default readahead size NFSv4: Handle NFS4ERR_OLD_STATEID in LOCKU NFSv4: Handle NFS4ERR_OLD_STATEID in CLOSE/OPEN_DOWNGRADE NFSv4: Fix OPEN_DOWNGRADE error handling pNFS: Handle NFS4ERR_OLD_STATEID on layoutreturn by bumping the state seqid NFSv4: Add a helper to increment stateid seqids NFSv4: Handle RPC level errors in LAYOUTRETURN NFSv4: Handle NFS4ERR_DELAY correctly in return-on-close NFSv4: Clean up pNFS return-on-close error handling pNFS: Ensure we do clear the return-on-close layout stateid on fatal errors NFS: remove unused check for negative dentry NFSv3: use nfs_add_or_obtain() to create and reference inodes NFS: Refactor nfs_instantiate() for dentry referencing callers SUNRPC: Fix congestion window race with disconnect SUNRPC: Don't try to parse incomplete RPC messages SUNRPC: Rename xdr_buf_read_netobj to xdr_buf_read_mic SUNRPC: Fix buffer handling of GSS MIC without slack SUNRPC: RPC level errors should always set task->tk_rpc_status SUNRPC: Don't receive TCP data into a request buffer that has been reset ...
2019-08-26xprtrdma: Send Queue size grows after a reconnectChuck Lever
Eli Dorfman reports that after a series of idle disconnects, an RPC/RDMA transport becomes unusable (rdma_create_qp returns -ENOMEM). Problem was tracked down to increasing Send Queue size after each reconnect. The rdma_create_qp() API does not promise to leave its @qp_init_attr parameter unaltered. In fact, some drivers do modify one or more of its fields. Thus our calls to rdma_create_qp must use a fresh copy of ib_qp_init_attr each time. This fix is appropriate for kernels dating back to late 2007, though it will have to be adapted, as the connect code has changed over the years. Reported-by: Eli Dorfman <eli@vastdata.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-08-26xprtrdma: Clear xprt->reestablish_timeout on closeChuck Lever
Ensure that the re-establishment delay does not grow exponentially on each good reconnect. This probably should have been part of commit 675dd90ad093 ("xprtrdma: Modernize ops->connect"). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-08-26xprtrdma: Recycle MRs after disconnectChuck Lever
The optimization done in "xprtrdma: Simplify rpcrdma_mr_pop" was a bit too optimistic. MRs left over after a reconnect still need to be recycled, not added back to the free list, since they could be in flight or actually fully registered. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-08-21xprtrdma: Optimize rpcrdma_post_recvs()Chuck Lever
Micro-optimization: In rpcrdma_post_recvs, since commit e340c2d6ef2a ("xprtrdma: Reduce the doorbell rate (Receive)"), the common case is to return without doing anything. Found with perf. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-08-21xprtrdma: Inline XDR chunk encoder functionsChuck Lever
Micro-optimization: Save the cost of three function calls during transport header encoding. These were "noinline" before to generate more meaningful call stacks during debugging, but this code is now pretty stable. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-08-21xprtrdma: Fix bc_max_slots return valueChuck Lever
For the moment the returned value just happens to be correct because the current backchannel server implementation does not vary the number of credits it offers. The spec does permit this value to change during the lifetime of a connection, however. The actual maximum is fixed for all RPC/RDMA transports, because each transport instance has to pre-allocate the resources for processing BC requests. That's the value that should be returned. Fixes: 7402a4fedc2b ("SUNRPC: Fix up backchannel slot table ... ") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-08-21xprtrdma: Clean up xprt_rdma_set_connect_timeout()Chuck Lever
Clean up: The function name should match the documenting comment. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-08-21xprtrdma: Use an llist to manage free rpcrdma_repsChuck Lever
rpcrdma_rep objects are removed from their free list by only a single thread: the Receive completion handler. Thus that free list can be converted to an llist, where a single-threaded consumer and a multi-threaded producer (rpcrdma_buffer_put) can both access the llist without the need for any serialization. This eliminates spin lock contention between the Receive completion handler and rpcrdma_buffer_get, and makes the rep consumer wait- free. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-08-21xprtrdma: Remove rpcrdma_buffer::rb_mrlockChuck Lever
Clean up: Now that the free list is used sparingly, get rid of the separate spin lock protecting it. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-08-21xprtrdma: Cache free MRs in each rpcrdma_reqChuck Lever
Instead of a globally-contended MR free list, cache MRs in each rpcrdma_req as they are released. This means acquiring and releasing an MR will be lock-free in the common case, even outside the transport send lock. The original idea of per-rpcrdma_req MR free lists was suggested by Shirley Ma <shirley.ma@oracle.com> several years ago. I just now figured out how to make that idea work with on-demand MR allocation. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-08-20xprtrdma: Ensure creating an MR does not trigger FS writebackChuck Lever
Probably would be good to also pass GFP flags to ib_alloc_mr. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-08-20xprtrdma: Move rpcrdma_mr_get out of frwr_mapChuck Lever
Refactor: Retrieve an MR and handle error recovery entirely in rpc_rdma.c, as this is not a device-specific function. Note that since commit 89f90fe1ad8b ("SUNRPC: Allow calls to xprt_transmit() to drain the entire transmit queue"), the xprt_transmit function handles the cond_resched. The transport no longer has to do this itself. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-08-20xprtrdma: Combine rpcrdma_mr_put and rpcrdma_mr_unmap_and_putChuck Lever
Clean up. There is only one remaining rpcrdma_mr_put call site, and it can be directly replaced with unmap_and_put because mr->mr_dir is set to DMA_NONE just before the call. Now all the call sites do a DMA unmap, and we can just rename mr_unmap_and_put to mr_put, which nicely matches mr_get. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-08-20xprtrdma: Simplify rpcrdma_mr_popChuck Lever
Clean up: rpcrdma_mr_pop call sites check if the list is empty first. Let's replace the list_empty with less costly logic. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-08-20xprtrdma: Toggle XPRT_CONGESTED in xprtrdma's slot methodsChuck Lever
Commit 48be539dd44a ("xprtrdma: Introduce ->alloc_slot call-out for xprtrdma") added a separate alloc_slot and free_slot to the RPC/RDMA transport. Later, commit 75891f502f5f ("SUNRPC: Support for congestion control when queuing is enabled") modified the generic alloc/free_slot methods, but neglected the methods in xprtrdma. Found via code review. Fixes: 75891f502f5f ("SUNRPC: Support for congestion control ... ") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-08-20xprtrdma: Rename rpcrdma_buffer::rb_allChuck Lever
Clean up: There are other "all" list heads. For code clarity distinguish this one as for use only for MRs by renaming it. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-08-20xprtrdma: Rename CQE field in Receive trace pointsChuck Lever
Make the field name the same for all trace points that handle pointers to struct rpcrdma_rep. That makes it easy to grep for matching rep points in trace output. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-08-20xprtrdma: Boost maximum transport header sizeChuck Lever
Although I haven't seen any performance results that justify it, I've received several complaints that NFS/RDMA no longer supports a maximum rsize and wsize of 1MB. These days it is somewhat smaller. To simplify the logic that determines whether a chunk list is necessary, the implementation uses a fixed maximum size of the transport header. Currently that maximum size is 256 bytes, one quarter of the default inline threshold size for RPC/RDMA v1. Since commit a78868497c2e ("xprtrdma: Reduce max_frwr_depth"), the size of chunks is also smaller to take advantage of inline page lists in device internal MR data structures. The combination of these two design choices has reduced the maximum NFS rsize and wsize that can be used for most RNIC/HCAs. Increasing the maximum transport header size and the maximum number of RDMA segments it can contain increases the negotiated maximum rsize/wsize on common RNIC/HCAs. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-08-20xprtrdma: Fix calculation of ri_max_segs againChuck Lever
Commit 302d3deb206 ("xprtrdma: Prevent inline overflow") added this calculation back in 2016, but got it wrong. I tested only the lower bound, which is why there is a max_t there. The upper bound should be rounded up too. Now, when using DIV_ROUND_UP, that takes care of the lower bound as well. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-08-20xprtrdma: Update obsolete commentChuck Lever
Comment was made obsolete by commit 8cec3dba76a4 ("xprtrdma: rpcrdma_regbuf alignment"). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-08-20xprtrdma: Refresh the documenting comment in frwr_ops.cChuck Lever
Things have changed since this comment was written. In particular, the reworking of connection closing, on-demand creation of MRs, and the removal of fr_state all mean that deferring MR recovery to frwr_map is no longer needed. The description is obsolete. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-08-19svcrdma: Use llist for managing cache of recv_ctxtsChuck Lever
Use a wait-free mechanism for managing the svc_rdma_recv_ctxts free list. Subsequently, sc_recv_lock can be eliminated. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-08-19svcrdma: Remove svc_rdma_wqChuck Lever
Clean up: the system workqueue will work just as well. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-08-05rdma: Enable ib_alloc_cq to spread work over a device's comp_vectorsChuck Lever
Send and Receive completion is handled on a single CPU selected at the time each Completion Queue is allocated. Typically this is when an initiator instantiates an RDMA transport, or when a target accepts an RDMA connection. Some ULPs cannot open a connection per CPU to spread completion workload across available CPUs and MSI vectors. For such ULPs, provide an API that allows the RDMA core to select a completion vector based on the device's complement of available comp_vecs. ULPs that invoke ib_alloc_cq() with only comp_vector 0 are converted to use the new API so that their completion workloads interfere less with each other. Suggested-by: Håkon Bugge <haakon.bugge@oracle.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Cc: <linux-cifs@vger.kernel.org> Cc: <v9fs-developer@lists.sourceforge.net> Link: https://lore.kernel.org/r/20190729171923.13428.52555.stgit@manet.1015granger.net Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-07-19Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge yet more updates from Andrew Morton: "The rest of MM and a kernel-wide procfs cleanup. Summary of the more significant patches: - Patch series "mm/memory_hotplug: Factor out memory block devicehandling", v3. David Hildenbrand. Some spring-cleaning of the memory hotplug code, notably in drivers/base/memory.c - "mm: thp: fix false negative of shmem vma's THP eligibility". Yang Shi. Fix /proc/pid/smaps output for THP pages used in shmem. - "resource: fix locking in find_next_iomem_res()" + 1. Nadav Amit. Bugfix and speedup for kernel/resource.c - Patch series "mm: Further memory block device cleanups", David Hildenbrand. More spring-cleaning of the memory hotplug code. - Patch series "mm: Sub-section memory hotplug support". Dan Williams. Generalise the memory hotplug code so that pmem can use it more completely. Then remove the hacks from the libnvdimm code which were there to work around the memory-hotplug code's constraints. - "proc/sysctl: add shared variables for range check", Matteo Croce. We have about 250 instances of int zero; ... .extra1 = &zero, in the tree. This is a tree-wide sweep to make all those private "zero"s and "one"s use global variables. Alas, it isn't practical to make those two global integers const" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (38 commits) proc/sysctl: add shared variables for range check mm: migrate: remove unused mode argument mm/sparsemem: cleanup 'section number' data types libnvdimm/pfn: stop padding pmem namespaces to section alignment libnvdimm/pfn: fix fsdax-mode namespace info-block zero-fields mm/devm_memremap_pages: enable sub-section remap mm: document ZONE_DEVICE memory-model implications mm/sparsemem: support sub-section hotplug mm/sparsemem: prepare for sub-section ranges mm: kill is_dev_zone() helper mm/hotplug: kill is_dev_zone() usage in __remove_pages() mm/sparsemem: convert kmalloc_section_memmap() to populate_section_memmap() mm/hotplug: prepare shrink_{zone, pgdat}_span for sub-section removal mm/sparsemem: add helpers track active portions of a section at boot mm/sparsemem: introduce a SECTION_IS_EARLY flag mm/sparsemem: introduce struct mem_section_usage drivers/base/memory.c: get rid of find_memory_block_hinted() mm/memory_hotplug: move and simplify walk_memory_blocks() mm/memory_hotplug: rename walk_memory_range() and pass start+size instead of pfns mm: make register_mem_sect_under_node() static ...
2019-07-18proc/sysctl: add shared variables for range checkMatteo Croce
In the sysctl code the proc_dointvec_minmax() function is often used to validate the user supplied value between an allowed range. This function uses the extra1 and extra2 members from struct ctl_table as minimum and maximum allowed value. On sysctl handler declaration, in every source file there are some readonly variables containing just an integer which address is assigned to the extra1 and extra2 members, so the sysctl range is enforced. The special values 0, 1 and INT_MAX are very often used as range boundary, leading duplication of variables like zero=0, one=1, int_max=INT_MAX in different source files: $ git grep -E '\.extra[12].*&(zero|one|int_max)' |wc -l 248 Add a const int array containing the most commonly used values, some macros to refer more easily to the correct array member, and use them instead of creating a local one for every object file. This is the bloat-o-meter output comparing the old and new binary compiled with the default Fedora config: # scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164) Data old new delta sysctl_vals - 12 +12 __kstrtab_sysctl_vals - 12 +12 max 14 10 -4 int_max 16 - -16 one 68 - -68 zero 128 28 -100 Total: Before=20583249, After=20583085, chg -0.00% [mcroce@redhat.com: tipc: remove two unused variables] Link: http://lkml.kernel.org/r/20190530091952.4108-1-mcroce@redhat.com [akpm@linux-foundation.org: fix net/ipv6/sysctl_net_ipv6.c] [arnd@arndb.de: proc/sysctl: make firmware loader table conditional] Link: http://lkml.kernel.org/r/20190617130014.1713870-1-arnd@arndb.de [akpm@linux-foundation.org: fix fs/eventpoll.c] Link: http://lkml.kernel.org/r/20190430180111.10688-1-mcroce@redhat.com Signed-off-by: Matteo Croce <mcroce@redhat.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Kees Cook <keescook@chromium.org> Reviewed-by: Aaron Tomlin <atomlin@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-18Merge tag 'nfs-for-5.3-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds
Pull NFS client updates from Trond Myklebust: "Highlights include: Stable fixes: - SUNRPC: Ensure bvecs are re-synced when we re-encode the RPC request - Fix an Oops in ff_layout_track_ds_error due to a PTR_ERR() dereference - Revert buggy NFS readdirplus optimisation - NFSv4: Handle the special Linux file open access mode - pnfs: Fix a problem where we gratuitously start doing I/O through the MDS Features: - Allow NFS client to set up multiple TCP connections to the server using a new 'nconnect=X' mount option. Queue length is used to balance load. - Enhance statistics reporting to report on all transports when using multiple connections. - Speed up SUNRPC by removing bh-safe spinlocks - Add a mechanism to allow NFSv4 to request that containers set a unique per-host identifier for when the hostname is not set. - Ensure NFSv4 updates the lease_time after a clientid update Bugfixes and cleanup: - Fix use-after-free in rpcrdma_post_recvs - Fix a memory leak when nfs_match_client() is interrupted - Fix buggy file access checking in NFSv4 open for execute - disable unsupported client side deduplication - Fix spurious client disconnections - Fix occasional RDMA transport deadlock - Various RDMA cleanups - Various tracepoint fixes - Fix the TCP callback channel to guarantee the server can actually send the number of callback requests that was negotiated at mount time" * tag 'nfs-for-5.3-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (68 commits) pnfs/flexfiles: Add tracepoints for detecting pnfs fallback to MDS pnfs: Fix a problem where we gratuitously start doing I/O through the MDS SUNRPC: Optimise transport balancing code SUNRPC: Ensure the bvecs are reset when we re-encode the RPC request pnfs/flexfiles: Fix PTR_ERR() dereferences in ff_layout_track_ds_error NFSv4: Don't use the zero stateid with layoutget SUNRPC: Fix up backchannel slot table accounting SUNRPC: Fix initialisation of struct rpc_xprt_switch SUNRPC: Skip zero-refcount transports SUNRPC: Replace division by multiplication in calculation of queue length NFSv4: Validate the stateid before applying it to state recovery nfs4.0: Refetch lease_time after clientid update nfs4: Rename nfs41_setup_state_renewal nfs4: Make nfs4_proc_get_lease_time available for nfs4.0 nfs: Fix copy-and-paste error in debug message NFS: Replace 16 seq_printf() calls by seq_puts() NFS: Use seq_putc() in nfs_show_stats() Revert "NFS: readdirplus optimization by cache mechanism" (memleak) SUNRPC: Fix transport accounting when caller specifies an rpc_xprt NFS: Record task, client ID, and XID in xdr_status trace points ...
2019-07-18SUNRPC: Fix up backchannel slot table accountingTrond Myklebust
Add a per-transport maximum limit in the socket case, and add helpers to allow the NFSv4 code to discover that limit. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-07-12Merge tag 'nfs-rdma-for-5.3-1' of ↵Trond Myklebust
git://git.linux-nfs.org/projects/anna/linux-nfs NFSoRDMA client updates for 5.3 New features: - Add a way to place MRs back on the free list - Reduce context switching - Add new trace events Bugfixes and cleanups: - Fix a BUG when tracing is enabled with NFSv4.1 - Fix a use-after-free in rpcrdma_post_recvs - Replace use of xdr_stream_pos in rpcrdma_marshal_req - Fix occasional transport deadlock - Fix show_nfs_errors macros, other tracing improvements - Remove RPCRDMA_REQ_F_PENDING and fr_state - Various simplifications and refactors
2019-07-11Merge tag 'scsi-sg' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsiLinus Torvalds
Pull SCSI scatter-gather list updates from James Bottomley: "This topic branch covers a fundamental change in how our sg lists are allocated to make mq more efficient by reducing the size of the preallocated sg list. This necessitates a large number of driver changes because the previous guarantee that if a driver specified SG_ALL as the size of its scatter list, it would get a non-chained list and didn't need to bother with scatterlist iterators is now broken and every driver *must* use scatterlist iterators. This was broken out as a separate topic because we need to convert all the drivers before pulling the trigger and unconverted drivers kept being found, necessitating a rebase" * tag 'scsi-sg' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (21 commits) scsi: core: don't preallocate small SGL in case of NO_SG_CHAIN scsi: lib/sg_pool.c: clear 'first_chunk' in case of no preallocation scsi: core: avoid preallocating big SGL for data scsi: core: avoid preallocating big SGL for protection information scsi: lib/sg_pool.c: improve APIs for allocating sg pool scsi: esp: use sg helper to iterate over scatterlist scsi: NCR5380: use sg helper to iterate over scatterlist scsi: wd33c93: use sg helper to iterate over scatterlist scsi: ppa: use sg helper to iterate over scatterlist scsi: pcmcia: nsp_cs: use sg helper to iterate over scatterlist scsi: imm: use sg helper to iterate over scatterlist scsi: aha152x: use sg helper to iterate over scatterlist scsi: s390: zfcp_fc: use sg helper to iterate over scatterlist scsi: staging: unisys: visorhba: use sg helper to iterate over scatterlist scsi: usb: image: microtek: use sg helper to iterate over scatterlist scsi: pmcraid: use sg helper to iterate over scatterlist scsi: ipr: use sg helper to iterate over scatterlist scsi: mvumi: use sg helper to iterate over scatterlist scsi: lpfc: use sg helper to iterate over scatterlist scsi: advansys: use sg helper to iterate over scatterlist ...
2019-07-09xprtrdma: Modernize ops->connectChuck Lever
Adapt and apply changes that were made to the TCP socket connect code. See the following commits for details on the purpose of these changes: Commit 7196dbb02ea0 ("SUNRPC: Allow changing of the TCP timeout parameters on the fly") Commit 3851f1cdb2b8 ("SUNRPC: Limit the reconnect backoff timer to the max RPC message timeout") Commit 02910177aede ("SUNRPC: Fix reconnection timeouts") Some common transport code is moved to xprt.c to satisfy the code duplication police. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-07-09xprtrdma: Remove rpcrdma_req::rl_bufferChuck Lever
Clean up. There is only one remaining function, rpcrdma_buffer_put(), that uses this field. Its caller can supply a pointer to the correct rpcrdma_buffer, enabling the removal of an 8-byte pointer field from a frequently-allocated shared data structure. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-07-09xprtrdma: Refactor chunk encodingChuck Lever
Clean up. Move the "not present" case into the individual chunk encoders. This improves code organization and readability. The reason for the original organization was to optimize for the case where there there are no chunks. The optimization turned out to be inconsequential, so let's err on the side of code readability. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-07-09xprtrdma: Streamline rpcrdma_post_recvsChuck Lever
rb_lock is contended between rpcrdma_buffer_create, rpcrdma_buffer_put, and rpcrdma_post_recvs. Commit e340c2d6ef2a ("xprtrdma: Reduce the doorbell rate (Receive)") causes rpcrdma_post_recvs to take the rb_lock repeatedly when it determines more Receives are needed. Streamline this code path so it takes the lock just once in most cases to build the Receive chain that is about to be posted. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>