summaryrefslogtreecommitdiff
path: root/net/sunrpc
AgeCommit message (Collapse)Author
2020-05-11Merge tag 'nfsd-5.7-rc-2' of git://git.linux-nfs.org/projects/cel/cel-2.6Linus Torvalds
Pull nfsd fixes from Chuck Lever: "Resolve a data integrity problem with NFSD that I inadvertently introduced last year. The change I made makes the NFS server's duplicate reply cache ineffective when krb5i or krb5p are in use, thus allowing the replay of non-idempotent NFS requests such as RENAME, SETATTR, or even WRITEs" * tag 'nfsd-5.7-rc-2' of git://git.linux-nfs.org/projects/cel/cel-2.6: SUNRPC: Revert 241b1f419f0e ("SUNRPC: Remove xdr_buf_trim()") SUNRPC: Fix GSS privacy computation of auth->au_ralign SUNRPC: Add "@len" parameter to gss_unwrap()
2020-05-02Merge tag 'nfs-for-5.7-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds
Pull NFS client bugfixes from Trond Myklebust: "Highlights include: Stable fixes: - fix handling of backchannel binding in BIND_CONN_TO_SESSION Bugfixes: - Fix a credential use-after-free issue in pnfs_roc() - Fix potential posix_acl refcnt leak in nfs3_set_acl - defer slow parts of rpc_free_client() to a workqueue - Fix an Oopsable race in __nfs_list_for_each_server() - Fix trace point use-after-free race - Regression: the RDMA client no longer responds to server disconnect requests - Fix return values of xdr_stream_encode_item_{present, absent} - _pnfs_return_layout() must always wait for layoutreturn completion Cleanups: - Remove unreachable error conditions" * tag 'nfs-for-5.7-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: NFS: Fix a race in __nfs_list_for_each_server() NFSv4.1: fix handling of backchannel binding in BIND_CONN_TO_SESSION SUNRPC: defer slow parts of rpc_free_client() to a workqueue. NFSv4: Remove unreachable error condition due to rpc_run_task() SUNRPC: Remove unreachable error condition xprtrdma: Fix use of xdr_stream_encode_item_{present, absent} xprtrdma: Fix trace point use-after-free race xprtrdma: Restore wake-up-all to rpcrdma_cm_event_handler() nfs: Fix potential posix_acl refcnt leak in nfs3_set_acl NFS/pnfs: Fix a credential use-after-free issue in pnfs_roc() NFS/pnfs: Ensure that _pnfs_return_layout() waits for layoutreturn completion
2020-04-28Merge tag 'nfs-rdma-for-5.7-2' of ↵Trond Myklebust
git://git.linux-nfs.org/projects/anna/linux-nfs NFSoRDMA Client Fixes for Linux 5.7 Bugfixes: - Restore wake-up-all to rpcrdma_cm_event_handler() - Otherwise the client won't respond to server disconnect requests - Fix tracepoint use-after-free race - Fix usage of xdr_stream_encode_item_{present, absent} - These functions return a size on success, and not 0 Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-04-28SUNRPC: defer slow parts of rpc_free_client() to a workqueue.NeilBrown
The rpciod workqueue is on the write-out path for freeing dirty memory, so it is important that it never block waiting for memory to be allocated - this can lead to a deadlock. rpc_execute() - which is often called by an rpciod work item - calls rcp_task_release_client() which can lead to rpc_free_client(). rpc_free_client() makes two calls which could potentially block wating for memory allocation. rpc_clnt_debugfs_unregister() calls into debugfs and will block while any of the debugfs files are being accessed. In particular it can block while any of the 'open' methods are being called and all of these use malloc for one thing or another. So this can deadlock if the memory allocation waits for NFS to complete some writes via rpciod. rpc_clnt_remove_pipedir() can take the inode_lock() and while it isn't obvious that memory allocations can happen while the lock it held, it is safer to assume they might and to not let rpciod call rpc_clnt_remove_pipedir(). So this patch moves these two calls (together with the final kfree() and rpciod_down()) into a work-item to be run from the system work-queue. rpciod can continue its important work, and the final stages of the free can happen whenever they happen. I have seen this deadlock on a 4.12 based kernel where debugfs used synchronize_srcu() when removing objects. synchronize_srcu() requires a workqueue and there were no free workther threads and none could be allocated. While debugsfs no longer uses SRCU, I believe the deadlock is still possible. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-04-27SUNRPC: Revert 241b1f419f0e ("SUNRPC: Remove xdr_buf_trim()")Chuck Lever
I've noticed that when krb5i or krb5p security is in use, retransmitted requests are missing the server's duplicate reply cache. The computed checksum on the retransmitted request does not match the cached checksum, resulting in the server performing the retransmitted request again instead of returning the cached reply. The assumptions made when removing xdr_buf_trim() were not correct. In the send paths, the upper layer has already set the segment lengths correctly, and shorting the buffer's content is simply a matter of reducing buf->len. xdr_buf_trim() is the right answer in the receive/unwrap path on both the client and the server. The buffer segment lengths have to be shortened one-by-one. On the server side in particular, head.iov_len needs to be updated correctly to enable nfsd_cache_csum() to work correctly. The simple buf->len computation doesn't do that, and that results in checksumming stale data in the buffer. The problem isn't noticed until there's significant instability of the RPC transport. At that point, the reliability of retransmit detection on the server becomes crucial. Fixes: 241b1f419f0e ("SUNRPC: Remove xdr_buf_trim()") Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-04-27SUNRPC: Fix GSS privacy computation of auth->au_ralignChuck Lever
When the au_ralign field was added to gss_unwrap_resp_priv, the wrong calculation was used. Setting au_rslack == au_ralign is probably correct for kerberos_v1 privacy, but kerberos_v2 privacy adds additional GSS data after the clear text RPC message. au_ralign needs to be smaller than au_rslack in that fairly common case. When xdr_buf_trim() is restored to gss_unwrap_kerberos_v2(), it does exactly what I feared it would: it trims off part of the clear text RPC message. However, that's because rpc_prepare_reply_pages() does not set up the rq_rcv_buf's tail correctly because au_ralign is too large. Fixing the au_ralign computation also corrects the alignment of rq_rcv_buf->pages so that the client does not have to shift reply data payloads after they are received. Fixes: 35e77d21baa0 ("SUNRPC: Add rpc_auth::au_ralign field") Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-04-27SUNRPC: Add "@len" parameter to gss_unwrap()Chuck Lever
Refactor: This is a pre-requisite to fixing the client-side ralign computation in gss_unwrap_resp_priv(). The length value is passed in explicitly rather that as the value of buf->len. This will subsequently allow gss_unwrap_kerberos_v1() to compute a slack and align value, instead of computing it in gss_unwrap_resp_priv(). Fixes: 35e77d21baa0 ("SUNRPC: Add rpc_auth::au_ralign field") Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-04-21SUNRPC: Remove unreachable error conditionXiyu Yang
rpc_clnt_test_and_add_xprt() invokes rpc_call_null_helper(), which return the value of rpc_run_task() to "task". Since rpc_run_task() is impossible to return an ERR pointer, there is no need to add the IS_ERR() condition on "task" here. So we need to remove it. Fixes: 7f554890587c ("SUNRPC: Allow addition of new transports to a struct rpc_clnt") Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn> Signed-off-by: Xin Tan <tanxin.ctf@gmail.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-04-20xprtrdma: Fix use of xdr_stream_encode_item_{present, absent}Chuck Lever
These new helpers do not return 0 on success, they return the encoded size. Thus they are not a drop-in replacement for the old helpers. Fixes: 5c266df52701 ("SUNRPC: Add encoders for list item discriminators") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-04-20xprtrdma: Fix trace point use-after-free raceChuck Lever
It's not safe to use resources pointed to by the @send_wr of ib_post_send() _after_ that function returns. Those resources are typically freed by the Send completion handler, which can run before ib_post_send() returns. Thus the trace points currently around ib_post_send() in the client's RPC/RDMA transport are a hazard, even when they are disabled. Rearrange them so that they touch the Work Request only _before_ ib_post_send() is invoked. Fixes: ab03eff58eb5 ("xprtrdma: Add trace points in RPC Call transmit paths") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-04-20xprtrdma: Restore wake-up-all to rpcrdma_cm_event_handler()Chuck Lever
Commit e28ce90083f0 ("xprtrdma: kmalloc rpcrdma_ep separate from rpcrdma_xprt") erroneously removed a xprt_force_disconnect() call from the "transport disconnect" path. The result was that the client no longer responded to server-side disconnect requests. Restore that call. Fixes: e28ce90083f0 ("xprtrdma: kmalloc rpcrdma_ep separate from rpcrdma_xprt") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-04-17svcrdma: Fix leak of svc_rdma_recv_ctxt objectsChuck Lever
Utilize the xpo_release_rqst transport method to ensure that each rqstp's svc_rdma_recv_ctxt object is released even when the server cannot return a Reply for that rqstp. Without this fix, each RPC whose Reply cannot be sent leaks one svc_rdma_recv_ctxt. This is a 2.5KB structure, a 4KB DMA-mapped Receive buffer, and any pages that might be part of the Reply message. The leak is infrequent unless the network fabric is unreliable or Kerberos is in use, as GSS sequence window overruns, which result in connection loss, are more common on fast transports. Fixes: 3a88092ee319 ("svcrdma: Preserve Receive buffer until svc_rdma_sendto") Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-04-17svcrdma: Fix trace point use-after-free raceChuck Lever
I hit this while testing nfsd-5.7 with kernel memory debugging enabled on my server: Mar 30 13:21:45 klimt kernel: BUG: unable to handle page fault for address: ffff8887e6c279a8 Mar 30 13:21:45 klimt kernel: #PF: supervisor read access in kernel mode Mar 30 13:21:45 klimt kernel: #PF: error_code(0x0000) - not-present page Mar 30 13:21:45 klimt kernel: PGD 3601067 P4D 3601067 PUD 87c519067 PMD 87c3e2067 PTE 800ffff8193d8060 Mar 30 13:21:45 klimt kernel: Oops: 0000 [#1] SMP DEBUG_PAGEALLOC PTI Mar 30 13:21:45 klimt kernel: CPU: 2 PID: 1933 Comm: nfsd Not tainted 5.6.0-rc6-00040-g881e87a3c6f9 #1591 Mar 30 13:21:45 klimt kernel: Hardware name: Supermicro Super Server/X10SRL-F, BIOS 1.0c 09/09/2015 Mar 30 13:21:45 klimt kernel: RIP: 0010:svc_rdma_post_chunk_ctxt+0xab/0x284 [rpcrdma] Mar 30 13:21:45 klimt kernel: Code: c1 83 34 02 00 00 29 d0 85 c0 7e 72 48 8b bb a0 02 00 00 48 8d 54 24 08 4c 89 e6 48 8b 07 48 8b 40 20 e8 5a 5c 2b e1 41 89 c6 <8b> 45 20 89 44 24 04 8b 05 02 e9 01 00 85 c0 7e 33 e9 5e 01 00 00 Mar 30 13:21:45 klimt kernel: RSP: 0018:ffffc90000dfbdd8 EFLAGS: 00010286 Mar 30 13:21:45 klimt kernel: RAX: 0000000000000000 RBX: ffff8887db8db400 RCX: 0000000000000030 Mar 30 13:21:45 klimt kernel: RDX: 0000000000000040 RSI: 0000000000000000 RDI: 0000000000000246 Mar 30 13:21:45 klimt kernel: RBP: ffff8887e6c27988 R08: 0000000000000000 R09: 0000000000000004 Mar 30 13:21:45 klimt kernel: R10: ffffc90000dfbdd8 R11: 00c068ef00000000 R12: ffff8887eb4e4a80 Mar 30 13:21:45 klimt kernel: R13: ffff8887db8db634 R14: 0000000000000000 R15: ffff8887fc931000 Mar 30 13:21:45 klimt kernel: FS: 0000000000000000(0000) GS:ffff88885bd00000(0000) knlGS:0000000000000000 Mar 30 13:21:45 klimt kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Mar 30 13:21:45 klimt kernel: CR2: ffff8887e6c279a8 CR3: 000000081b72e002 CR4: 00000000001606e0 Mar 30 13:21:45 klimt kernel: Call Trace: Mar 30 13:21:45 klimt kernel: ? svc_rdma_vec_to_sg+0x7f/0x7f [rpcrdma] Mar 30 13:21:45 klimt kernel: svc_rdma_send_write_chunk+0x59/0xce [rpcrdma] Mar 30 13:21:45 klimt kernel: svc_rdma_sendto+0xf9/0x3ae [rpcrdma] Mar 30 13:21:45 klimt kernel: ? nfsd_destroy+0x51/0x51 [nfsd] Mar 30 13:21:45 klimt kernel: svc_send+0x105/0x1e3 [sunrpc] Mar 30 13:21:45 klimt kernel: nfsd+0xf2/0x149 [nfsd] Mar 30 13:21:45 klimt kernel: kthread+0xf6/0xfb Mar 30 13:21:45 klimt kernel: ? kthread_queue_delayed_work+0x74/0x74 Mar 30 13:21:45 klimt kernel: ret_from_fork+0x3a/0x50 Mar 30 13:21:45 klimt kernel: Modules linked in: ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue ib_umad ib_ipoib mlx4_ib sb_edac x86_pkg_temp_thermal iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel glue_helper crypto_simd cryptd pcspkr rpcrdma i2c_i801 rdma_ucm lpc_ich mfd_core ib_iser rdma_cm iw_cm ib_cm mei_me raid0 libiscsi mei sg scsi_transport_iscsi ioatdma wmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter nfsd nfs_acl lockd auth_rpcgss grace sunrpc ip_tables xfs libcrc32c mlx4_en sd_mod sr_mod cdrom mlx4_core crc32c_intel igb nvme i2c_algo_bit ahci i2c_core libahci nvme_core dca libata t10_pi qedr dm_mirror dm_region_hash dm_log dm_mod dax qede qed crc8 ib_uverbs ib_core Mar 30 13:21:45 klimt kernel: CR2: ffff8887e6c279a8 Mar 30 13:21:45 klimt kernel: ---[ end trace 87971d2ad3429424 ]--- It's absolutely not safe to use resources pointed to by the @send_wr argument of ib_post_send() _after_ that function returns. Those resources are typically freed by the Send completion handler, which can run before ib_post_send() returns. Thus the trace points currently around ib_post_send() in the server's RPC/RDMA transport are a hazard, even when they are disabled. Rearrange them so that they touch the Work Request only _before_ ib_post_send() is invoked. Fixes: bd2abef33394 ("svcrdma: Trace key RDMA API events") Fixes: 4201c7464753 ("svcrdma: Introduce svc_rdma_send_ctxt") Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-04-17SUNRPC: Fix backchannel RPC soft lockupsChuck Lever
Currently, after the forward channel connection goes away, backchannel operations are causing soft lockups on the server because call_transmit_status's SOFTCONN logic ignores ENOTCONN. Such backchannel Calls are aggressively retried until the client reconnects. Backchannel Calls should use RPC_TASK_NOCONNECT rather than RPC_TASK_SOFTCONN. If there is no forward connection, the server is not capable of establishing a connection back to the client, thus that backchannel request should fail before the server attempts to send it. Commit 58255a4e3ce5 ("NFSD: NFSv4 callback client should use RPC_TASK_SOFTCONN") was merged several years before RPC_TASK_NOCONNECT was available. Because setup_callback_client() explicitly sets NOPING, the NFSv4.0 callback connection depends on the first callback RPC to initiate a connection to the client. Thus NFSv4.0 needs to continue to use RPC_TASK_SOFTCONN. Suggested-by: Trond Myklebust <trondmy@hammerspace.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Cc: <stable@vger.kernel.org> # v4.20+
2020-04-13SUNRPC/cache: Fix unsafe traverse caused double-free in cache_purgeYihao Wu
Deleting list entry within hlist_for_each_entry_safe is not safe unless next pointer (tmp) is protected too. It's not, because once hash_lock is released, cache_clean may delete the entry that tmp points to. Then cache_purge can walk to a deleted entry and tries to double free it. Fix this bug by holding only the deleted entry's reference. Suggested-by: NeilBrown <neilb@suse.de> Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com> Reviewed-by: NeilBrown <neilb@suse.de> [ cel: removed unused variable ] Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-04-07Merge tag 'nfs-for-5.7-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds
Pull NFS client updates from Trond Myklebust: "Highlights include: Stable fixes: - Fix a page leak in nfs_destroy_unlinked_subrequests() - Fix use-after-free issues in nfs_pageio_add_request() - Fix new mount code constant_table array definitions - finish_automount() requires us to hold 2 refs to the mount record Features: - Improve the accuracy of telldir/seekdir by using 64-bit cookies when possible. - Allow one RDMA active connection and several zombie connections to prevent blocking if the remote server is unresponsive. - Limit the size of the NFS access cache by default - Reduce the number of references to credentials that are taken by NFS - pNFS files and flexfiles drivers now support per-layout segment COMMIT lists. - Enable partial-file layout segments in the pNFS/flexfiles driver. - Add support for CB_RECALL_ANY to the pNFS flexfiles layout type - pNFS/flexfiles Report NFS4ERR_DELAY and NFS4ERR_GRACE errors from the DS using the layouterror mechanism. Bugfixes and cleanups: - SUNRPC: Fix krb5p regressions - Don't specify NFS version in "UDP not supported" error - nfsroot: set tcp as the default transport protocol - pnfs: Return valid stateids in nfs_layout_find_inode_by_stateid() - alloc_nfs_open_context() must use the file cred when available - Fix locking when dereferencing the delegation cred - Fix memory leaks in O_DIRECT when nfs_get_lock_context() fails - Various clean ups of the NFS O_DIRECT commit code - Clean up RDMA connect/disconnect - Replace zero-length arrays with C99-style flexible arrays" * tag 'nfs-for-5.7-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (86 commits) NFS: Clean up process of marking inode stale. SUNRPC: Don't start a timer on an already queued rpc task NFS/pnfs: Reference the layout cred in pnfs_prepare_layoutreturn() NFS/pnfs: Fix dereference of layout cred in pnfs_layoutcommit_inode() NFS: Beware when dereferencing the delegation cred NFS: Add a module parameter to set nfs_mountpoint_expiry_timeout NFS: finish_automount() requires us to hold 2 refs to the mount record NFS: Fix a few constant_table array definitions NFS: Try to join page groups before an O_DIRECT retransmission NFS: Refactor nfs_lock_and_join_requests() NFS: Reverse the submission order of requests in __nfs_pageio_add_request() NFS: Clean up nfs_lock_and_join_requests() NFS: Remove the redundant function nfs_pgio_has_mirroring() NFS: Fix memory leaks in nfs_pageio_stop_mirroring() NFS: Fix a request reference leak in nfs_direct_write_clear_reqs() NFS: Fix use-after-free issues in nfs_pageio_add_request() NFS: Fix races nfs_page_group_destroy() vs nfs_destroy_unlinked_subrequests() NFS: Fix a page leak in nfs_destroy_unlinked_subrequests() NFS: Remove unused FLUSH_SYNC support in nfs_initiate_pgio() pNFS/flexfiles: Specify the layout segment range in LAYOUTGET ...
2020-04-04SUNRPC: Don't start a timer on an already queued rpc taskTrond Myklebust
Move the test for whether a task is already queued to prevent corruption of the timer list in __rpc_sleep_on_priority_timeout(). Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-03-28Merge tag 'nfs-rdma-for-5.7-1' of ↵Trond Myklebust
git://git.linux-nfs.org/projects/anna/linux-nfs NFSoRDMA Client Updates for Linux 5.7 New Features: - Allow one active connection and several zombie connections to prevent blocking if the remote server is unresponsive. Bugfixes and Cleanups: - Enhance MR-related trace points - Refactor connection set-up and disconnect functions - Make Protection Domains per-connection instead of per-transport - Merge struct rpcrdma_ia into rpcrdma_ep
2020-03-27svcrdma: Fix leak of transport addressesChuck Lever
Kernel memory leak detected: unreferenced object 0xffff888849cdf480 (size 8): comm "kworker/u8:3", pid 2086, jiffies 4297898756 (age 4269.856s) hex dump (first 8 bytes): 30 00 cd 49 88 88 ff ff 0..I.... backtrace: [<00000000acfc370b>] __kmalloc_track_caller+0x137/0x183 [<00000000a2724354>] kstrdup+0x2b/0x43 [<0000000082964f84>] xprt_rdma_format_addresses+0x114/0x17d [rpcrdma] [<00000000dfa6ed00>] xprt_setup_rdma_bc+0xc0/0x10c [rpcrdma] [<0000000073051a83>] xprt_create_transport+0x3f/0x1a0 [sunrpc] [<0000000053531a8e>] rpc_create+0x118/0x1cd [sunrpc] [<000000003a51b5f8>] setup_callback_client+0x1a5/0x27d [nfsd] [<000000001bd410af>] nfsd4_process_cb_update.isra.7+0x16c/0x1ac [nfsd] [<000000007f4bbd56>] nfsd4_run_cb_work+0x4c/0xbd [nfsd] [<0000000055c5586b>] process_one_work+0x1b2/0x2fe [<00000000b1e3e8ef>] worker_thread+0x1a6/0x25a [<000000005205fb78>] kthread+0xf6/0xfb [<000000006d2dc057>] ret_from_fork+0x3a/0x50 Introduce a call to xprt_rdma_free_addresses() similar to the way that the TCP backchannel releases a transport's peer address strings. Fixes: 5d252f90a800 ("svcrdma: Add class for RDMA backwards direction transport") Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-27SUNRPC: Fix a potential buffer overflow in 'svc_print_xprts()'Christophe JAILLET
'maxlen' is the total size of the destination buffer. There is only one caller and this value is 256. When we compute the size already used and what we would like to add in the buffer, the trailling NULL character is not taken into account. However, this trailling character will be added by the 'strcat' once we have checked that we have enough place. So, there is a off-by-one issue and 1 byte of the stack could be erroneously overwridden. Take into account the trailling NULL, when checking if there is enough place in the destination buffer. While at it, also replace a 'sprintf' by a safer 'snprintf', check for output truncation and avoid a superfluous 'strlen'. Fixes: dc9a16e49dbba ("svc: Add /proc/sys/sunrpc/transport files") Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> [ cel: very minor fix to documenting comment Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-27xprtrdma: kmalloc rpcrdma_ep separate from rpcrdma_xprtChuck Lever
Change the rpcrdma_xprt_disconnect() function so that it no longer waits for the DISCONNECTED event. This prevents blocking if the remote is unresponsive. In rpcrdma_xprt_disconnect(), the transport's rpcrdma_ep is detached. Upon return from rpcrdma_xprt_disconnect(), the transport (r_xprt) is ready immediately for a new connection. The RDMA_CM_DEVICE_REMOVAL and RDMA_CM_DISCONNECTED events are now handled almost identically. However, because the lifetimes of rpcrdma_xprt structures and rpcrdma_ep structures are now independent, creating an rpcrdma_ep needs to take a module ref count. The ep now owns most of the hardware resources for a transport. Also, a kref is needed to ensure that rpcrdma_ep sticks around long enough for the cm_event_handler to finish. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-03-27xprtrdma: Extract sockaddr from struct rdma_cm_idChuck Lever
rpcrdma_cm_event_handler() is always passed an @id pointer that is valid. However, in a subsequent patch, we won't be able to extract an r_xprt in every case. So instead of using the r_xprt's presentation address strings, extract them from struct rdma_cm_id. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-03-27xprtrdma: Merge struct rpcrdma_ia into struct rpcrdma_epChuck Lever
I eventually want to allocate rpcrdma_ep separately from struct rpcrdma_xprt so that on occasion there can be more than one ep per xprt. The new struct rpcrdma_ep will contain all the fields currently in rpcrdma_ia and in rpcrdma_ep. This is all the device and CM settings for the connection, in addition to per-connection settings negotiated with the remote. Take this opportunity to rename the existing ep fields from rep_* to re_* to disambiguate these from struct rpcrdma_rep. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-03-27xprtrdma: Disconnect on flushed completionChuck Lever
Completion errors after a disconnect often occur much sooner than a CM_DISCONNECT event. Use this to try to detect connection loss more quickly. Note that other kernel ULPs do take care to disconnect explicitly when a WR is flushed. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-03-27xprtrdma: Remove rpcrdma_ia::ri_flagsChuck Lever
Clean up: The upper layer serializes calls to xprt_rdma_close, so there is no need for an atomic bit operation, saving 8 bytes in rpcrdma_ia. This enables merging rpcrdma_ia_remove directly into the disconnect logic. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-03-27xprtrdma: Invoke rpcrdma_ia_open in the connect workerChuck Lever
Move rdma_cm_id creation into rpcrdma_ep_create() so that it is now responsible for allocating all per-connection hardware resources. With this clean-up, all three arms of the switch statement in rpcrdma_ep_connect are exactly the same now, thus the switch can be removed. Because device removal behaves a little differently than disconnection, there is a little more work to be done before rpcrdma_ep_destroy() can release the connection's rdma_cm_id. So it is not quite symmetrical with rpcrdma_ep_create() yet. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-03-27xprtrdma: Allocate Protection Domain in rpcrdma_ep_create()Chuck Lever
Make a Protection Domain (PD) a per-connection resource rather than a per-transport resource. In other words, when the connection terminates, the PD is destroyed. Thus there is one less HW resource that remains allocated to a transport after a connection is closed. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-03-27xprtrdma: Refactor rpcrdma_ep_connect() and rpcrdma_ep_disconnect()Chuck Lever
Clean up: Simplify the synopses of functions in the connect and disconnect paths in preparation for combining the rpcrdma_ia and struct rpcrdma_ep structures. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-03-27xprtrdma: Clean up the post_send pathChuck Lever
Clean up: Simplify the synopses of functions in the post_send path by combining the struct rpcrdma_ia and struct rpcrdma_ep arguments. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-03-27xprtrdma: Refactor frwr_init_mr()Chuck Lever
Clean up: prepare for combining the rpcrdma_ia and rpcrdma_ep structures. Take the opportunity to rename the function to be consistent with the "subsystem _ object _ verb" naming scheme. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-03-27xprtrdma: Invoke rpcrdma_ep_create() in the connect workerChuck Lever
Refactor rpcrdma_ep_create(), rpcrdma_ep_disconnect(), and rpcrdma_ep_destroy(). rpcrdma_ep_create will be invoked at connect time instead of at transport set-up time. It will be responsible for allocating per- connection resources. In this patch it allocates the CQs and creates a QP. More to come. rpcrdma_ep_destroy() is the inverse functionality that is invoked at disconnect time. It will be responsible for releasing the CQs and QP. These changes should be safe to do because both connect and disconnect is guaranteed to be serialized by the transport send lock. This takes us another step closer to resolving the address and route only at connect time so that connection failover to another device will work correctly. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-03-27xprtrdma: Enhance MR-related trace pointsChuck Lever
Two changes: - Show the number of SG entries that were mapped. This helps debug DMA-related problems. - Record the MR's resource ID instead of its memory address. This groups each MR with its associated rdma-tool output, and reduces needless exposure of memory addresses. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-03-26SUNRPC: fix krb5p mount to provide large enough buffer in rq_rcvsizeOlga Kornievskaia
Ever since commit 2c94b8eca1a2 ("SUNRPC: Use au_rslack when computing reply buffer size"). It changed how "req->rq_rcvsize" is calculated. It used to use au_cslack value which was nice and large and changed it to au_rslack value which turns out to be too small. Since 5.1, v3 mount with sec=krb5p fails against an Ontap server because client's receive buffer it too small. For gss krb5p, we need to account for the mic token in the verifier, and the wrap token in the wrap token. RFC 4121 defines: mic token Octet no Name Description -------------------------------------------------------------- 0..1 TOK_ID Identification field. Tokens emitted by GSS_GetMIC() contain the hex value 04 04 expressed in big-endian order in this field. 2 Flags Attributes field, as described in section 4.2.2. 3..7 Filler Contains five octets of hex value FF. 8..15 SND_SEQ Sequence number field in clear text, expressed in big-endian order. 16..last SGN_CKSUM Checksum of the "to-be-signed" data and octet 0..15, as described in section 4.2.4. that's 16bytes (GSS_KRB5_TOK_HDR_LEN) + chksum wrap token Octet no Name Description -------------------------------------------------------------- 0..1 TOK_ID Identification field. Tokens emitted by GSS_Wrap() contain the hex value 05 04 expressed in big-endian order in this field. 2 Flags Attributes field, as described in section 4.2.2. 3 Filler Contains the hex value FF. 4..5 EC Contains the "extra count" field, in big- endian order as described in section 4.2.3. 6..7 RRC Contains the "right rotation count" in big- endian order, as described in section 4.2.5. 8..15 SND_SEQ Sequence number field in clear text, expressed in big-endian order. 16..last Data Encrypted data for Wrap tokens with confidentiality, or plaintext data followed by the checksum for Wrap tokens without confidentiality, as described in section 4.2.4. Also 16bytes of header (GSS_KRB5_TOK_HDR_LEN), encrypted data, and cksum (other things like padding) RFC 3961 defines known cksum sizes: Checksum type sumtype checksum section or value size reference --------------------------------------------------------------------- CRC32 1 4 6.1.3 rsa-md4 2 16 6.1.2 rsa-md4-des 3 24 6.2.5 des-mac 4 16 6.2.7 des-mac-k 5 8 6.2.8 rsa-md4-des-k 6 16 6.2.6 rsa-md5 7 16 6.1.1 rsa-md5-des 8 24 6.2.4 rsa-md5-des3 9 24 ?? sha1 (unkeyed) 10 20 ?? hmac-sha1-des3-kd 12 20 6.3 hmac-sha1-des3 13 20 ?? sha1 (unkeyed) 14 20 ?? hmac-sha1-96-aes128 15 20 [KRB5-AES] hmac-sha1-96-aes256 16 20 [KRB5-AES] [reserved] 0x8003 ? [GSS-KRB5] Linux kernel now mainly supports type 15,16 so max cksum size is 20bytes. (GSS_KRB5_MAX_CKSUM_LEN) Re-use already existing define of GSS_KRB5_MAX_SLACK_NEEDED that's used for encoding the gss_wrap tokens (same tokens are used in reply). Fixes: 2c94b8eca1a2 ("SUNRPC: Use au_rslack when computing reply buffer size") Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-03-16sunrpc: Add tracing for cache eventsTrond Myklebust
Add basic tracing for debugging the sunrpc cache events. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16SUNRPC/cache: Allow garbage collection of invalid cache entriesTrond Myklebust
If the cache entry never gets initialised, we want the garbage collector to be able to evict it. Otherwise if the upcall daemon fails to initialise the entry, we end up never expiring it. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> [ cel: resolved a merge conflict ] Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16nfsd: export upcalls must not return ESTALE when mountd is downTrond Myklebust
If the rpc.mountd daemon goes down, then that should not cause all exports to start failing with ESTALE errors. Let's explicitly distinguish between the cache upcall cases that need to time out, and those that do not. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16SUNRPC: Teach server to use xprt_sock_sendmsg for socket sendsChuck Lever
xprt_sock_sendmsg uses the more efficient iov_iter-enabled kernel socket API, and is a pre-requisite for server send-side support for TLS. Note that svc_process no longer needs to reserve a word for the stream record marker, since the TCP transport now provides the record marker automatically in a separate buffer. The dprintk() in svc_send_common is also removed. It didn't seem crucial for field troubleshooting. If more is needed there, a trace point could be added in xprt_sock_sendmsg(). Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16SUNRPC: Refactor xs_sendpages()Chuck Lever
Re-locate xs_sendpages() so that it can be shared with server code. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16svcrdma: Avoid DMA mapping small RPC RepliesChuck Lever
On some platforms, DMA mapping part of a page is more costly than copying bytes. Indeed, not involving the I/O MMU can help the RPC/RDMA transport scale better for tiny I/Os across more RDMA devices. This is because interaction with the I/O MMU is eliminated for each of these small I/Os. Without the explicit unmapping, the NIC no longer needs to do a costly internal TLB shoot down for buffers that are just a handful of bytes. Since pull-up is now a more a frequent operation, I've introduced a trace point in the pull-up path. It can be used for debugging or user-space tools that count pull-up frequency. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16svcrdma: Fix double sync of transport header bufferChuck Lever
Performance optimization: Avoid syncing the transport buffer twice when Reply buffer pull-up is necessary. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16svcrdma: Refactor chunk list encodersChuck Lever
Same idea as the receive-side changes I did a while back: use xdr_stream helpers rather than open-coding the XDR chunk list encoders. This builds the Reply transport header from beginning to end without backtracking. As additional clean-ups, fill in documenting comments for the XDR encoders and sprinkle some trace points in the new encoding functions. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16SUNRPC: Add encoders for list item discriminatorsChuck Lever
Clean up. These are taken from the client-side RPC/RDMA transport to a more global header file so they can be used elsewhere. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16svcrdma: Rename svcrdma_encode trace points in send routinesChuck Lever
These trace points are misnamed: trace_svcrdma_encode_wseg trace_svcrdma_encode_write trace_svcrdma_encode_reply trace_svcrdma_encode_rseg trace_svcrdma_encode_read trace_svcrdma_encode_pzr Because they actually trace posting on the Send Queue. Let's rename them so that I can add trace points in the chunk list encoders that actually do trace chunk list encoding events. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16svcrdma: Update synopsis of svc_rdma_send_reply_msg()Chuck Lever
Preparing for subsequent patches, no behavior change expected. Pass the RPC Call's svc_rdma_recv_ctxt deeper into the sendto() path. This enables passing more information about Requester- provided Write and Reply chunks into those lower-level functions. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16svcrdma: Update synopsis of svc_rdma_map_reply_msg()Chuck Lever
Preparing for subsequent patches, no behavior change expected. Pass the RPC Call's svc_rdma_recv_ctxt deeper into the sendto() path. This enables passing more information about Requester- provided Write and Reply chunks into those lower-level functions. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16svcrdma: Update synopsis of svc_rdma_send_reply_chunk()Chuck Lever
Preparing for subsequent patches, no behavior change expected. Pass the RPC Call's svc_rdma_recv_ctxt deeper into the sendto() path. This enables passing more information about Requester- provided Write and Reply chunks into the lower-level send functions. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16svcrdma: De-duplicate code that locates Write and Reply chunksChuck Lever
Cache the locations of the Requester-provided Write list and Reply chunk so that the Send path doesn't need to parse the Call header again. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16svcrdma: Use struct xdr_stream to decode ingress transport headersChuck Lever
The logic that checks incoming network headers has to be scrupulous. De-duplicate: replace open-coded buffer overflow checks with the use of xdr_stream helpers that are used most everywhere else XDR decoding is done. One minor change to the sanity checks: instead of checking the length of individual segments, cap the length of the whole chunk to be sure it can fit in the set of pages available in rq_pages. This should be a better test of whether the server can handle the chunks in each request. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16svcrdma: Remove svcrdma_cm_event() trace pointChuck Lever
Clean up. This trace point is no longer needed because the RDMA/core CMA code has an equivalent trace point that was added by commit ed999f820a6c ("RDMA/cma: Add trace points in RDMA Connection Manager"). Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16svcrdma: Create a generic tracing class for displaying xdr_buf layoutChuck Lever
This class can be used to create trace points in either the RPC client or RPC server paths. It simply displays the length of each part of an xdr_buf, which is useful to determine that the transport and XDR codecs are operating correctly. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>