summaryrefslogtreecommitdiff
path: root/net/sunrpc
AgeCommit message (Collapse)Author
2022-05-07SUNRPC: Ensure that the gssproxy client can start in a connected stateTrond Myklebust
Ensure that the gssproxy client connects to the server from the gssproxy daemon process context so that the AF_LOCAL socket connection is done using the correct path and namespaces. Fixes: 1d658336b05f ("SUNRPC: Add RPC based upcall mechanism for RPCGSS auth") Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-05-07Revert "SUNRPC: Ensure gss-proxy connects on setup"Trond Myklebust
This reverts commit 892de36fd4a98fab3298d417c051d9099af5448d. The gssproxy server is unresponsive when it calls into the kernel to start the upcall service, so it will not reply to our RPC ping at all. Reported-by: "J.Bruce Fields" <bfields@fieldses.org> Fixes: 892de36fd4a9 ("SUNRPC: Ensure gss-proxy connects on setup") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-04-29Revert "SUNRPC: attempt AF_LOCAL connect on setup"Trond Myklebust
This reverts commit 7073ea8799a8cf73db60270986f14e4aae20fa80. We must not try to connect the socket while the transport is under construction, because the mechanisms to safely tear it down are not in place. As the code stands, we end up leaking the sockets on a connection error. Reported-by: wanghai (M) <wanghai38@huawei.com> Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-04-29SUNRPC: Ensure gss-proxy connects on setupTrond Myklebust
For reasons best known to the author, gss-proxy does not implement a NULL procedure, and returns RPC_PROC_UNAVAIL. However we still want to ensure that we connect to the service at setup time. So add a quirk-flag specially for this case. Fixes: 1d658336b05f ("SUNRPC: Add RPC based upcall mechanism for RPCGSS auth") Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-04-29SUNRPC: Ensure timely close of disconnected AF_LOCAL socketsTrond Myklebust
When the rpcbind server closes the socket, we need to ensure that the socket is closed by the kernel as soon as feasible, so add a sk_state_change callback to trigger this close. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-04-28SUNRPC: Don't leak sockets in xs_local_connect()Trond Myklebust
If there is still a closed socket associated with the transport, then we need to trigger an autoclose before we can set up a new connection. Reported-by: wanghai (M) <wanghai38@huawei.com> Fixes: f00432063db1 ("SUNRPC: Ensure we flush any closed sockets before xs_xprt_free()") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-04-22SUNRPC release the transport of a relocated task with an assigned transportOlga Kornievskaia
A relocated task must release its previous transport. Fixes: 82ee41b85cef1 ("SUNRPC don't resend a task on an offlined transport") Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-04-12Merge tag 'nfsd-5.18-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux Pull nfsd fixes from Chuck Lever: - Fix a write performance regression - Fix crashes during request deferral on RDMA transports * tag 'nfsd-5.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: SUNRPC: Fix the svc_deferred_event trace class SUNRPC: Fix NFSD's request deferral on RDMA transports nfsd: Clean up nfsd_file_put() nfsd: Fix a write performance regression SUNRPC: Return true/false (not 1/0) from bool functions
2022-04-08Merge tag 'nfs-for-5.18-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds
Pull NFS client fixes from Trond Myklebust: "Stable fixes: - SUNRPC: Ensure we flush any closed sockets before xs_xprt_free() Bugfixes: - Fix an Oopsable condition due to SLAB_ACCOUNT setting in the NFSv4.2 xattr code. - Fix for open() using an file open mode of '3' in NFSv4 - Replace readdir's use of xxhash() with hash_64() - Several patches to handle malloc() failure in SUNRPC" * tag 'nfs-for-5.18-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: SUNRPC: Move the call to xprt_send_pagedata() out of xprt_sock_sendmsg() SUNRPC: svc_tcp_sendmsg() should handle errors from xdr_alloc_bvec() SUNRPC: Handle allocation failure in rpc_new_task() NFS: Ensure rpc_run_task() cannot fail in nfs_async_rename() NFSv4/pnfs: Handle RPC allocation errors in nfs4_proc_layoutget SUNRPC: Handle low memory situations in call_status() SUNRPC: Handle ENOMEM in call_transmit_status() NFSv4.2: Fix missing removal of SLAB_ACCOUNT on kmem_cache allocation SUNRPC: Ensure we flush any closed sockets before xs_xprt_free() NFS: Replace readdir's use of xxhash() with hash_64() SUNRPC: handle malloc failure in ->request_prepare NFSv4: fix open failure with O_ACCMODE flag Revert "NFSv4: Handle the special Linux file open access mode"
2022-04-07SUNRPC: Move the call to xprt_send_pagedata() out of xprt_sock_sendmsg()Trond Myklebust
The client and server have different requirements for their memory allocation, so move the allocation of the send buffer out of the socket send code that is common to both. Reported-by: NeilBrown <neilb@suse.de> Fixes: b2648015d452 ("SUNRPC: Make the rpciod and xprtiod slab allocation modes consistent") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-04-07SUNRPC: svc_tcp_sendmsg() should handle errors from xdr_alloc_bvec()Trond Myklebust
The allocation is done with GFP_KERNEL, but it could still fail in a low memory situation. Fixes: 4a85a6a3320b ("SUNRPC: Handle TCP socket sends with kernel_sendpage() again") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-04-07SUNRPC: Handle allocation failure in rpc_new_task()Trond Myklebust
If the call to rpc_alloc_task() fails, then ensure that the calldata is released, and that rpc_run_task() and rpc_run_bc_task() bail out early. Reported-by: NeilBrown <neilb@suse.de> Fixes: 910ad38697d9 ("NFS: Fix memory allocation in rpc_alloc_task()") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-04-07SUNRPC: Handle low memory situations in call_status()Trond Myklebust
We need to handle ENFILE, ENOBUFS, and ENOMEM, because xprt_wake_pending_tasks() can be called with any one of these due to socket creation failures. Fixes: b61d59fffd3e ("SUNRPC: xs_tcp_connect_worker{4,6}: merge common code") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-04-07SUNRPC: Handle ENOMEM in call_transmit_status()Trond Myklebust
Both call_transmit() and call_bc_transmit() can now return ENOMEM, so let's make sure that we handle the errors gracefully. Fixes: 0472e4766049 ("SUNRPC: Convert socket page send code to use iov_iter()") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-04-07SUNRPC: Ensure we flush any closed sockets before xs_xprt_free()Trond Myklebust
We must ensure that all sockets are closed before we call xprt_free() and release the reference to the net namespace. The problem is that calling fput() will defer closing the socket until delayed_fput() gets called. Let's fix the situation by allowing rpciod and the transport teardown code (which runs on the system wq) to call __fput_sync(), and directly close the socket. Reported-by: Felix Fu <foyjog@gmail.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Fixes: a73881c96d73 ("SUNRPC: Fix an Oops in udp_poll()") Cc: stable@vger.kernel.org # 5.1.x: 3be232f11a3c: SUNRPC: Prevent immediate close+reconnect Cc: stable@vger.kernel.org # 5.1.x: 89f42494f92f: SUNRPC: Don't call connect() more than once on a TCP socket Cc: stable@vger.kernel.org # 5.1.x Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-04-06SUNRPC: Fix NFSD's request deferral on RDMA transportsChuck Lever
Trond Myklebust reports an NFSD crash in svc_rdma_sendto(). Further investigation shows that the crash occurred while NFSD was handling a deferred request. This patch addresses two inter-related issues that prevent request deferral from working correctly for RPC/RDMA requests: 1. Prevent the crash by ensuring that the original svc_rqst::rq_xprt_ctxt value is available when the request is revisited. Otherwise svc_rdma_sendto() does not have a Receive context available with which to construct its reply. 2. Possibly since before commit 71641d99ce03 ("svcrdma: Properly compute .len and .buflen for received RPC Calls"), svc_rdma_recvfrom() did not include the transport header in the returned xdr_buf. There should have been no need for svc_defer() and friends to save and restore that header, as of that commit. This issue is addressed in a backport-friendly way by simply having svc_rdma_recvfrom() set rq_xprt_hlen to zero unconditionally, just as svc_tcp_recvfrom() does. This enables svc_deferred_recv() to correctly reconstruct an RPC message received via RPC/RDMA. Reported-by: Trond Myklebust <trondmy@hammerspace.com> Link: https://lore.kernel.org/linux-nfs/82662b7190f26fb304eb0ab1bb04279072439d4e.camel@hammerspace.com/ Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Cc: <stable@vger.kernel.org>
2022-03-29SUNRPC: handle malloc failure in ->request_prepareNeilBrown
If ->request_prepare() detects an error, it sets ->rq_task->tk_status. This is easy for callers to ignore. The only caller is xprt_request_enqueue_receive() and it does ignore the error, as does call_encode() which calls it. This can result in a request being queued to receive a reply without an allocated receive buffer. So instead of setting rq_task->tk_status, return an error, and store in ->tk_status only in call_encode(); The call to xprt_request_enqueue_receive() is now earlier in call_encode(), where the error can still be handled. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-29Merge tag 'nfs-for-5.18-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds
Pull NFS client updates from Trond Myklebust: "Highlights include: Features: - Switch NFS to use readahead instead of the obsolete readpages. - Readdir fixes to improve cacheability of large directories when there are multiple readers and writers. - Readdir performance improvements when doing a seekdir() immediately after opening the directory (common when re-exporting NFS). - NFS swap improvements from Neil Brown. - Loosen up memory allocation to permit direct reclaim and write back in cases where there is no danger of deadlocking the writeback code or NFS swap. - Avoid sillyrename when the NFSv4 server claims to support the necessary features to recover the unlinked but open file after reboot. Bugfixes: - Patch from Olga to add a mount option to control NFSv4.1 session trunking discovery, and default it to being off. - Fix a lockup in nfs_do_recoalesce(). - Two fixes for list iterator variables being used when pointing to the list head. - Fix a kernel memory scribble when reading from a non-socket transport in /sys/kernel/sunrpc. - Fix a race where reconnecting to a server could leave the TCP socket stuck forever in the connecting state. - Patch from Neil to fix a shutdown race which can leave the SUNRPC transport timer primed after we free the struct xprt itself. - Patch from Xin Xiong to fix reference count leaks in the NFSv4.2 copy offload. - Sunrpc patch from Olga to avoid resending a task on an offlined transport. Cleanups: - Patches from Dave Wysochanski to clean up the fscache code" * tag 'nfs-for-5.18-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (91 commits) NFSv4/pNFS: Fix another issue with a list iterator pointing to the head NFS: Don't loop forever in nfs_do_recoalesce() SUNRPC: Don't return error values in sysfs read of closed files SUNRPC: Do not dereference non-socket transports in sysfs NFSv4.1: don't retry BIND_CONN_TO_SESSION on session error SUNRPC don't resend a task on an offlined transport NFS: replace usage of found with dedicated list iterator variable SUNRPC: avoid race between mod_timer() and del_timer_sync() pNFS/files: Ensure pNFS allocation modes are consistent with nfsiod pNFS/flexfiles: Ensure pNFS allocation modes are consistent with nfsiod NFSv4/pnfs: Ensure pNFS allocation modes are consistent with nfsiod NFS: Avoid writeback threads getting stuck in mempool_alloc() NFS: nfsiod should not block forever in mempool_alloc() SUNRPC: Make the rpciod and xprtiod slab allocation modes consistent SUNRPC: Fix unx_lookup_cred() allocation NFS: Fix memory allocation in rpc_alloc_task() NFS: Fix memory allocation in rpc_malloc() SUNRPC: Improve accuracy of socket ENOBUFS determination SUNRPC: Replace internal use of SOCKWQ_ASYNC_NOSPACE SUNRPC: Fix socket waits for write buffer space ...
2022-03-25SUNRPC: Don't return error values in sysfs read of closed filesTrond Myklebust
Instead of returning an error value, which ends up being the return value for the read() system call, it is more elegant to simply return the error as a string value. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-25SUNRPC: Do not dereference non-socket transports in sysfsTrond Myklebust
Do not cast the struct xprt to a sock_xprt unless we know it is a UDP or TCP transport. Otherwise the call to lock the mutex will scribble over whatever structure is actually there. This has been seen to cause hard system lockups when the underlying transport was RDMA. Fixes: b49ea673e119 ("SUNRPC: lock against ->sock changing during sysfs read") Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-24Merge tag 'net-next-5.18' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "The sprinkling of SPI drivers is because we added a new one and Mark sent us a SPI driver interface conversion pull request. Core ---- - Introduce XDP multi-buffer support, allowing the use of XDP with jumbo frame MTUs and combination with Rx coalescing offloads (LRO). - Speed up netns dismantling (5x) and lower the memory cost a little. Remove unnecessary per-netns sockets. Scope some lists to a netns. Cut down RCU syncing. Use batch methods. Allow netdev registration to complete out of order. - Support distinguishing timestamp types (ingress vs egress) and maintaining them across packet scrubbing points (e.g. redirect). - Continue the work of annotating packet drop reasons throughout the stack. - Switch netdev error counters from an atomic to dynamically allocated per-CPU counters. - Rework a few preempt_disable(), local_irq_save() and busy waiting sections problematic on PREEMPT_RT. - Extend the ref_tracker to allow catching use-after-free bugs. BPF --- - Introduce "packing allocator" for BPF JIT images. JITed code is marked read only, and used to be allocated at page granularity. Custom allocator allows for more efficient memory use, lower iTLB pressure and prevents identity mapping huge pages from getting split. - Make use of BTF type annotations (e.g. __user, __percpu) to enforce the correct probe read access method, add appropriate helpers. - Convert the BPF preload to use light skeleton and drop the user-mode-driver dependency. - Allow XDP BPF_PROG_RUN test infra to send real packets, enabling its use as a packet generator. - Allow local storage memory to be allocated with GFP_KERNEL if called from a hook allowed to sleep. - Introduce fprobe (multi kprobe) to speed up mass attachment (arch bits to come later). - Add unstable conntrack lookup helpers for BPF by using the BPF kfunc infra. - Allow cgroup BPF progs to return custom errors to user space. - Add support for AF_UNIX iterator batching. - Allow iterator programs to use sleepable helpers. - Support JIT of add, and, or, xor and xchg atomic ops on arm64. - Add BTFGen support to bpftool which allows to use CO-RE in kernels without BTF info. - Large number of libbpf API improvements, cleanups and deprecations. Protocols --------- - Micro-optimize UDPv6 Tx, gaining up to 5% in test on dummy netdev. - Adjust TSO packet sizes based on min_rtt, allowing very low latency links (data centers) to always send full-sized TSO super-frames. - Make IPv6 flow label changes (AKA hash rethink) more configurable, via sysctl and setsockopt. Distinguish between server and client behavior. - VxLAN support to "collect metadata" devices to terminate only configured VNIs. This is similar to VLAN filtering in the bridge. - Support inserting IPv6 IOAM information to a fraction of frames. - Add protocol attribute to IP addresses to allow identifying where given address comes from (kernel-generated, DHCP etc.) - Support setting socket and IPv6 options via cmsg on ping6 sockets. - Reject mis-use of ECN bits in IP headers as part of DSCP/TOS. Define dscp_t and stop taking ECN bits into account in fib-rules. - Add support for locked bridge ports (for 802.1X). - tun: support NAPI for packets received from batched XDP buffs, doubling the performance in some scenarios. - IPv6 extension header handling in Open vSwitch. - Support IPv6 control message load balancing in bonding, prevent neighbor solicitation and advertisement from using the wrong port. Support NS/NA monitor selection similar to existing ARP monitor. - SMC - improve performance with TCP_CORK and sendfile() - support auto-corking - support TCP_NODELAY - MCTP (Management Component Transport Protocol) - add user space tag control interface - I2C binding driver (as specified by DMTF DSP0237) - Multi-BSSID beacon handling in AP mode for WiFi. - Bluetooth: - handle MSFT Monitor Device Event - add MGMT Adv Monitor Device Found/Lost events - Multi-Path TCP: - add support for the SO_SNDTIMEO socket option - lots of selftest cleanups and improvements - Increase the max PDU size in CAN ISOTP to 64 kB. Driver API ---------- - Add HW counters for SW netdevs, a mechanism for devices which offload packet forwarding to report packet statistics back to software interfaces such as tunnels. - Select the default NIC queue count as a fraction of number of physical CPU cores, instead of hard-coding to 8. - Expose devlink instance locks to drivers. Allow device layer of drivers to use that lock directly instead of creating their own which always runs into ordering issues in devlink callbacks. - Add header/data split indication to guide user space enabling of TCP zero-copy Rx. - Allow configuring completion queue event size. - Refactor page_pool to enable fragmenting after allocation. - Add allocation and page reuse statistics to page_pool. - Improve Multiple Spanning Trees support in the bridge to allow reuse of topologies across VLANs, saving HW resources in switches. - DSA (Distributed Switch Architecture): - replay and offload of host VLAN entries - offload of static and local FDB entries on LAG interfaces - FDB isolation and unicast filtering New hardware / drivers ---------------------- - Ethernet: - LAN937x T1 PHYs - Davicom DM9051 SPI NIC driver - Realtek RTL8367S, RTL8367RB-VB switch and MDIO - Microchip ksz8563 switches - Netronome NFP3800 SmartNICs - Fungible SmartNICs - MediaTek MT8195 switches - WiFi: - mt76: MediaTek mt7916 - mt76: MediaTek mt7921u USB adapters - brcmfmac: Broadcom BCM43454/6 - Mobile: - iosm: Intel M.2 7360 WWAN card Drivers ------- - Convert many drivers to the new phylink API built for split PCS designs but also simplifying other cases. - Intel Ethernet NICs: - add TTY for GNSS module for E810T device - improve AF_XDP performance - GTP-C and GTP-U filter offload - QinQ VLAN support - Mellanox Ethernet NICs (mlx5): - support xdp->data_meta - multi-buffer XDP - offload tc push_eth and pop_eth actions - Netronome Ethernet NICs (nfp): - flow-independent tc action hardware offload (police / meter) - AF_XDP - Other Ethernet NICs: - at803x: fiber and SFP support - xgmac: mdio: preamble suppression and custom MDC frequencies - r8169: enable ASPM L1.2 if system vendor flags it as safe - macb/gem: ZynqMP SGMII - hns3: add TX push mode - dpaa2-eth: software TSO - lan743x: multi-queue, mdio, SGMII, PTP - axienet: NAPI and GRO support - Mellanox Ethernet switches (mlxsw): - source and dest IP address rewrites - RJ45 ports - Marvell Ethernet switches (prestera): - basic routing offload - multi-chain TC ACL offload - NXP embedded Ethernet switches (ocelot & felix): - PTP over UDP with the ocelot-8021q DSA tagging protocol - basic QoS classification on Felix DSA switch using dcbnl - port mirroring for ocelot switches - Microchip high-speed industrial Ethernet (sparx5): - offloading of bridge port flooding flags - PTP Hardware Clock - Other embedded switches: - lan966x: PTP Hardward Clock - qca8k: mdio read/write operations via crafted Ethernet packets - Qualcomm 802.11ax WiFi (ath11k): - add LDPC FEC type and 802.11ax High Efficiency data in radiotap - enable RX PPDU stats in monitor co-exist mode - Intel WiFi (iwlwifi): - UHB TAS enablement via BIOS - band disablement via BIOS - channel switch offload - 32 Rx AMPDU sessions in newer devices - MediaTek WiFi (mt76): - background radar detection - thermal management improvements on mt7915 - SAR support for more mt76 platforms - MBSSID and 6 GHz band on mt7915 - RealTek WiFi: - rtw89: AP mode - rtw89: 160 MHz channels and 6 GHz band - rtw89: hardware scan - Bluetooth: - mt7921s: wake on Bluetooth, SCO over I2S, wide-band-speed (WBS) - Microchip CAN (mcp251xfd): - multiple RX-FIFOs and runtime configurable RX/TX rings - internal PLL, runtime PM handling simplification - improve chip detection and error handling after wakeup" * tag 'net-next-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2521 commits) llc: fix netdevice reference leaks in llc_ui_bind() drivers: ethernet: cpsw: fix panic when interrupt coaleceing is set via ethtool ice: don't allow to run ice_send_event_to_aux() in atomic ctx ice: fix 'scheduling while atomic' on aux critical err interrupt net/sched: fix incorrect vlan_push_eth dest field net: bridge: mst: Restrict info size queries to bridge ports net: marvell: prestera: add missing destroy_workqueue() in prestera_module_init() drivers: net: xgene: Fix regression in CRC stripping net: geneve: add missing netlink policy and size for IFLA_GENEVE_INNER_PROTO_INHERIT net: dsa: fix missing host-filtered multicast addresses net/mlx5e: Fix build warning, detected write beyond size of field iwlwifi: mvm: Don't fail if PPAG isn't supported selftests/bpf: Fix kprobe_multi test. Revert "rethook: x86: Add rethook x86 implementation" Revert "arm64: rethook: Add arm64 rethook implementation" Revert "powerpc: Add rethook support" Revert "ARM: rethook: Add rethook arm implementation" netdevice: add missing dm_private kdoc net: bridge: mst: prevent NULL deref in br_mst_info_size() selftests: forwarding: Use same VRF for port and VLAN upper ...
2022-03-24SUNRPC don't resend a task on an offlined transportOlga Kornievskaia
When a task is being retried, due to an NFS error, if the assigned transport has been put offline and the task is relocatable pick a new transport. Fixes: 6f081693e7b2b ("sunrpc: remove an offlined xprt using sysfs") Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-23SUNRPC: avoid race between mod_timer() and del_timer_sync()NeilBrown
xprt_destory() claims XPRT_LOCKED and then calls del_timer_sync(). Both xprt_unlock_connect() and xprt_release() call ->release_xprt() which drops XPRT_LOCKED and *then* xprt_schedule_autodisconnect() which calls mod_timer(). This may result in mod_timer() being called *after* del_timer_sync(). When this happens, the timer may fire long after the xprt has been freed, and run_timer_softirq() will probably crash. The pairing of ->release_xprt() and xprt_schedule_autodisconnect() is always called under ->transport_lock. So if we take ->transport_lock to call del_timer_sync(), we can be sure that mod_timer() will run first (if it runs at all). Cc: stable@vger.kernel.org Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-22Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge updates from Andrew Morton: - A few misc subsystems: kthread, scripts, ntfs, ocfs2, block, and vfs - Most the MM patches which precede the patches in Willy's tree: kasan, pagecache, gup, swap, shmem, memcg, selftests, pagemap, mremap, sparsemem, vmalloc, pagealloc, memory-failure, mlock, hugetlb, userfaultfd, vmscan, compaction, mempolicy, oom-kill, migration, thp, cma, autonuma, psi, ksm, page-poison, madvise, memory-hotplug, rmap, zswap, uaccess, ioremap, highmem, cleanups, kfence, hmm, and damon. * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (227 commits) mm/damon/sysfs: remove repeat container_of() in damon_sysfs_kdamond_release() Docs/ABI/testing: add DAMON sysfs interface ABI document Docs/admin-guide/mm/damon/usage: document DAMON sysfs interface selftests/damon: add a test for DAMON sysfs interface mm/damon/sysfs: support DAMOS stats mm/damon/sysfs: support DAMOS watermarks mm/damon/sysfs: support schemes prioritization mm/damon/sysfs: support DAMOS quotas mm/damon/sysfs: support DAMON-based Operation Schemes mm/damon/sysfs: support the physical address space monitoring mm/damon/sysfs: link DAMON for virtual address spaces monitoring mm/damon: implement a minimal stub for sysfs-based DAMON interface mm/damon/core: add number of each enum type values mm/damon/core: allow non-exclusive DAMON start/stop Docs/damon: update outdated term 'regions update interval' Docs/vm/damon/design: update DAMON-Idle Page Tracking interference handling Docs/vm/damon: call low level monitoring primitives the operations mm/damon: remove unnecessary CONFIG_DAMON option mm/damon/paddr,vaddr: remove damon_{p,v}a_{target_valid,set_operations}() mm/damon/dbgfs-test: fix is_target_id() change ...
2022-03-22fs: allocate inode by using alloc_inode_sb()Muchun Song
The inode allocation is supposed to use alloc_inode_sb(), so convert kmem_cache_alloc() of all filesystems to alloc_inode_sb(). Link: https://lkml.kernel.org/r/20220228122126.37293-5-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Theodore Ts'o <tytso@mit.edu> [ext4] Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Alex Shi <alexs@kernel.org> Cc: Anna Schumaker <Anna.Schumaker@Netapp.com> Cc: Chao Yu <chao@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Fam Zheng <fam.zheng@bytedance.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kari Argillander <kari.argillander@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-22SUNRPC: Make the rpciod and xprtiod slab allocation modes consistentTrond Myklebust
Make sure that rpciod and xprtiod are always using the same slab allocation modes. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-22SUNRPC: Fix unx_lookup_cred() allocationTrond Myklebust
Default to the same mempool allocation strategy as for rpc_malloc(). Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-22NFS: Fix memory allocation in rpc_alloc_task()Trond Myklebust
As for rpc_malloc(), we first try allocating from the slab, then fall back to a non-waiting allocation from the mempool. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-22NFS: Fix memory allocation in rpc_malloc()Trond Myklebust
When in a low memory situation, we do want rpciod to kick off direct reclaim in the case where that helps, however we don't want it looping forever in mempool_alloc(). So first try allocating from the slab using GFP_KERNEL | __GFP_NORETRY, and then fall back to a GFP_NOWAIT allocation from the mempool. Ditto for rpc_alloc_task() Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-22SUNRPC: Improve accuracy of socket ENOBUFS determinationTrond Myklebust
The current code checks for whether or not the socket is in a writeable state after we get an EAGAIN. That is racy, since we've dropped the socket lock, so the amount of free buffer may have changed. Instead, let's check whether the socket is writeable before we try to write to it. If that was the case, we do expect the message to be at least partially sent unless we're in a low memory situation. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-22SUNRPC: Replace internal use of SOCKWQ_ASYNC_NOSPACETrond Myklebust
The socket's SOCKWQ_ASYNC_NOSPACE can be cleared by various actors in the socket layer, so replace it with our own flag in the transport sock_state field. Reported-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-22SUNRPC: Fix socket waits for write buffer spaceTrond Myklebust
The socket layer requires that we use the socket lock to protect changes to the sock->sk_write_pending field and others. Reported-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-22SUNRPC: Only save the TCP source port after the connection is completeTrond Myklebust
Since the RPC client uses a non-blocking connect(), we do not expect to see it return '0' under normal circumstances. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-22SUNRPC: Don't call connect() more than once on a TCP socketTrond Myklebust
Avoid socket state races due to repeated calls to ->connect() using the same socket. If connect() returns 0 due to the connection having completed, but we are in fact in a closing state, then we may leave the XPRT_CONNECTING flag set on the transport. Reported-by: Enrico Scholz <enrico.scholz@sigma-chemnitz.de> Fixes: 3be232f11a3c ("SUNRPC: Prevent immediate close+reconnect") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-13SUNRPC: change locking for xs_swap_enable/disableNeilBrown
It is not in general safe to wait for XPRT_LOCKED to clear. A wakeup is only sent when - connection completes - sock close completes so during normal operations, this can wait indefinitely. The event we need to protect against is ->inet being set to NULL, and that happens under the recv_mutex lock. So drop the handlign of XPRT_LOCKED and use recv_mutex instead. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-13NFSv4: keep state manager thread active if swap is enabledNeilBrown
If we are swapping over NFSv4, we may not be able to allocate memory to start the state-manager thread at the time when we need it. So keep it always running when swap is enabled, and just signal it to start. This requires updating and testing the cl_swapper count on the root rpc_clnt after following all ->cl_parent links. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-13SUNRPC: improve 'swap' handling: scheduling and PF_MEMALLOCNeilBrown
rpc tasks can be marked as RPC_TASK_SWAPPER. This causes GFP_MEMALLOC to be used for some allocations. This is needed in some cases, but not in all where it is currently provided, and in some where it isn't provided. Currently *all* tasks associated with a rpc_client on which swap is enabled get the flag and hence some GFP_MEMALLOC support. GFP_MEMALLOC is provided for ->buf_alloc() but only swap-writes need it. However xdr_alloc_bvec does not get GFP_MEMALLOC - though it often does need it. xdr_alloc_bvec is called while the XPRT_LOCK is held. If this blocks, then it blocks all other queued tasks. So this allocation needs GFP_MEMALLOC for *all* requests, not just writes, when the xprt is used for any swap writes. Similarly, if the transport is not connected, that will block all requests including swap writes, so memory allocations should get GFP_MEMALLOC if swap writes are possible. So with this patch: 1/ we ONLY set RPC_TASK_SWAPPER for swap writes. 2/ __rpc_execute() sets PF_MEMALLOC while handling any task with RPC_TASK_SWAPPER set, or when handling any task that holds the XPRT_LOCKED lock on an xprt used for swap. This removes the need for the RPC_IS_SWAPPER() test in ->buf_alloc handlers. 3/ xprt_prepare_transmit() sets PF_MEMALLOC after locking any task to a swapper xprt. __rpc_execute() will clear it. 3/ PF_MEMALLOC is set for all the connect workers. Reviewed-by: Chuck Lever <chuck.lever@oracle.com> (for xprtrdma parts) Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-13NFS: discard NFS_RPC_SWAPFLAGS and RPC_TASK_ROOTCREDSNeilBrown
NFS_RPC_SWAPFLAGS is only used for READ requests. It sets RPC_TASK_SWAPPER which gives some memory-allocation priority to requests. This is not needed for swap READ - though it is for writes where it is set via a different mechanism. RPC_TASK_ROOTCREDS causes the 'machine' credential to be used. This is not needed as the root credential is saved when the swap file is opened, and this is used for all IO. So NFS_RPC_SWAPFLAGS isn't needed, and as it is the only user of RPC_TASK_ROOTCREDS, that isn't needed either. Remove both. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-13SUNRPC: remove scheduling boost for "SWAPPER" tasks.NeilBrown
Currently, tasks marked as "swapper" tasks get put to the front of non-priority rpc_queues, and are sorted earlier than non-swapper tasks on the transport's ->xmit_queue. This is pointless as currently *all* tasks for a mount that has swap enabled on *any* file are marked as "swapper" tasks. So the net result is that the non-priority rpc_queues are reverse-ordered (LIFO). This scheduling boost is not necessary to avoid deadlocks, and hurts fairness, so remove it. If there were a need to expedite some requests, the tk_priority mechanism is a more appropriate tool. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-13SUNRPC/xprt: async tasks mustn't block waiting for memoryNeilBrown
When memory is short, new worker threads cannot be created and we depend on the minimum one rpciod thread to be able to handle everything. So it must not block waiting for memory. xprt_dynamic_alloc_slot can block indefinitely. This can tie up all workqueue threads and NFS can deadlock. So when called from a workqueue, set __GFP_NORETRY. The rdma alloc_slot already does not block. However it sets the error to -EAGAIN suggesting this will trigger a sleep. It does not. As we can see in call_reserveresult(), only -ENOMEM causes a sleep. -EAGAIN causes immediate retry. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-13SUNRPC/auth: async tasks mustn't block waiting for memoryNeilBrown
When memory is short, new worker threads cannot be created and we depend on the minimum one rpciod thread to be able to handle everything. So it must not block waiting for memory. mempools are particularly a problem as memory can only be released back to the mempool by an async rpc task running. If all available workqueue threads are waiting on the mempool, no thread is available to return anything. lookup_cred() can block on a mempool or kmalloc - and this can cause deadlocks. So add a new RPCAUTH_LOOKUP flag for async lookups and don't block on memory. If the -ENOMEM gets back to call_refreshresult(), wait a short while and try again. HZ>>4 is chosen as it is used elsewhere for -ENOMEM retries. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-13SUNRPC/call_alloc: async tasks mustn't block waiting for memoryNeilBrown
When memory is short, new worker threads cannot be created and we depend on the minimum one rpciod thread to be able to handle everything. So it must not block waiting for memory. mempools are particularly a problem as memory can only be released back to the mempool by an async rpc task running. If all available workqueue threads are waiting on the mempool, no thread is available to return anything. rpc_malloc() can block, and this might cause deadlocks. So check RPC_IS_ASYNC(), rather than RPC_IS_SWAPPER() to determine if blocking is acceptable. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-02-28SUNRPC: Teach server to recognize RPC_AUTH_TLSChuck Lever
Initial support for the RPC_AUTH_TLS authentication flavor enables NFSD to eventually accept an RPC_AUTH_TLS probe from clients. This patch simply prevents NFSD from rejecting these probes completely. In the meantime, graft this support in now so that RPC_AUTH_TLS support keeps up with generic code and API changes in the RPC server. Down the road, server-side transport implementations will populate xpo_start_tls when they can support RPC-with-TLS. For example, TCP will eventually populate it, but RDMA won't. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-02-28NFSD: Move svc_serv_ops::svo_function into struct svc_servChuck Lever
Hoist svo_function back into svc_serv and remove struct svc_serv_ops, since the struct is now devoid of fields. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-02-28NFSD: Remove svc_serv_ops::svo_moduleChuck Lever
struct svc_serv_ops is about to be removed. Neil Brown says: > I suspect svo_module can go as well - I don't think the thread is > ever the thing that primarily keeps a module active. A random sample of kthread_create() callers shows sunrpc is the only one that manages module reference count in this way. Suggested-by: Neil Brown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-02-28SUNRPC: Remove svc_shutdown_net()Chuck Lever
Clean up: svc_shutdown_net() now does nothing but call svc_close_net(). Replace all external call sites. svc_close_net() is renamed to be the inverse of svc_xprt_create(). Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-02-28SUNRPC: Rename svc_close_xprt()Chuck Lever
Clean up: Use the "svc_xprt_<task>" function naming convention as is used for other external APIs. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-02-28SUNRPC: Rename svc_create_xprt()Chuck Lever
Clean up: Use the "svc_xprt_<task>" function naming convention as is used for other external APIs. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-02-28SUNRPC: Remove svo_shutdown methodChuck Lever
Clean up. Neil observed that "any code that calls svc_shutdown_net() knows what the shutdown function should be, and so can call it directly." Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: NeilBrown <neilb@suse.de>
2022-02-28SUNRPC: Merge svc_do_enqueue_xprt() into svc_enqueue_xprt()Chuck Lever
Neil says: "These functions were separated in commit 0971374e2818 ("SUNRPC: Reduce contention in svc_xprt_enqueue()") so that the XPT_BUSY check happened before taking any spinlocks. We have since moved or removed the spinlocks so the extra test is fairly pointless." I've made this a separate patch in case the XPT_BUSY change has unexpected consequences and needs to be reverted. Suggested-by: Neil Brown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>