5 files changed, 731 insertions, 0 deletions
diff --git a/Documentation/filesystems/fuse/fuse-io-uring.rst b/Documentation/filesystems/fuse/fuse-io-uring.rst
new file mode 100644
index 000000000000..d73dd0dbd238
--- /dev/null
+++ b/Documentation/filesystems/fuse/fuse-io-uring.rst
@@ -0,0 +1,99 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=======================================
+FUSE-over-io-uring design documentation
+=======================================
+
+This documentation covers basic details how the fuse
+kernel/userspace communication through io-uring is configured
+and works. For generic details about FUSE see fuse.rst.
+
+This document also covers the current interface, which is
+still in development and might change.
+
+Limitations
+===========
+As of now not all requests types are supported through io-uring, userspace
+is required to also handle requests through /dev/fuse after io-uring setup
+is complete. Specifically notifications (initiated from the daemon side)
+and interrupts.
+
+Fuse io-uring configuration
+===========================
+
+Fuse kernel requests are queued through the classical /dev/fuse
+read/write interface - until io-uring setup is complete.
+
+In order to set up fuse-over-io-uring fuse-server (user-space)
+needs to submit SQEs (opcode = IORING_OP_URING_CMD) to the /dev/fuse
+connection file descriptor. Initial submit is with the sub command
+FUSE_URING_REQ_REGISTER, which will just register entries to be
+available in the kernel.
+
+Once at least one entry per queue is submitted, kernel starts
+to enqueue to ring queues.
+Note, every CPU core has its own fuse-io-uring queue.
+Userspace handles the CQE/fuse-request and submits the result as
+subcommand FUSE_URING_REQ_COMMIT_AND_FETCH - kernel completes
+the requests and also marks the entry available again. If there are
+pending requests waiting the request will be immediately submitted
+to the daemon again.
+
+Initial SQE
+-----------::
+
+ |                                    |  FUSE filesystem daemon
+ |                                    |
+ |                                    |  >io_uring_submit()
+ |                                    |   IORING_OP_URING_CMD /
+ |                                    |   FUSE_URING_CMD_REGISTER
+ |                                    |  [wait cqe]
+ |                                    |   >io_uring_wait_cqe() or
+ |                                    |   >io_uring_submit_and_wait()
+ |                                    |
+ |  >fuse_uring_cmd()                 |
+ |   >fuse_uring_register()           |
+
+
+Sending requests with CQEs
+--------------------------::
+
+ |                                           |  FUSE filesystem daemon
+ |                                           |  [waiting for CQEs]
+ |  "rm /mnt/fuse/file"                      |
+ |                                           |
+ |  >sys_unlink()                            |
+ |    >fuse_unlink()                         |
+ |      [allocate request]                   |
+ |      >fuse_send_one()                     |
+ |        ...                                |
+ |       >fuse_uring_queue_fuse_req          |
+ |        [queue request on fg queue]        |
+ |         >fuse_uring_add_req_to_ring_ent() |
+ |         ...                               |
+ |          >fuse_uring_copy_to_ring()       |
+ |          >io_uring_cmd_done()             |
+ |       >request_wait_answer()              |
+ |         [sleep on req->waitq]             |
+ |                                           |  [receives and handles CQE]
+ |                                           |  [submit result and fetch next]
+ |                                           |  >io_uring_submit()
+ |                                           |   IORING_OP_URING_CMD/
+ |                                           |   FUSE_URING_CMD_COMMIT_AND_FETCH
+ |  >fuse_uring_cmd()                        |
+ |   >fuse_uring_commit_fetch()              |
+ |    >fuse_uring_commit()                   |
+ |     >fuse_uring_copy_from_ring()          |
+ |      [ copy the result to the fuse req]   |
+ |     >fuse_uring_req_end()                 |
+ |      >fuse_request_end()                  |
+ |       [wake up req->waitq]                |
+ |    >fuse_uring_next_fuse_req              |
+ |       [wait or handle next req]           |
+ |                                           |
+ |       [req->waitq woken up]               |
+ |    <fuse_unlink()                         |
+ |  <sys_unlink()                            |
+
+
+
diff --git a/Documentation/filesystems/fuse/fuse-io.rst b/Documentation/filesystems/fuse/fuse-io.rst
new file mode 100644
index 000000000000..d736ac4cb483
--- /dev/null
+++ b/Documentation/filesystems/fuse/fuse-io.rst
@@ -0,0 +1,45 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==============
+FUSE I/O Modes
+==============
+
+Fuse supports the following I/O modes:
+
+- direct-io
+- cached
+  + write-through
+  + writeback-cache
+
+The direct-io mode can be selected with the FOPEN_DIRECT_IO flag in the
+FUSE_OPEN reply.
+
+In direct-io mode the page cache is completely bypassed for reads and writes.
+No read-ahead takes place. Shared mmap is disabled by default. To allow shared
+mmap, the FUSE_DIRECT_IO_ALLOW_MMAP flag may be enabled in the FUSE_INIT reply.
+
+In cached mode reads may be satisfied from the page cache, and data may be
+read-ahead by the kernel to fill the cache.  The cache is always kept consistent
+after any writes to the file.  All mmap modes are supported.
+
+The cached mode has two sub modes controlling how writes are handled.  The
+write-through mode is the default and is supported on all kernels.  The
+writeback-cache mode may be selected by the FUSE_WRITEBACK_CACHE flag in the
+FUSE_INIT reply.
+
+In write-through mode each write is immediately sent to userspace as one or more
+WRITE requests, as well as updating any cached pages (and caching previously
+uncached, but fully written pages).  No READ requests are ever sent for writes,
+so when an uncached page is partially written, the page is discarded.
+
+In writeback-cache mode (enabled by the FUSE_WRITEBACK_CACHE flag) writes go to
+the cache only, which means that the write(2) syscall can often complete very
+fast.  Dirty pages are written back implicitly (background writeback or page
+reclaim on memory pressure) or explicitly (invoked by close(2), fsync(2) and
+when the last ref to the file is being released on munmap(2)).  This mode
+assumes that all changes to the filesystem go through the FUSE kernel module
+(size and atime/ctime/mtime attributes are kept up-to-date by the kernel), so
+it's generally not suitable for network filesystems.  If a partial page is
+written, then the page needs to be first read from userspace.  This means, that
+even for files opened for O_WRONLY it is possible that READ requests will be
+generated by the kernel.
diff --git a/Documentation/filesystems/fuse/fuse-passthrough.rst b/Documentation/filesystems/fuse/fuse-passthrough.rst
new file mode 100644
index 000000000000..2b0e7c2da54a
--- /dev/null
+++ b/Documentation/filesystems/fuse/fuse-passthrough.rst
@@ -0,0 +1,133 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+================
+FUSE Passthrough
+================
+
+Introduction
+============
+
+FUSE (Filesystem in Userspace) passthrough is a feature designed to improve the
+performance of FUSE filesystems for I/O operations. Typically, FUSE operations
+involve communication between the kernel and a userspace FUSE daemon, which can
+incur overhead. Passthrough allows certain operations on a FUSE file to bypass
+the userspace daemon and be executed directly by the kernel on an underlying
+"backing file".
+
+This is achieved by the FUSE daemon registering a file descriptor (pointing to
+the backing file on a lower filesystem) with the FUSE kernel module. The kernel
+then receives an identifier (``backing_id``) for this registered backing file.
+When a FUSE file is subsequently opened, the FUSE daemon can, in its response to
+the ``OPEN`` request, include this ``backing_id`` and set the
+``FOPEN_PASSTHROUGH`` flag. This establishes a direct link for specific
+operations.
+
+Currently, passthrough is supported for operations like ``read(2)``/``write(2)``
+(via ``read_iter``/``write_iter``), ``splice(2)``, and ``mmap(2)``.
+
+Enabling Passthrough
+====================
+
+To use FUSE passthrough:
+
+  1. The FUSE filesystem must be compiled with ``CONFIG_FUSE_PASSTHROUGH``
+     enabled.
+  2. The FUSE daemon, during the ``FUSE_INIT`` handshake, must negotiate the
+     ``FUSE_PASSTHROUGH`` capability and specify its desired
+     ``max_stack_depth``.
+  3. The (privileged) FUSE daemon uses the ``FUSE_DEV_IOC_BACKING_OPEN`` ioctl
+     on its connection file descriptor (e.g., ``/dev/fuse``) to register a
+     backing file descriptor and obtain a ``backing_id``.
+  4. When handling an ``OPEN`` or ``CREATE`` request for a FUSE file, the daemon
+     replies with the ``FOPEN_PASSTHROUGH`` flag set in
+     ``fuse_open_out::open_flags`` and provides the corresponding ``backing_id``
+     in ``fuse_open_out::backing_id``.
+  5. The FUSE daemon should eventually call ``FUSE_DEV_IOC_BACKING_CLOSE`` with
+     the ``backing_id`` to release the kernel's reference to the backing file
+     when it's no longer needed for passthrough setups.
+
+Privilege Requirements
+======================
+
+Setting up passthrough functionality currently requires the FUSE daemon to
+possess the ``CAP_SYS_ADMIN`` capability. This requirement stems from several
+security and resource management considerations that are actively being
+discussed and worked on. The primary reasons for this restriction are detailed
+below.
+
+Resource Accounting and Visibility
+----------------------------------
+
+The core mechanism for passthrough involves the FUSE daemon opening a file
+descriptor to a backing file and registering it with the FUSE kernel module via
+the ``FUSE_DEV_IOC_BACKING_OPEN`` ioctl. This ioctl returns a ``backing_id``
+associated with a kernel-internal ``struct fuse_backing`` object, which holds a
+reference to the backing ``struct file``.
+
+A significant concern arises because the FUSE daemon can close its own file
+descriptor to the backing file after registration. The kernel, however, will
+still hold a reference to the ``struct file`` via the ``struct fuse_backing``
+object as long as it's associated with a ``backing_id`` (or subsequently, with
+an open FUSE file in passthrough mode).
+
+This behavior leads to two main issues for unprivileged FUSE daemons:
+
+  1. **Invisibility to lsof and other inspection tools**: Once the FUSE
+     daemon closes its file descriptor, the open backing file held by the kernel
+     becomes "hidden." Standard tools like ``lsof``, which typically inspect
+     process file descriptor tables, would not be able to identify that this
+     file is still open by the system on behalf of the FUSE filesystem. This
+     makes it difficult for system administrators to track resource usage or
+     debug issues related to open files (e.g., preventing unmounts).
+
+  2. **Bypassing RLIMIT_NOFILE**: The FUSE daemon process is subject to
+     resource limits, including the maximum number of open file descriptors
+     (``RLIMIT_NOFILE``). If an unprivileged daemon could register backing files
+     and then close its own FDs, it could potentially cause the kernel to hold
+     an unlimited number of open ``struct file`` references without these being
+     accounted against the daemon's ``RLIMIT_NOFILE``. This could lead to a
+     denial-of-service (DoS) by exhausting system-wide file resources.
+
+The ``CAP_SYS_ADMIN`` requirement acts as a safeguard against these issues,
+restricting this powerful capability to trusted processes.
+
+**NOTE**: ``io_uring`` solves this similar issue by exposing its "fixed files",
+which are visible via ``fdinfo`` and accounted under the registering user's
+``RLIMIT_NOFILE``.
+
+Filesystem Stacking and Shutdown Loops
+--------------------------------------
+
+Another concern relates to the potential for creating complex and problematic
+filesystem stacking scenarios if unprivileged users could set up passthrough.
+A FUSE passthrough filesystem might use a backing file that resides:
+
+  * On the *same* FUSE filesystem.
+  * On another filesystem (like OverlayFS) which itself might have an upper or
+    lower layer that is a FUSE filesystem.
+
+These configurations could create dependency loops, particularly during
+filesystem shutdown or unmount sequences, leading to deadlocks or system
+instability. This is conceptually similar to the risks associated with the
+``LOOP_SET_FD`` ioctl, which also requires ``CAP_SYS_ADMIN``.
+
+To mitigate this, FUSE passthrough already incorporates checks based on
+filesystem stacking depth (``sb->s_stack_depth`` and ``fc->max_stack_depth``).
+For example, during the ``FUSE_INIT`` handshake, the FUSE daemon can negotiate
+the ``max_stack_depth`` it supports. When a backing file is registered via
+``FUSE_DEV_IOC_BACKING_OPEN``, the kernel checks if the backing file's
+filesystem stack depth is within the allowed limit.
+
+The ``CAP_SYS_ADMIN`` requirement provides an additional layer of security,
+ensuring that only privileged users can create these potentially complex
+stacking arrangements.
+
+General Security Posture
+------------------------
+
+As a general principle for new kernel features that allow userspace to instruct
+the kernel to perform direct operations on its behalf based on user-provided
+file descriptors, starting with a higher privilege requirement (like
+``CAP_SYS_ADMIN``) is a conservative and common security practice. This allows
+the feature to be used and tested while further security implications are
+evaluated and addressed.
diff --git a/Documentation/filesystems/fuse/fuse.rst b/Documentation/filesystems/fuse/fuse.rst
new file mode 100644
index 000000000000..0fbd5a03fdc9
--- /dev/null
+++ b/Documentation/filesystems/fuse/fuse.rst
@@ -0,0 +1,440 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============
+FUSE Overview
+=============
+
+Definitions
+===========
+
+Userspace filesystem:
+  A filesystem in which data and metadata are provided by an ordinary
+  userspace process.  The filesystem can be accessed normally through
+  the kernel interface.
+
+Filesystem daemon:
+  The process(es) providing the data and metadata of the filesystem.
+
+Non-privileged mount (or user mount):
+  A userspace filesystem mounted by a non-privileged (non-root) user.
+  The filesystem daemon is running with the privileges of the mounting
+  user.  NOTE: this is not the same as mounts allowed with the "user"
+  option in /etc/fstab, which is not discussed here.
+
+Filesystem connection:
+  A connection between the filesystem daemon and the kernel.  The
+  connection exists until either the daemon dies, or the filesystem is
+  umounted.  Note that detaching (or lazy umounting) the filesystem
+  does *not* break the connection, in this case it will exist until
+  the last reference to the filesystem is released.
+
+Mount owner:
+  The user who does the mounting.
+
+User:
+  The user who is performing filesystem operations.
+
+What is FUSE?
+=============
+
+FUSE is a userspace filesystem framework.  It consists of a kernel
+module (fuse.ko), a userspace library (libfuse.*) and a mount utility
+(fusermount).
+
+One of the most important features of FUSE is allowing secure,
+non-privileged mounts.  This opens up new possibilities for the use of
+filesystems.  A good example is sshfs: a secure network filesystem
+using the sftp protocol.
+
+The userspace library and utilities are available from the
+`FUSE homepage: <https://github.com/libfuse/>`_
+
+Filesystem type
+===============
+
+The filesystem type given to mount(2) can be one of the following:
+
+    fuse
+      This is the usual way to mount a FUSE filesystem.  The first
+      argument of the mount system call may contain an arbitrary string,
+      which is not interpreted by the kernel.
+
+    fuseblk
+      The filesystem is block device based.  The first argument of the
+      mount system call is interpreted as the name of the device.
+
+Mount options
+=============
+
+fd=N
+  The file descriptor to use for communication between the userspace
+  filesystem and the kernel.  The file descriptor must have been
+  obtained by opening the FUSE device ('/dev/fuse').
+
+rootmode=M
+  The file mode of the filesystem's root in octal representation.
+
+user_id=N
+  The numeric user id of the mount owner.
+
+group_id=N
+  The numeric group id of the mount owner.
+
+default_permissions
+  By default FUSE doesn't check file access permissions, the
+  filesystem is free to implement its access policy or leave it to
+  the underlying file access mechanism (e.g. in case of network
+  filesystems).  This option enables permission checking, restricting
+  access based on file mode.  It is usually useful together with the
+  'allow_other' mount option.
+
+allow_other
+  This option overrides the security measure restricting file access
+  to the user mounting the filesystem.  This option is by default only
+  allowed to root, but this restriction can be removed with a
+  (userspace) configuration option.
+
+max_read=N
+  With this option the maximum size of read operations can be set.
+  The default is infinite.  Note that the size of read requests is
+  limited anyway to 32 pages (which is 128kbyte on i386).
+
+blksize=N
+  Set the block size for the filesystem.  The default is 512.  This
+  option is only valid for 'fuseblk' type mounts.
+
+Control filesystem
+==================
+
+There's a control filesystem for FUSE, which can be mounted by::
+
+  mount -t fusectl none /sys/fs/fuse/connections
+
+Mounting it under the '/sys/fs/fuse/connections' directory makes it
+backwards compatible with earlier versions.
+
+Under the fuse control filesystem each connection has a directory
+named by a unique number.
+
+For each connection the following files exist within this directory:
+
+	waiting
+	  The number of requests which are waiting to be transferred to
+	  userspace or being processed by the filesystem daemon.  If there is
+	  no filesystem activity and 'waiting' is non-zero, then the
+	  filesystem is hung or deadlocked.
+
+	abort
+	  Writing anything into this file will abort the filesystem
+	  connection.  This means that all waiting requests will be aborted an
+	  error returned for all aborted and new requests.
+
+        max_background
+          The maximum number of background requests that can be outstanding
+          at a time. When the number of background requests reaches this limit,
+          further requests will be blocked until some are completed, potentially
+          causing I/O operations to stall.
+
+        congestion_threshold
+          The threshold of background requests at which the kernel considers
+          the filesystem to be congested. When the number of background requests
+          exceeds this value, the kernel will skip asynchronous readahead
+          operations, reducing read-ahead optimizations but preserving essential
+          I/O, as well as suspending non-synchronous writeback operations
+          (WB_SYNC_NONE), delaying page cache flushing to the filesystem.
+
+Only the owner of the mount may read or write these files.
+
+Interrupting filesystem operations
+##################################
+
+If a process issuing a FUSE filesystem request is interrupted, the
+following will happen:
+
+  -  If the request is not yet sent to userspace AND the signal is
+     fatal (SIGKILL or unhandled fatal signal), then the request is
+     dequeued and returns immediately.
+
+  -  If the request is not yet sent to userspace AND the signal is not
+     fatal, then an interrupted flag is set for the request.  When
+     the request has been successfully transferred to userspace and
+     this flag is set, an INTERRUPT request is queued.
+
+  -  If the request is already sent to userspace, then an INTERRUPT
+     request is queued.
+
+INTERRUPT requests take precedence over other requests, so the
+userspace filesystem will receive queued INTERRUPTs before any others.
+
+The userspace filesystem may ignore the INTERRUPT requests entirely,
+or may honor them by sending a reply to the *original* request, with
+the error set to EINTR.
+
+It is also possible that there's a race between processing the
+original request and its INTERRUPT request.  There are two possibilities:
+
+  1. The INTERRUPT request is processed before the original request is
+     processed
+
+  2. The INTERRUPT request is processed after the original request has
+     been answered
+
+If the filesystem cannot find the original request, it should wait for
+some timeout and/or a number of new requests to arrive, after which it
+should reply to the INTERRUPT request with an EAGAIN error.  In case
+1) the INTERRUPT request will be requeued.  In case 2) the INTERRUPT
+reply will be ignored.
+
+Aborting a filesystem connection
+================================
+
+It is possible to get into certain situations where the filesystem is
+not responding.  Reasons for this may be:
+
+  a) Broken userspace filesystem implementation
+
+  b) Network connection down
+
+  c) Accidental deadlock
+
+  d) Malicious deadlock
+
+(For more on c) and d) see later sections)
+
+In either of these cases it may be useful to abort the connection to
+the filesystem.  There are several ways to do this:
+
+  - Kill the filesystem daemon.  Works in case of a) and b)
+
+  - Kill the filesystem daemon and all users of the filesystem.  Works
+    in all cases except some malicious deadlocks
+
+  - Use forced umount (umount -f).  Works in all cases but only if
+    filesystem is still attached (it hasn't been lazy unmounted)
+
+  - Abort filesystem through the FUSE control filesystem.  Most
+    powerful method, always works.
+
+How do non-privileged mounts work?
+==================================
+
+Since the mount() system call is a privileged operation, a helper
+program (fusermount) is needed, which is installed setuid root.
+
+The implication of providing non-privileged mounts is that the mount
+owner must not be able to use this capability to compromise the
+system.  Obvious requirements arising from this are:
+
+ A) mount owner should not be able to get elevated privileges with the
+    help of the mounted filesystem
+
+ B) mount owner should not get illegitimate access to information from
+    other users' and the super user's processes
+
+ C) mount owner should not be able to induce undesired behavior in
+    other users' or the super user's processes
+
+How are requirements fulfilled?
+===============================
+
+ A) The mount owner could gain elevated privileges by either:
+
+    1. creating a filesystem containing a device file, then opening this device
+
+    2. creating a filesystem containing a suid or sgid application, then executing this application
+
+    The solution is not to allow opening device files and ignore
+    setuid and setgid bits when executing programs.  To ensure this
+    fusermount always adds "nosuid" and "nodev" to the mount options
+    for non-privileged mounts.
+
+ B) If another user is accessing files or directories in the
+    filesystem, the filesystem daemon serving requests can record the
+    exact sequence and timing of operations performed.  This
+    information is otherwise inaccessible to the mount owner, so this
+    counts as an information leak.
+
+    The solution to this problem will be presented in point 2) of C).
+
+ C) There are several ways in which the mount owner can induce
+    undesired behavior in other users' processes, such as:
+
+     1) mounting a filesystem over a file or directory which the mount
+        owner could otherwise not be able to modify (or could only
+        make limited modifications).
+
+        This is solved in fusermount, by checking the access
+        permissions on the mountpoint and only allowing the mount if
+        the mount owner can do unlimited modification (has write
+        access to the mountpoint, and mountpoint is not a "sticky"
+        directory)
+
+     2) Even if 1) is solved the mount owner can change the behavior
+        of other users' processes.
+
+         i) It can slow down or indefinitely delay the execution of a
+            filesystem operation creating a DoS against the user or the
+            whole system.  For example a suid application locking a
+            system file, and then accessing a file on the mount owner's
+            filesystem could be stopped, and thus causing the system
+            file to be locked forever.
+
+         ii) It can present files or directories of unlimited length, or
+             directory structures of unlimited depth, possibly causing a
+             system process to eat up diskspace, memory or other
+             resources, again causing *DoS*.
+
+	The solution to this as well as B) is not to allow processes
+	to access the filesystem, which could otherwise not be
+	monitored or manipulated by the mount owner.  Since if the
+	mount owner can ptrace a process, it can do all of the above
+	without using a FUSE mount, the same criteria as used in
+	ptrace can be used to check if a process is allowed to access
+	the filesystem or not.
+
+	Note that the *ptrace* check is not strictly necessary to
+	prevent C/2/i, it is enough to check if mount owner has enough
+	privilege to send signal to the process accessing the
+	filesystem, since *SIGSTOP* can be used to get a similar effect.
+
+I think these limitations are unacceptable?
+===========================================
+
+If a sysadmin trusts the users enough, or can ensure through other
+measures, that system processes will never enter non-privileged
+mounts, it can relax the last limitation in several ways:
+
+  - With the 'user_allow_other' config option. If this config option is
+    set, the mounting user can add the 'allow_other' mount option which
+    disables the check for other users' processes.
+
+    User namespaces have an unintuitive interaction with 'allow_other':
+    an unprivileged user - normally restricted from mounting with
+    'allow_other' - could do so in a user namespace where they're
+    privileged. If any process could access such an 'allow_other' mount
+    this would give the mounting user the ability to manipulate
+    processes in user namespaces where they're unprivileged. For this
+    reason 'allow_other' restricts access to users in the same userns
+    or a descendant.
+
+  - With the 'allow_sys_admin_access' module option. If this option is
+    set, super user's processes have unrestricted access to mounts
+    irrespective of allow_other setting or user namespace of the
+    mounting user.
+
+Note that both of these relaxations expose the system to potential
+information leak or *DoS* as described in points B and C/2/i-ii in the
+preceding section.
+
+Kernel - userspace interface
+============================
+
+The following diagram shows how a filesystem operation (in this
+example unlink) is performed in FUSE. ::
+
+
+ |  "rm /mnt/fuse/file"               |  FUSE filesystem daemon
+ |                                    |
+ |                                    |  >sys_read()
+ |                                    |    >fuse_dev_read()
+ |                                    |      >request_wait()
+ |                                    |        [sleep on fc->waitq]
+ |                                    |
+ |  >sys_unlink()                     |
+ |    >fuse_unlink()                  |
+ |      [get request from             |
+ |       fc->unused_list]             |
+ |      >request_send()               |
+ |        [queue req on fc->pending]  |
+ |        [wake up fc->waitq]         |        [woken up]
+ |        >request_wait_answer()      |
+ |          [sleep on req->waitq]     |
+ |                                    |      <request_wait()
+ |                                    |      [remove req from fc->pending]
+ |                                    |      [copy req to read buffer]
+ |                                    |      [add req to fc->processing]
+ |                                    |    <fuse_dev_read()
+ |                                    |  <sys_read()
+ |                                    |
+ |                                    |  [perform unlink]
+ |                                    |
+ |                                    |  >sys_write()
+ |                                    |    >fuse_dev_write()
+ |                                    |      [look up req in fc->processing]
+ |                                    |      [remove from fc->processing]
+ |                                    |      [copy write buffer to req]
+ |          [woken up]                |      [wake up req->waitq]
+ |                                    |    <fuse_dev_write()
+ |                                    |  <sys_write()
+ |        <request_wait_answer()      |
+ |      <request_send()               |
+ |      [add request to               |
+ |       fc->unused_list]             |
+ |    <fuse_unlink()                  |
+ |  <sys_unlink()                     |
+
+.. note:: Everything in the description above is greatly simplified
+
+There are a couple of ways in which to deadlock a FUSE filesystem.
+Since we are talking about unprivileged userspace programs,
+something must be done about these.
+
+**Scenario 1 -  Simple deadlock**::
+
+ |  "rm /mnt/fuse/file"               |  FUSE filesystem daemon
+ |                                    |
+ |  >sys_unlink("/mnt/fuse/file")     |
+ |    [acquire inode semaphore        |
+ |     for "file"]                    |
+ |    >fuse_unlink()                  |
+ |      [sleep on req->waitq]         |
+ |                                    |  <sys_read()
+ |                                    |  >sys_unlink("/mnt/fuse/file")
+ |                                    |    [acquire inode semaphore
+ |                                    |     for "file"]
+ |                                    |    *DEADLOCK*
+
+The solution for this is to allow the filesystem to be aborted.
+
+**Scenario 2 - Tricky deadlock**
+
+
+This one needs a carefully crafted filesystem.  It's a variation on
+the above, only the call back to the filesystem is not explicit,
+but is caused by a pagefault. ::
+
+ |  Kamikaze filesystem thread 1      |  Kamikaze filesystem thread 2
+ |                                    |
+ |  [fd = open("/mnt/fuse/file")]     |  [request served normally]
+ |  [mmap fd to 'addr']               |
+ |  [close fd]                        |  [FLUSH triggers 'magic' flag]
+ |  [read a byte from addr]           |
+ |    >do_page_fault()                |
+ |      [find or create page]         |
+ |      [lock page]                   |
+ |      >fuse_readpage()              |
+ |         [queue READ request]       |
+ |         [sleep on req->waitq]      |
+ |                                    |  [read request to buffer]
+ |                                    |  [create reply header before addr]
+ |                                    |  >sys_write(addr - headerlength)
+ |                                    |    >fuse_dev_write()
+ |                                    |      [look up req in fc->processing]
+ |                                    |      [remove from fc->processing]
+ |                                    |      [copy write buffer to req]
+ |                                    |        >do_page_fault()
+ |                                    |           [find or create page]
+ |                                    |           [lock page]
+ |                                    |           * DEADLOCK *
+
+The solution is basically the same as above.
+
+An additional problem is that while the write buffer is being copied
+to the request, the request must not be interrupted/aborted.  This is
+because the destination address of the copy may not be valid after the
+request has returned.
+
+This is solved with doing the copy atomically, and allowing abort
+while the page(s) belonging to the write buffer are faulted with
+get_user_pages().  The 'req->locked' flag indicates when the copy is
+taking place, and abort is delayed until this flag is unset.
diff --git a/Documentation/filesystems/fuse/index.rst b/Documentation/filesystems/fuse/index.rst
new file mode 100644
index 000000000000..393a845214da
--- /dev/null
+++ b/Documentation/filesystems/fuse/index.rst
@@ -0,0 +1,14 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======================================================
+FUSE (Filesystem in Userspace) Technical Documentation
+======================================================
+
+.. toctree::
+   :maxdepth: 2
+   :numbered:
+
+   fuse
+   fuse-io
+   fuse-io-uring
+   fuse-passthrough