diff options
Diffstat (limited to 'Documentation/filesystems')
-rw-r--r-- | Documentation/filesystems/fuse-passthrough.rst | 133 | ||||
-rw-r--r-- | Documentation/filesystems/index.rst | 1 | ||||
-rw-r--r-- | Documentation/filesystems/netfs_library.rst | 5 | ||||
-rw-r--r-- | Documentation/filesystems/overlayfs.rst | 7 | ||||
-rw-r--r-- | Documentation/filesystems/proc.rst | 4 | ||||
-rw-r--r-- | Documentation/filesystems/relay.rst | 10 | ||||
-rw-r--r-- | Documentation/filesystems/smb/index.rst | 1 | ||||
-rw-r--r-- | Documentation/filesystems/smb/smbdirect.rst | 103 |
8 files changed, 248 insertions, 16 deletions
diff --git a/Documentation/filesystems/fuse-passthrough.rst b/Documentation/filesystems/fuse-passthrough.rst new file mode 100644 index 000000000000..2b0e7c2da54a --- /dev/null +++ b/Documentation/filesystems/fuse-passthrough.rst @@ -0,0 +1,133 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================ +FUSE Passthrough +================ + +Introduction +============ + +FUSE (Filesystem in Userspace) passthrough is a feature designed to improve the +performance of FUSE filesystems for I/O operations. Typically, FUSE operations +involve communication between the kernel and a userspace FUSE daemon, which can +incur overhead. Passthrough allows certain operations on a FUSE file to bypass +the userspace daemon and be executed directly by the kernel on an underlying +"backing file". + +This is achieved by the FUSE daemon registering a file descriptor (pointing to +the backing file on a lower filesystem) with the FUSE kernel module. The kernel +then receives an identifier (``backing_id``) for this registered backing file. +When a FUSE file is subsequently opened, the FUSE daemon can, in its response to +the ``OPEN`` request, include this ``backing_id`` and set the +``FOPEN_PASSTHROUGH`` flag. This establishes a direct link for specific +operations. + +Currently, passthrough is supported for operations like ``read(2)``/``write(2)`` +(via ``read_iter``/``write_iter``), ``splice(2)``, and ``mmap(2)``. + +Enabling Passthrough +==================== + +To use FUSE passthrough: + + 1. The FUSE filesystem must be compiled with ``CONFIG_FUSE_PASSTHROUGH`` + enabled. + 2. The FUSE daemon, during the ``FUSE_INIT`` handshake, must negotiate the + ``FUSE_PASSTHROUGH`` capability and specify its desired + ``max_stack_depth``. + 3. The (privileged) FUSE daemon uses the ``FUSE_DEV_IOC_BACKING_OPEN`` ioctl + on its connection file descriptor (e.g., ``/dev/fuse``) to register a + backing file descriptor and obtain a ``backing_id``. + 4. When handling an ``OPEN`` or ``CREATE`` request for a FUSE file, the daemon + replies with the ``FOPEN_PASSTHROUGH`` flag set in + ``fuse_open_out::open_flags`` and provides the corresponding ``backing_id`` + in ``fuse_open_out::backing_id``. + 5. The FUSE daemon should eventually call ``FUSE_DEV_IOC_BACKING_CLOSE`` with + the ``backing_id`` to release the kernel's reference to the backing file + when it's no longer needed for passthrough setups. + +Privilege Requirements +====================== + +Setting up passthrough functionality currently requires the FUSE daemon to +possess the ``CAP_SYS_ADMIN`` capability. This requirement stems from several +security and resource management considerations that are actively being +discussed and worked on. The primary reasons for this restriction are detailed +below. + +Resource Accounting and Visibility +---------------------------------- + +The core mechanism for passthrough involves the FUSE daemon opening a file +descriptor to a backing file and registering it with the FUSE kernel module via +the ``FUSE_DEV_IOC_BACKING_OPEN`` ioctl. This ioctl returns a ``backing_id`` +associated with a kernel-internal ``struct fuse_backing`` object, which holds a +reference to the backing ``struct file``. + +A significant concern arises because the FUSE daemon can close its own file +descriptor to the backing file after registration. The kernel, however, will +still hold a reference to the ``struct file`` via the ``struct fuse_backing`` +object as long as it's associated with a ``backing_id`` (or subsequently, with +an open FUSE file in passthrough mode). + +This behavior leads to two main issues for unprivileged FUSE daemons: + + 1. **Invisibility to lsof and other inspection tools**: Once the FUSE + daemon closes its file descriptor, the open backing file held by the kernel + becomes "hidden." Standard tools like ``lsof``, which typically inspect + process file descriptor tables, would not be able to identify that this + file is still open by the system on behalf of the FUSE filesystem. This + makes it difficult for system administrators to track resource usage or + debug issues related to open files (e.g., preventing unmounts). + + 2. **Bypassing RLIMIT_NOFILE**: The FUSE daemon process is subject to + resource limits, including the maximum number of open file descriptors + (``RLIMIT_NOFILE``). If an unprivileged daemon could register backing files + and then close its own FDs, it could potentially cause the kernel to hold + an unlimited number of open ``struct file`` references without these being + accounted against the daemon's ``RLIMIT_NOFILE``. This could lead to a + denial-of-service (DoS) by exhausting system-wide file resources. + +The ``CAP_SYS_ADMIN`` requirement acts as a safeguard against these issues, +restricting this powerful capability to trusted processes. + +**NOTE**: ``io_uring`` solves this similar issue by exposing its "fixed files", +which are visible via ``fdinfo`` and accounted under the registering user's +``RLIMIT_NOFILE``. + +Filesystem Stacking and Shutdown Loops +-------------------------------------- + +Another concern relates to the potential for creating complex and problematic +filesystem stacking scenarios if unprivileged users could set up passthrough. +A FUSE passthrough filesystem might use a backing file that resides: + + * On the *same* FUSE filesystem. + * On another filesystem (like OverlayFS) which itself might have an upper or + lower layer that is a FUSE filesystem. + +These configurations could create dependency loops, particularly during +filesystem shutdown or unmount sequences, leading to deadlocks or system +instability. This is conceptually similar to the risks associated with the +``LOOP_SET_FD`` ioctl, which also requires ``CAP_SYS_ADMIN``. + +To mitigate this, FUSE passthrough already incorporates checks based on +filesystem stacking depth (``sb->s_stack_depth`` and ``fc->max_stack_depth``). +For example, during the ``FUSE_INIT`` handshake, the FUSE daemon can negotiate +the ``max_stack_depth`` it supports. When a backing file is registered via +``FUSE_DEV_IOC_BACKING_OPEN``, the kernel checks if the backing file's +filesystem stack depth is within the allowed limit. + +The ``CAP_SYS_ADMIN`` requirement provides an additional layer of security, +ensuring that only privileged users can create these potentially complex +stacking arrangements. + +General Security Posture +------------------------ + +As a general principle for new kernel features that allow userspace to instruct +the kernel to perform direct operations on its behalf based on user-provided +file descriptors, starting with a higher privilege requirement (like +``CAP_SYS_ADMIN``) is a conservative and common security practice. This allows +the feature to be used and tested while further security implications are +evaluated and addressed. diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index 32618512a965..11a599387266 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -99,6 +99,7 @@ Documentation for filesystem implementations. fuse fuse-io fuse-io-uring + fuse-passthrough inotify isofs nilfs2 diff --git a/Documentation/filesystems/netfs_library.rst b/Documentation/filesystems/netfs_library.rst index 939b4b624fad..ddd799df6ce3 100644 --- a/Documentation/filesystems/netfs_library.rst +++ b/Documentation/filesystems/netfs_library.rst @@ -712,11 +712,6 @@ handle falling back from one source type to another. The members are: at a boundary with the filesystem structure (e.g. at the end of a Ceph object). It tells netfslib not to retile subrequests across it. - * ``NETFS_SREQ_SEEK_DATA_READ`` - - This is a hint from netfslib to the cache that it might want to try - skipping ahead to the next data (ie. using SEEK_DATA). - * ``error`` This is for the filesystem to store result of the subrequest. It should be diff --git a/Documentation/filesystems/overlayfs.rst b/Documentation/filesystems/overlayfs.rst index 2db379b4b31e..4133a336486d 100644 --- a/Documentation/filesystems/overlayfs.rst +++ b/Documentation/filesystems/overlayfs.rst @@ -443,6 +443,13 @@ Only the data of the files in the "data-only" lower layers may be visible when a "metacopy" file in one of the lower layers above it, has a "redirect" to the absolute path of the "lower data" file in the "data-only" lower layer. +Instead of explicitly enabling "metacopy=on" it is sufficient to specify at +least one data-only layer to enable redirection of data to a data-only layer. +In this case other forms of metacopy are rejected. Note: this way data-only +layers may be used toghether with "userxattr", in which case careful attention +must be given to privileges needed to change the "user.overlay.redirect" xattr +to prevent misuse. + Since kernel version v6.8, "data-only" lower layers can also be added using the "datadir+" mount options and the fsconfig syscall from new mount api. For example:: diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst index 2a17865dfe39..5236cb52e357 100644 --- a/Documentation/filesystems/proc.rst +++ b/Documentation/filesystems/proc.rst @@ -584,7 +584,6 @@ encoded manner. The codes are the following: ms may share gd stack segment growns down pf pure PFN range - dw disabled write to the mapped file lo pages are locked in memory io memory mapped I/O area sr sequential read advise provided @@ -607,8 +606,11 @@ encoded manner. The codes are the following: mt arm64 MTE allocation tags are enabled um userfaultfd missing tracking uw userfaultfd wr-protect tracking + ui userfaultfd minor fault ss shadow/guarded control stack page sl sealed + lf lock on fault pages + dp always lazily freeable mapping == ======================================= Note that there is no guarantee that every flag and associated mnemonic will diff --git a/Documentation/filesystems/relay.rst b/Documentation/filesystems/relay.rst index 46447dbc75ad..301ff4c6e6c6 100644 --- a/Documentation/filesystems/relay.rst +++ b/Documentation/filesystems/relay.rst @@ -301,16 +301,6 @@ user-defined data with a channel, and is immediately available (including in create_buf_file()) via chan->private_data or buf->chan->private_data. -Buffer-only channels --------------------- - -These channels have no files associated and can be created with -relay_open(NULL, NULL, ...). Such channels are useful in scenarios such -as when doing early tracing in the kernel, before the VFS is up. In these -cases, one may open a buffer-only channel and then call -relay_late_setup_files() when the kernel is ready to handle files, -to expose the buffered data to the userspace. - Channel 'modes' --------------- diff --git a/Documentation/filesystems/smb/index.rst b/Documentation/filesystems/smb/index.rst index 1c8597a679ab..6df23b0e45c8 100644 --- a/Documentation/filesystems/smb/index.rst +++ b/Documentation/filesystems/smb/index.rst @@ -8,3 +8,4 @@ CIFS ksmbd cifsroot + smbdirect diff --git a/Documentation/filesystems/smb/smbdirect.rst b/Documentation/filesystems/smb/smbdirect.rst new file mode 100644 index 000000000000..ca6927c0b2c0 --- /dev/null +++ b/Documentation/filesystems/smb/smbdirect.rst @@ -0,0 +1,103 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=========================== +SMB Direct - SMB3 over RDMA +=========================== + +This document describes how to set up the Linux SMB client and server to +use RDMA. + +Overview +======== +The Linux SMB kernel client supports SMB Direct, which is a transport +scheme for SMB3 that uses RDMA (Remote Direct Memory Access) to provide +high throughput and low latencies by bypassing the traditional TCP/IP +stack. +SMB Direct on the Linux SMB client can be tested against KSMBD - a +kernel-space SMB server. + +Installation +============= +- Install an RDMA device. As long as the RDMA device driver is supported + by the kernel, it should work. This includes both software emulators (soft + RoCE, soft iWARP) and hardware devices (InfiniBand, RoCE, iWARP). + +- Install a kernel with SMB Direct support. The first kernel release to + support SMB Direct on both the client and server side is 5.15. Therefore, + a distribution compatible with kernel 5.15 or later is required. + +- Install cifs-utils, which provides the `mount.cifs` command to mount SMB + shares. + +- Configure the RDMA stack + + Make sure that your kernel configuration has RDMA support enabled. Under + Device Drivers -> Infiniband support, update the kernel configuration to + enable Infiniband support. + + Enable the appropriate IB HCA support or iWARP adapter support, + depending on your hardware. + + If you are using InfiniBand, enable IP-over-InfiniBand support. + + For soft RDMA, enable either the soft iWARP (`RDMA _SIW`) or soft RoCE + (`RDMA_RXE`) module. Install the `iproute2` package and use the + `rdma link add` command to load the module and create an + RDMA interface. + + e.g. if your local ethernet interface is `eth0`, you can use: + + .. code-block:: bash + + sudo rdma link add siw0 type siw netdev eth0 + +- Enable SMB Direct support for both the server and the client in the kernel + configuration. + + Server Setup + + .. code-block:: text + + Network File Systems ---> + <M> SMB3 server support + [*] Support for SMB Direct protocol + + Client Setup + + .. code-block:: text + + Network File Systems ---> + <M> SMB3 and CIFS support (advanced network filesystem) + [*] SMB Direct support + +- Build and install the kernel. SMB Direct support will be enabled in the + cifs.ko and ksmbd.ko modules. + +Setup and Usage +================ + +- Set up and start a KSMBD server as described in the `KSMBD documentation + <https://www.kernel.org/doc/Documentation/filesystems/smb/ksmbd.rst>`_. + Also add the "server multi channel support = yes" parameter to ksmbd.conf. + +- On the client, mount the share with `rdma` mount option to use SMB Direct + (specify a SMB version 3.0 or higher using `vers`). + + For example: + + .. code-block:: bash + + mount -t cifs //server/share /mnt/point -o vers=3.1.1,rdma + +- To verify that the mount is using SMB Direct, you can check dmesg for the + following log line after mounting: + + .. code-block:: text + + CIFS: VFS: RDMA transport established + + Or, verify `rdma` mount option for the share in `/proc/mounts`: + + .. code-block:: bash + + cat /proc/mounts | grep cifs |