Diffstat (limited to 'Documentation/filesystems')
-rw-r--r-- | Documentation/filesystems/debugfs.rst | 19
-rw-r--r-- | Documentation/filesystems/ext4/atomic_writes.rst | 225
-rw-r--r-- | Documentation/filesystems/ext4/overview.rst | 1
-rw-r--r-- | Documentation/filesystems/f2fs.rst | 52
-rw-r--r-- | Documentation/filesystems/fuse-passthrough.rst | 133
-rw-r--r-- | Documentation/filesystems/index.rst | 2
-rw-r--r-- | Documentation/filesystems/netfs_library.rst | 5
-rw-r--r-- | Documentation/filesystems/overlayfs.rst | 7
-rw-r--r-- | Documentation/filesystems/porting.rst | 15
-rw-r--r-- | Documentation/filesystems/proc.rst | 4
-rw-r--r-- | Documentation/filesystems/relay.rst | 36
-rw-r--r-- | Documentation/filesystems/resctrl.rst | 1523
-rw-r--r-- | Documentation/filesystems/smb/index.rst | 1
-rw-r--r-- | Documentation/filesystems/smb/smbdirect.rst | 103
-rw-r--r-- | Documentation/filesystems/vfs.rst | 4
15 files changed, 2060 insertions, 70 deletions
diff --git a/Documentation/filesystems/debugfs.rst b/Documentation/filesystems/debugfs.rst index 610f718ef8b5..55f807293924 100644 --- a/Documentation/filesystems/debugfs.rst +++ b/Documentation/filesystems/debugfs.rst @@ -229,22 +229,15 @@ module is unloaded without explicitly removing debugfs entries, the result will be a lot of stale pointers and no end of highly antisocial behavior. So all debugfs users - at least those which can be built as modules - must be prepared to remove all files and directories they create there. A file -can be removed with:: +or directory can be removed with:: void debugfs_remove(struct dentry *dentry); The dentry value can be NULL or an error value, in which case nothing will -be removed. - -Once upon a time, debugfs users were required to remember the dentry -pointer for every debugfs file they created so that all files could be -cleaned up. We live in more civilized times now, though, and debugfs users -can call:: - - void debugfs_remove_recursive(struct dentry *dentry); - -If this function is passed a pointer for the dentry corresponding to the -top-level directory, the entire hierarchy below that directory will be -removed. +be removed. Note that this function will recursively remove all files and +directories underneath it. Previously, debugfs_remove_recursive() was used +to perform that task, but this function is now just an alias to +debugfs_remove(). debugfs_remove_recursive() should be considered +deprecated. .. [1] http://lwn.net/Articles/309298/ diff --git a/Documentation/filesystems/ext4/atomic_writes.rst b/Documentation/filesystems/ext4/atomic_writes.rst new file mode 100644 index 000000000000..f65767df3620 --- /dev/null +++ b/Documentation/filesystems/ext4/atomic_writes.rst @@ -0,0 +1,225 @@ +.. SPDX-License-Identifier: GPL-2.0 +.. 
_atomic_writes: + +Atomic Block Writes +------------------------- + +Introduction +~~~~~~~~~~~~ + +Atomic (untorn) block writes ensure that either the entire write is committed +to disk or none of it is. This prevents "torn writes" during power loss or +system crashes. The ext4 filesystem supports atomic writes (only with Direct +I/O) on regular files with extents, provided the underlying storage device +supports hardware atomic writes. This is supported in the following two ways: + +1. **Single-fsblock Atomic Writes**: + EXT4 supports atomic write operations with a single filesystem block since + v6.13. In this case the atomic write unit minimum and maximum sizes are both + set to the filesystem blocksize. + For example, an atomic write of 16KB with a 16KB filesystem blocksize on a + 64KB pagesize system is possible. + +2. **Multi-fsblock Atomic Writes with Bigalloc**: + EXT4 now also supports atomic writes spanning multiple filesystem blocks + using a feature known as bigalloc. The atomic write unit's minimum and + maximum sizes are determined by the filesystem block size and cluster size, + based on the underlying device’s supported atomic write unit limits. + +Requirements +~~~~~~~~~~~~ + +Basic requirements for atomic writes in ext4: + + 1. The extents feature must be enabled (default for ext4) + 2. The underlying block device must support atomic writes + 3. For single-fsblock atomic writes: + + 1. A filesystem with appropriate block size (up to the page size) + 4. For multi-fsblock atomic writes: + + 1. The bigalloc feature must be enabled + 2. The cluster size must be appropriately configured + +NOTE: EXT4 does not support software or COW based atomic writes, which means +atomic writes on ext4 are only supported if the underlying storage device +supports them. + +Multi-fsblock Implementation Details +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The bigalloc feature changes ext4 to allocate in units of multiple filesystem +blocks, also known as clusters. 
With bigalloc, each bit within the block bitmap +represents a cluster (a power-of-2 number of blocks) rather than an individual +filesystem block. +EXT4 supports multi-fsblock atomic writes with bigalloc, subject to the +following constraints. The minimum atomic write size is the larger of the fs +block size and the minimum hardware atomic write unit; and the maximum atomic +write size is the smaller of the bigalloc cluster size and the maximum hardware +atomic write unit. Bigalloc ensures that all allocations are aligned to the +cluster size, which satisfies the LBA alignment requirements of the hardware +device if the start of the partition/logical volume is itself aligned correctly. + +Here is the block allocation strategy in bigalloc for atomic writes: + + * For regions with fully mapped extents, no additional work is needed + * For append writes, a new mapped extent is allocated + * For regions that are entirely holes, an unwritten extent is created + * For large unwritten extents, the extent gets split into two unwritten + extents of appropriate requested size + * For mixed mapping regions (combinations of holes, unwritten extents, or + mapped extents), ext4_map_blocks() is called in a loop with + EXT4_GET_BLOCKS_ZERO flag to convert the region into a single contiguous + mapped extent by writing zeroes to it and converting any unwritten extents to + written, if found within the range. + +Note: Writing on a single contiguous underlying extent, whether mapped or +unwritten, is not inherently problematic. However, writing to a mixed mapping +region (i.e. one containing a combination of mapped and unwritten extents) +must be avoided when performing atomic writes. + +The reason is that atomic writes, when issued via pwritev2() with the RWF_ATOMIC +flag, require that either all data is written or none at all. 
In the event of +a system crash or unexpected power loss during the write operation, the affected +region (when later read) must reflect either the complete old data or the +complete new data, but never a mix of both. + +To enforce this guarantee, we ensure that the write target is backed by +a single, contiguous extent before any data is written. This is critical because +ext4 defers the conversion of unwritten extents to written extents until the I/O +completion path (typically in ->end_io()). If a write is allowed to proceed over +a mixed mapping region (with mapped and unwritten extents) and a failure occurs +mid-write, the system could observe partially updated regions after reboot, i.e. +new data over mapped areas, and stale (old) data over unwritten extents that +were never marked written. This violates the atomicity and/or torn write +prevention guarantee. + +To prevent such torn writes, ext4 proactively allocates a single contiguous +extent for the entire requested region in ``ext4_iomap_alloc()`` via +``ext4_map_blocks_atomic()``. EXT4 also force commits the current journalling +transaction if the allocation is done over a mixed mapping. This ensures any +pending metadata updates (like unwritten to written extents conversion) in this +range are in a consistent state with the file data blocks, before performing the +actual write I/O. If the commit fails, the whole I/O must be aborted to prevent +any possible torn writes. +Only after this step, the actual data write operation is performed by the iomap. + +Handling Split Extents Across Leaf Blocks +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +There can be a special edge case where we have logically and physically +contiguous extents stored in separate leaf nodes of the on-disk extent tree. +This occurs because on-disk extent tree merges only happen within the leaf +blocks, except for the case of a 2-level tree, which can get merged and +collapsed entirely into the inode. 
+If such a layout exists and, in the worst case, the extent status cache entries +are reclaimed due to memory pressure, ``ext4_map_blocks()`` may never return +a single contiguous extent for these split leaf extents. + +To address this edge case, a new get block flag +``EXT4_GET_BLOCKS_QUERY_LEAF_BLOCKS`` is added to enhance the +``ext4_map_query_blocks()`` lookup behavior. + +This new get block flag allows ``ext4_map_blocks()`` to first check if there is +an entry in the extent status cache for the full range. +If not present, it consults the on-disk extent tree using +``ext4_map_query_blocks()``. +If the located extent is at the end of a leaf node, it probes the next logical +block (lblk) to detect a contiguous extent in the adjacent leaf. + +For now only one additional leaf block is queried to maintain efficiency, as +atomic writes are typically constrained to small sizes +(e.g. [blocksize, clustersize]). + + +Handling Journal transactions +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +To support multi-fsblock atomic writes, we ensure enough journal credits are +reserved during: + + 1. Block allocation time in ``ext4_iomap_alloc()``. We first query if there + could be a mixed mapping for the underlying requested range. If yes, then we + reserve credits of up to ``m_len``, assuming every alternate block can be + an unwritten extent followed by a hole. + + 2. During ``->end_io()`` call, we make sure a single transaction is started for + doing unwritten-to-written conversion. The loop for conversion is mainly + only required to handle a split extent across leaf blocks. + +How to +------ + +Creating Filesystems with Atomic Write Support +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +First check the atomic write units supported by the block device. +See :ref:`atomic_write_bdev_support` for more details. + +For single-fsblock atomic writes with a larger block size +(on systems with block size < page size): + +.. 
code-block:: bash + + # Create an ext4 filesystem with a 16KB block size + # (requires page size >= 16KB) + mkfs.ext4 -b 16384 /dev/device + +For multi-fsblock atomic writes with bigalloc: + +.. code-block:: bash + + # Create an ext4 filesystem with bigalloc and 64KB cluster size + mkfs.ext4 -F -O bigalloc -b 4096 -C 65536 /dev/device + +Where ``-b`` specifies the block size, ``-C`` specifies the cluster size in bytes, +and ``-O bigalloc`` enables the bigalloc feature. + +Application Interface +~~~~~~~~~~~~~~~~~~~~~ + +Applications can use the ``pwritev2()`` system call with the ``RWF_ATOMIC`` flag +to perform atomic writes: + +.. code-block:: c + + pwritev2(fd, iov, iovcnt, offset, RWF_ATOMIC); + +The write must be aligned to the filesystem's block size and not exceed the +filesystem's maximum atomic write unit size. +See ``generic_atomic_write_valid()`` for more details. + +The ``statx()`` system call with the ``STATX_WRITE_ATOMIC`` flag provides the +following details: + + * ``stx_atomic_write_unit_min``: Minimum size of an atomic write request. + * ``stx_atomic_write_unit_max``: Maximum size of an atomic write request. + * ``stx_atomic_write_segments_max``: Upper limit for segments. The number of + separate memory buffers that can be gathered into a write operation + (e.g., the iovcnt parameter for IOV_ITER). Currently, this is always set to one. + +The STATX_ATTR_WRITE_ATOMIC flag in ``statx->attributes`` is set if atomic +writes are supported. + +.. _atomic_write_bdev_support: + +Hardware Support +---------------- + +The underlying storage device must support atomic write operations. +Modern NVMe and SCSI devices often provide this capability. +The Linux kernel exposes this information through sysfs: + +* ``/sys/block/<device>/queue/atomic_write_unit_min`` - Minimum atomic write size +* ``/sys/block/<device>/queue/atomic_write_unit_max`` - Maximum atomic write size + +Nonzero values for these attributes indicate that the device supports +atomic writes. 
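How these sysfs limits combine with the filesystem geometry described under "Multi-fsblock Implementation Details" can be sketched in C. This is an illustration only, not the ext4 implementation: ``struct awu_limits``, ``awu_derive()`` and ``awu_request_ok()`` are hypothetical names, and the request check mirrors just the two conditions stated in this document (block-size alignment and the unit size range). The kernel may apply further checks.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the derived limits, in bytes. */
struct awu_limits {
	uint32_t min;
	uint32_t max;
};

/*
 * Derive the atomic write unit limits as described in the text:
 *   min = larger of the fs block size and the hardware minimum unit
 *   max = smaller of the bigalloc cluster size and the hardware maximum unit
 * hw_min/hw_max stand for the sysfs atomic_write_unit_min/max values.
 * Returns false if the device reports no support (zero) or the
 * resulting range is empty.
 */
bool awu_derive(uint32_t blocksize, uint32_t clustersize,
		uint32_t hw_min, uint32_t hw_max, struct awu_limits *out)
{
	if (hw_min == 0 || hw_max == 0)
		return false;
	out->min = blocksize > hw_min ? blocksize : hw_min;
	out->max = clustersize < hw_max ? clustersize : hw_max;
	return out->min <= out->max;
}

/*
 * Assumption: checks only what this document states, i.e. the write is
 * aligned to the fs block size and its length lies within [min, max].
 */
bool awu_request_ok(uint64_t offset, uint64_t len, uint32_t blocksize,
		    const struct awu_limits *lim)
{
	if (offset % blocksize || len % blocksize)
		return false;
	return len >= lim->min && len <= lim->max;
}
```

With a 4KB block size, a 64KB cluster size and a device reporting 512 and 16384 bytes, the derived range is 4096 to 16384 bytes. For the single-fsblock case, passing the block size as the cluster size collapses the range to one block.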
+ +See Also +-------- + +* :doc:`bigalloc` - Documentation on the bigalloc feature +* :doc:`allocators` - Documentation on block allocation in ext4 +* Support for atomic block writes in 6.13: + https://lwn.net/Articles/1009298/ diff --git a/Documentation/filesystems/ext4/overview.rst b/Documentation/filesystems/ext4/overview.rst index 0fad6eda6e15..9d4054c17ecb 100644 --- a/Documentation/filesystems/ext4/overview.rst +++ b/Documentation/filesystems/ext4/overview.rst @@ -25,3 +25,4 @@ order. .. include:: inlinedata.rst .. include:: eainode.rst .. include:: verity.rst +.. include:: atomic_writes.rst diff --git a/Documentation/filesystems/f2fs.rst b/Documentation/filesystems/f2fs.rst index e15c4275862a..440e4ae74e44 100644 --- a/Documentation/filesystems/f2fs.rst +++ b/Documentation/filesystems/f2fs.rst @@ -182,32 +182,34 @@ fault_type=%d Support configuring fault injection type, should be enabled with fault_injection option, fault type value is shown below, it supports single or combined type. 
- =========================== =========== + =========================== ========== Type_Name Type_Value - =========================== =========== - FAULT_KMALLOC 0x000000001 - FAULT_KVMALLOC 0x000000002 - FAULT_PAGE_ALLOC 0x000000004 - FAULT_PAGE_GET 0x000000008 - FAULT_ALLOC_BIO 0x000000010 (obsolete) - FAULT_ALLOC_NID 0x000000020 - FAULT_ORPHAN 0x000000040 - FAULT_BLOCK 0x000000080 - FAULT_DIR_DEPTH 0x000000100 - FAULT_EVICT_INODE 0x000000200 - FAULT_TRUNCATE 0x000000400 - FAULT_READ_IO 0x000000800 - FAULT_CHECKPOINT 0x000001000 - FAULT_DISCARD 0x000002000 - FAULT_WRITE_IO 0x000004000 - FAULT_SLAB_ALLOC 0x000008000 - FAULT_DQUOT_INIT 0x000010000 - FAULT_LOCK_OP 0x000020000 - FAULT_BLKADDR_VALIDITY 0x000040000 - FAULT_BLKADDR_CONSISTENCE 0x000080000 - FAULT_NO_SEGMENT 0x000100000 - FAULT_INCONSISTENT_FOOTER 0x000200000 - =========================== =========== + =========================== ========== + FAULT_KMALLOC 0x00000001 + FAULT_KVMALLOC 0x00000002 + FAULT_PAGE_ALLOC 0x00000004 + FAULT_PAGE_GET 0x00000008 + FAULT_ALLOC_BIO 0x00000010 (obsolete) + FAULT_ALLOC_NID 0x00000020 + FAULT_ORPHAN 0x00000040 + FAULT_BLOCK 0x00000080 + FAULT_DIR_DEPTH 0x00000100 + FAULT_EVICT_INODE 0x00000200 + FAULT_TRUNCATE 0x00000400 + FAULT_READ_IO 0x00000800 + FAULT_CHECKPOINT 0x00001000 + FAULT_DISCARD 0x00002000 + FAULT_WRITE_IO 0x00004000 + FAULT_SLAB_ALLOC 0x00008000 + FAULT_DQUOT_INIT 0x00010000 + FAULT_LOCK_OP 0x00020000 + FAULT_BLKADDR_VALIDITY 0x00040000 + FAULT_BLKADDR_CONSISTENCE 0x00080000 + FAULT_NO_SEGMENT 0x00100000 + FAULT_INCONSISTENT_FOOTER 0x00200000 + FAULT_TIMEOUT 0x00400000 (1000ms) + FAULT_VMALLOC 0x00800000 + =========================== ========== mode=%s Control block allocation mode which supports "adaptive" and "lfs". In "lfs" mode, there should be no random writes towards main area. 
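Because each fault type in the table above is a distinct bit, a combined ``fault_type`` value is the bitwise OR of the individual values. The sketch below builds such a mount option string. The ``DOC_FAULT_*`` macro names and the ``fault_type_option()`` helper are made up for illustration (only the numeric values come from the table), and the value is printed in decimal on the assumption that this matches the ``%d`` in ``fault_type=%d``.

```c
#include <stddef.h>
#include <stdio.h>

/* Values copied from the table; the macro names are illustrative only. */
#define DOC_FAULT_KMALLOC    0x00000001u
#define DOC_FAULT_PAGE_ALLOC 0x00000004u
#define DOC_FAULT_TIMEOUT    0x00400000u
#define DOC_FAULT_VMALLOC    0x00800000u

/*
 * Format "fault_type=<mask>" into buf. Returns the number of characters
 * written (excluding the terminating NUL), or -1 if buf is too small.
 */
int fault_type_option(char *buf, size_t len, unsigned int mask)
{
	int n = snprintf(buf, len, "fault_type=%u", mask);

	return (n < 0 || (size_t)n >= len) ? -1 : n;
}
```

For example, combining the FAULT_KMALLOC and FAULT_PAGE_ALLOC values gives ``fault_type=5``. As the option description says, the ``fault_injection`` option must be set as well.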
diff --git a/Documentation/filesystems/fuse-passthrough.rst b/Documentation/filesystems/fuse-passthrough.rst new file mode 100644 index 000000000000..2b0e7c2da54a --- /dev/null +++ b/Documentation/filesystems/fuse-passthrough.rst @@ -0,0 +1,133 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================ +FUSE Passthrough +================ + +Introduction +============ + +FUSE (Filesystem in Userspace) passthrough is a feature designed to improve the +performance of FUSE filesystems for I/O operations. Typically, FUSE operations +involve communication between the kernel and a userspace FUSE daemon, which can +incur overhead. Passthrough allows certain operations on a FUSE file to bypass +the userspace daemon and be executed directly by the kernel on an underlying +"backing file". + +This is achieved by the FUSE daemon registering a file descriptor (pointing to +the backing file on a lower filesystem) with the FUSE kernel module. The kernel +then receives an identifier (``backing_id``) for this registered backing file. +When a FUSE file is subsequently opened, the FUSE daemon can, in its response to +the ``OPEN`` request, include this ``backing_id`` and set the +``FOPEN_PASSTHROUGH`` flag. This establishes a direct link for specific +operations. + +Currently, passthrough is supported for operations like ``read(2)``/``write(2)`` +(via ``read_iter``/``write_iter``), ``splice(2)``, and ``mmap(2)``. + +Enabling Passthrough +==================== + +To use FUSE passthrough: + + 1. The FUSE filesystem must be compiled with ``CONFIG_FUSE_PASSTHROUGH`` + enabled. + 2. The FUSE daemon, during the ``FUSE_INIT`` handshake, must negotiate the + ``FUSE_PASSTHROUGH`` capability and specify its desired + ``max_stack_depth``. + 3. The (privileged) FUSE daemon uses the ``FUSE_DEV_IOC_BACKING_OPEN`` ioctl + on its connection file descriptor (e.g., ``/dev/fuse``) to register a + backing file descriptor and obtain a ``backing_id``. + 4. 
When handling an ``OPEN`` or ``CREATE`` request for a FUSE file, the daemon + replies with the ``FOPEN_PASSTHROUGH`` flag set in + ``fuse_open_out::open_flags`` and provides the corresponding ``backing_id`` + in ``fuse_open_out::backing_id``. + 5. The FUSE daemon should eventually call ``FUSE_DEV_IOC_BACKING_CLOSE`` with + the ``backing_id`` to release the kernel's reference to the backing file + when it's no longer needed for passthrough setups. + +Privilege Requirements +====================== + +Setting up passthrough functionality currently requires the FUSE daemon to +possess the ``CAP_SYS_ADMIN`` capability. This requirement stems from several +security and resource management considerations that are actively being +discussed and worked on. The primary reasons for this restriction are detailed +below. + +Resource Accounting and Visibility +---------------------------------- + +The core mechanism for passthrough involves the FUSE daemon opening a file +descriptor to a backing file and registering it with the FUSE kernel module via +the ``FUSE_DEV_IOC_BACKING_OPEN`` ioctl. This ioctl returns a ``backing_id`` +associated with a kernel-internal ``struct fuse_backing`` object, which holds a +reference to the backing ``struct file``. + +A significant concern arises because the FUSE daemon can close its own file +descriptor to the backing file after registration. The kernel, however, will +still hold a reference to the ``struct file`` via the ``struct fuse_backing`` +object as long as it's associated with a ``backing_id`` (or subsequently, with +an open FUSE file in passthrough mode). + +This behavior leads to two main issues for unprivileged FUSE daemons: + + 1. **Invisibility to lsof and other inspection tools**: Once the FUSE + daemon closes its file descriptor, the open backing file held by the kernel + becomes "hidden." 
Standard tools like ``lsof``, which typically inspect + process file descriptor tables, would not be able to identify that this + file is still open by the system on behalf of the FUSE filesystem. This + makes it difficult for system administrators to track resource usage or + debug issues related to open files (e.g., preventing unmounts). + + 2. **Bypassing RLIMIT_NOFILE**: The FUSE daemon process is subject to + resource limits, including the maximum number of open file descriptors + (``RLIMIT_NOFILE``). If an unprivileged daemon could register backing files + and then close its own FDs, it could potentially cause the kernel to hold + an unlimited number of open ``struct file`` references without these being + accounted against the daemon's ``RLIMIT_NOFILE``. This could lead to a + denial-of-service (DoS) by exhausting system-wide file resources. + +The ``CAP_SYS_ADMIN`` requirement acts as a safeguard against these issues, +restricting this powerful capability to trusted processes. + +**NOTE**: ``io_uring`` solves this similar issue by exposing its "fixed files", +which are visible via ``fdinfo`` and accounted under the registering user's +``RLIMIT_NOFILE``. + +Filesystem Stacking and Shutdown Loops +-------------------------------------- + +Another concern relates to the potential for creating complex and problematic +filesystem stacking scenarios if unprivileged users could set up passthrough. +A FUSE passthrough filesystem might use a backing file that resides: + + * On the *same* FUSE filesystem. + * On another filesystem (like OverlayFS) which itself might have an upper or + lower layer that is a FUSE filesystem. + +These configurations could create dependency loops, particularly during +filesystem shutdown or unmount sequences, leading to deadlocks or system +instability. This is conceptually similar to the risks associated with the +``LOOP_SET_FD`` ioctl, which also requires ``CAP_SYS_ADMIN``. 
+ +To mitigate this, FUSE passthrough already incorporates checks based on +filesystem stacking depth (``sb->s_stack_depth`` and ``fc->max_stack_depth``). +For example, during the ``FUSE_INIT`` handshake, the FUSE daemon can negotiate +the ``max_stack_depth`` it supports. When a backing file is registered via +``FUSE_DEV_IOC_BACKING_OPEN``, the kernel checks if the backing file's +filesystem stack depth is within the allowed limit. + +The ``CAP_SYS_ADMIN`` requirement provides an additional layer of security, +ensuring that only privileged users can create these potentially complex +stacking arrangements. + +General Security Posture +------------------------ + +As a general principle for new kernel features that allow userspace to instruct +the kernel to perform direct operations on its behalf based on user-provided +file descriptors, starting with a higher privilege requirement (like +``CAP_SYS_ADMIN``) is a conservative and common security practice. This allows +the feature to be used and tested while further security implications are +evaluated and addressed. diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index a9cf8e950b15..11a599387266 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -99,6 +99,7 @@ Documentation for filesystem implementations. fuse fuse-io fuse-io-uring + fuse-passthrough inotify isofs nilfs2 @@ -113,6 +114,7 @@ Documentation for filesystem implementations. qnx6 ramfs-rootfs-initramfs relay + resctrl romfs smb/index spufs/index diff --git a/Documentation/filesystems/netfs_library.rst b/Documentation/filesystems/netfs_library.rst index 939b4b624fad..ddd799df6ce3 100644 --- a/Documentation/filesystems/netfs_library.rst +++ b/Documentation/filesystems/netfs_library.rst @@ -712,11 +712,6 @@ handle falling back from one source type to another. The members are: at a boundary with the filesystem structure (e.g. at the end of a Ceph object). 
It tells netfslib not to retile subrequests across it. - * ``NETFS_SREQ_SEEK_DATA_READ`` - - This is a hint from netfslib to the cache that it might want to try - skipping ahead to the next data (ie. using SEEK_DATA). - * ``error`` This is for the filesystem to store result of the subrequest. It should be diff --git a/Documentation/filesystems/overlayfs.rst b/Documentation/filesystems/overlayfs.rst index 2db379b4b31e..4133a336486d 100644 --- a/Documentation/filesystems/overlayfs.rst +++ b/Documentation/filesystems/overlayfs.rst @@ -443,6 +443,13 @@ Only the data of the files in the "data-only" lower layers may be visible when a "metacopy" file in one of the lower layers above it, has a "redirect" to the absolute path of the "lower data" file in the "data-only" lower layer. +Instead of explicitly enabling "metacopy=on" it is sufficient to specify at +least one data-only layer to enable redirection of data to a data-only layer. +In this case other forms of metacopy are rejected. Note: this way data-only +layers may be used together with "userxattr", in which case careful attention +must be given to privileges needed to change the "user.overlay.redirect" xattr +to prevent misuse. + Since kernel version v6.8, "data-only" lower layers can also be added using the "datadir+" mount options and the fsconfig syscall from new mount api. For example:: diff --git a/Documentation/filesystems/porting.rst b/Documentation/filesystems/porting.rst index 3111ef5592f3..a5734bdd1cc7 100644 --- a/Documentation/filesystems/porting.rst +++ b/Documentation/filesystems/porting.rst @@ -1243,3 +1243,18 @@ arguments in the opposite order but is otherwise identical. Using try_lookup_noperm() will require linux/namei.h to be included. +--- + +**mandatory** + +Calling conventions for ->d_automount() have changed; we should *not* grab +an extra reference to new mount - it should be returned with refcount 1. + +--- + +collect_mounts()/drop_collected_mounts()/iterate_mounts() are gone now. 
+Replacement is collect_paths()/drop_collected_path(), with no special +iterator needed. Instead of a cloned mount tree, the new interface returns +an array of struct path, one for each mount collect_mounts() would've +created. These struct path point to locations in the caller's namespace +that would be roots of the cloned mounts. diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst index 2a17865dfe39..5236cb52e357 100644 --- a/Documentation/filesystems/proc.rst +++ b/Documentation/filesystems/proc.rst @@ -584,7 +584,6 @@ encoded manner. The codes are the following: ms may share gd stack segment growns down pf pure PFN range - dw disabled write to the mapped file lo pages are locked in memory io memory mapped I/O area sr sequential read advise provided @@ -607,8 +606,11 @@ encoded manner. The codes are the following: mt arm64 MTE allocation tags are enabled um userfaultfd missing tracking uw userfaultfd wr-protect tracking + ui userfaultfd minor fault ss shadow/guarded control stack page sl sealed + lf lock on fault pages + dp always lazily freeable mapping == ======================================= Note that there is no guarantee that every flag and associated mnemonic will diff --git a/Documentation/filesystems/relay.rst b/Documentation/filesystems/relay.rst index 04ad083cfe62..301ff4c6e6c6 100644 --- a/Documentation/filesystems/relay.rst +++ b/Documentation/filesystems/relay.rst @@ -32,7 +32,7 @@ functions in the relay interface code - please see that for details. Semantics ========= -Each relay channel has one buffer per CPU, each buffer has one or more +Each relay channel has one buffer per CPU; each buffer has one or more sub-buffers. Messages are written to the first sub-buffer until it is too full to contain a new message, in which case it is written to the next (if available). Messages are never split across sub-buffers. 
@@ -40,7 +40,7 @@ At this point, userspace can be notified so it empties the first sub-buffer, while the kernel continues writing to the next. When notified that a sub-buffer is full, the kernel knows how many -bytes of it are padding i.e. unused space occurring because a complete +bytes of it are padding, i.e., unused space occurring because a complete message couldn't fit into a sub-buffer. Userspace can use this knowledge to copy only valid data. @@ -71,7 +71,7 @@ klog and relay-apps example code ================================ The relay interface itself is ready to use, but to make things easier, -a couple simple utility functions and a set of examples are provided. +a couple of simple utility functions and a set of examples are provided. The relay-apps example tarball, available on the relay sourceforge site, contains a set of self-contained examples, each consisting of a @@ -91,7 +91,7 @@ registered will data actually be logged (see the klog and kleak examples for details). It is of course possible to use the relay interface from scratch, -i.e. without using any of the relay-apps example code or klog, but +i.e., without using any of the relay-apps example code or klog, but you'll have to implement communication between userspace and kernel, allowing both to convey the state of buffers (full, empty, amount of padding). The read() interface both removes padding and internally @@ -119,7 +119,7 @@ mmap() results in channel buffer being mapped into the caller's must map the entire file, which is NRBUF * SUBBUFSIZE. read() read the contents of a channel buffer. The bytes read are - 'consumed' by the reader, i.e. they won't be available + 'consumed' by the reader, i.e., they won't be available again to subsequent reads. If the channel is being used in no-overwrite mode (the default), it can be read at any time even if there's an active kernel writer. If the @@ -138,7 +138,7 @@ poll() POLLIN/POLLRDNORM/POLLERR supported. 
User applications are notified when sub-buffer boundaries are crossed. close() decrements the channel buffer's refcount. When the refcount - reaches 0, i.e. when no process or kernel client has the + reaches 0, i.e., when no process or kernel client has the buffer open, the channel buffer is freed. =========== ============================================================ @@ -149,7 +149,7 @@ host filesystem must be mounted. For example:: .. Note:: - the host filesystem doesn't need to be mounted for kernel + The host filesystem doesn't need to be mounted for kernel clients to create or use channels - it only needs to be mounted when user space applications need access to the buffer data. @@ -301,16 +301,6 @@ user-defined data with a channel, and is immediately available (including in create_buf_file()) via chan->private_data or buf->chan->private_data. -Buffer-only channels --------------------- - -These channels have no files associated and can be created with -relay_open(NULL, NULL, ...). Such channels are useful in scenarios such -as when doing early tracing in the kernel, before the VFS is up. In these -cases, one may open a buffer-only channel and then call -relay_late_setup_files() when the kernel is ready to handle files, -to expose the buffered data to the userspace. - Channel 'modes' --------------- @@ -325,7 +315,7 @@ section, as it pertains mainly to mmap() implementations. In 'overwrite' mode, also known as 'flight recorder' mode, writes continuously cycle around the buffer and will never fail, but will unconditionally overwrite old data regardless of whether it's actually -been consumed. In no-overwrite mode, writes will fail, i.e. data will +been consumed. In no-overwrite mode, writes will fail, i.e., data will be lost, if the number of unconsumed sub-buffers equals the total number of sub-buffers in the channel. 
It should be clear that if there is no consumer or if the consumer can't consume sub-buffers fast @@ -344,7 +334,7 @@ initialize the next sub-buffer if appropriate 2) finalize the previous sub-buffer if appropriate and 3) return a boolean value indicating whether or not to actually move on to the next sub-buffer. -To implement 'no-overwrite' mode, the userspace client would provide +To implement 'no-overwrite' mode, the userspace client provides an implementation of the subbuf_start() callback something like the following:: @@ -364,9 +354,9 @@ following:: return 1; } -If the current buffer is full, i.e. all sub-buffers remain unconsumed, +If the current buffer is full, i.e., all sub-buffers remain unconsumed, the callback returns 0 to indicate that the buffer switch should not -occur yet, i.e. until the consumer has had a chance to read the +occur yet, i.e., until the consumer has had a chance to read the current set of ready sub-buffers. For the relay_buf_full() function to make sense, the consumer is responsible for notifying the relay interface when sub-buffers have been consumed via @@ -400,7 +390,7 @@ consulted. The default subbuf_start() implementation, used if the client doesn't define any callbacks, or doesn't define the subbuf_start() callback, -implements the simplest possible 'no-overwrite' mode, i.e. it does +implements the simplest possible 'no-overwrite' mode, i.e., it does nothing but return 0. Header information can be reserved at the beginning of each sub-buffer @@ -467,7 +457,7 @@ rather than open and close a new channel for each use. relay_reset() can be used for this purpose - it resets a channel to its initial state without reallocating channel buffer memory or destroying existing mappings. It should however only be called when it's safe to -do so, i.e. when the channel isn't currently being written to. +do so, i.e., when the channel isn't currently being written to. 
Finally, there are a couple of utility callbacks that can be used for different purposes. buf_mapped() is called whenever a channel buffer diff --git a/Documentation/filesystems/resctrl.rst b/Documentation/filesystems/resctrl.rst new file mode 100644 index 000000000000..c7949dd44f2f --- /dev/null +++ b/Documentation/filesystems/resctrl.rst @@ -0,0 +1,1523 @@ +.. SPDX-License-Identifier: GPL-2.0 +.. include:: <isonum.txt> + +===================================================== +User Interface for Resource Control feature (resctrl) +===================================================== + +:Copyright: |copy| 2016 Intel Corporation +:Authors: - Fenghua Yu <fenghua.yu@intel.com> + - Tony Luck <tony.luck@intel.com> + - Vikas Shivappa <vikas.shivappa@intel.com> + + +Intel refers to this feature as Intel Resource Director Technology(Intel(R) RDT). +AMD refers to this feature as AMD Platform Quality of Service(AMD QoS). + +This feature is enabled by the CONFIG_X86_CPU_RESCTRL and the x86 /proc/cpuinfo +flag bits: + +=============================================== ================================ +RDT (Resource Director Technology) Allocation "rdt_a" +CAT (Cache Allocation Technology) "cat_l3", "cat_l2" +CDP (Code and Data Prioritization) "cdp_l3", "cdp_l2" +CQM (Cache QoS Monitoring) "cqm_llc", "cqm_occup_llc" +MBM (Memory Bandwidth Monitoring) "cqm_mbm_total", "cqm_mbm_local" +MBA (Memory Bandwidth Allocation) "mba" +SMBA (Slow Memory Bandwidth Allocation) "" +BMEC (Bandwidth Monitoring Event Configuration) "" +=============================================== ================================ + +Historically, new features were made visible by default in /proc/cpuinfo. This +resulted in the feature flags becoming hard to parse by humans. Adding a new +flag to /proc/cpuinfo should be avoided if user space can obtain information +about the feature from resctrl's info directory. 
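As a rough illustration of the flag table above, the following sketch lists which of the resctrl-related flags a CPU advertises. The function name ``rdt_flags`` and its optional file argument are illustrative assumptions, not part of resctrl. SMBA and BMEC have no flag, so they cannot be detected this way and should be looked up in the info directory instead.

```shell
# List the resctrl-related flags found in a cpuinfo-style file.
# rdt_flags and its optional file argument are hypothetical helpers;
# the file defaults to /proc/cpuinfo.
rdt_flags() {
    # Take the first "flags" line, put one flag per line, and keep only
    # the names listed in the table above (exact, whole-word matches).
    grep -m1 '^flags' "${1:-/proc/cpuinfo}" | tr ' \t' '\n\n' |
        grep -x -E 'rdt_a|cat_l[23]|cdp_l[23]|cqm_llc|cqm_occup_llc|cqm_mbm_(total|local)|mba' |
        sort -u
}
```

Calling ``rdt_flags`` with no argument inspects the running system; an empty result suggests that none of the flagged features is advertised.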
+ +To use the feature mount the file system:: + + # mount -t resctrl resctrl [-o cdp[,cdpl2][,mba_MBps][,debug]] /sys/fs/resctrl + +mount options are: + +"cdp": + Enable code/data prioritization in L3 cache allocations. +"cdpl2": + Enable code/data prioritization in L2 cache allocations. +"mba_MBps": + Enable the MBA Software Controller(mba_sc) to specify MBA + bandwidth in MiBps +"debug": + Make debug files accessible. Available debug files are annotated with + "Available only with debug option". + +L2 and L3 CDP are controlled separately. + +RDT features are orthogonal. A particular system may support only +monitoring, only control, or both monitoring and control. Cache +pseudo-locking is a unique way of using cache control to "pin" or +"lock" data in the cache. Details can be found in +"Cache Pseudo-Locking". + + +The mount succeeds if either of allocation or monitoring is present, but +only those files and directories supported by the system will be created. +For more details on the behavior of the interface during monitoring +and allocation, see the "Resource alloc and monitor groups" section. + +Info directory +============== + +The 'info' directory contains information about the enabled +resources. Each resource has its own subdirectory. The subdirectory +names reflect the resource names. + +Each subdirectory contains the following files with respect to +allocation: + +Cache resource(L3/L2) subdirectory contains the following files +related to allocation: + +"num_closids": + The number of CLOSIDs which are valid for this + resource. The kernel uses the smallest number of + CLOSIDs of all enabled resources as limit. +"cbm_mask": + The bitmask which is valid for this resource. + This mask is equivalent to 100%. +"min_cbm_bits": + The minimum number of consecutive bits which + must be set when writing a mask. + +"shareable_bits": + Bitmask of shareable resource with other executing + entities (e.g. I/O). 
User can use this when + setting up exclusive cache partitions. Note that + some platforms support devices that have their + own settings for cache use which can over-ride + these bits. +"bit_usage": + Annotated capacity bitmasks showing how all + instances of the resource are used. The legend is: + + "0": + Corresponding region is unused. When the system's + resources have been allocated and a "0" is found + in "bit_usage" it is a sign that resources are + wasted. + + "H": + Corresponding region is used by hardware only + but available for software use. If a resource + has bits set in "shareable_bits" but not all + of these bits appear in the resource groups' + schematas then the bits appearing in + "shareable_bits" but no resource group will + be marked as "H". + "X": + Corresponding region is available for sharing and + used by hardware and software. These are the + bits that appear in "shareable_bits" as + well as a resource group's allocation. + "S": + Corresponding region is used by software + and available for sharing. + "E": + Corresponding region is used exclusively by + one resource group. No sharing allowed. + "P": + Corresponding region is pseudo-locked. No + sharing allowed. +"sparse_masks": + Indicates if non-contiguous 1s value in CBM is supported. + + "0": + Only contiguous 1s value in CBM is supported. + "1": + Non-contiguous 1s value in CBM is supported. + +Memory bandwidth(MB) subdirectory contains the following files +with respect to allocation: + +"min_bandwidth": + The minimum memory bandwidth percentage which + user can request. + +"bandwidth_gran": + The granularity in which the memory bandwidth + percentage is allocated. The allocated + b/w percentage is rounded off to the next + control step available on the hardware. The + available bandwidth control steps are: + min_bandwidth + N * bandwidth_gran. + +"delay_linear": + Indicates if the delay scale is linear or + non-linear. This field is purely informational + only. 
+ +"thread_throttle_mode": + Indicator on Intel systems of how tasks running on threads + of a physical core are throttled in cases where they + request different memory bandwidth percentages: + + "max": + the smallest percentage is applied + to all threads + "per-thread": + bandwidth percentages are directly applied to + the threads running on the core + +If RDT monitoring is available there will be an "L3_MON" directory +with the following files: + +"num_rmids": + The number of RMIDs available. This is the + upper bound for how many "CTRL_MON" + "MON" + groups can be created. + +"mon_features": + Lists the monitoring events if + monitoring is enabled for the resource. + Example:: + + # cat /sys/fs/resctrl/info/L3_MON/mon_features + llc_occupancy + mbm_total_bytes + mbm_local_bytes + + If the system supports Bandwidth Monitoring Event + Configuration (BMEC), then the bandwidth events will + be configurable. The output will be:: + + # cat /sys/fs/resctrl/info/L3_MON/mon_features + llc_occupancy + mbm_total_bytes + mbm_total_bytes_config + mbm_local_bytes + mbm_local_bytes_config + +"mbm_total_bytes_config", "mbm_local_bytes_config": + Read/write files containing the configuration for the mbm_total_bytes + and mbm_local_bytes events, respectively, when the Bandwidth + Monitoring Event Configuration (BMEC) feature is supported. + The event configuration settings are domain specific and affect + all the CPUs in the domain. When either event configuration is + changed, the bandwidth counters for all RMIDs of both events + (mbm_total_bytes as well as mbm_local_bytes) are cleared for that + domain. The next read for every RMID will report "Unavailable" + and subsequent reads will report the valid value. 
+ + Following are the types of events supported: + + ==== ======================================================== + Bits Description + ==== ======================================================== + 6 Dirty Victims from the QOS domain to all types of memory + 5 Reads to slow memory in the non-local NUMA domain + 4 Reads to slow memory in the local NUMA domain + 3 Non-temporal writes to non-local NUMA domain + 2 Non-temporal writes to local NUMA domain + 1 Reads to memory in the non-local NUMA domain + 0 Reads to memory in the local NUMA domain + ==== ======================================================== + + By default, the mbm_total_bytes configuration is set to 0x7f to count + all the event types and the mbm_local_bytes configuration is set to + 0x15 to count all the local memory events. + + Examples: + + * To view the current configuration: + :: + + # cat /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config + 0=0x7f;1=0x7f;2=0x7f;3=0x7f + + # cat /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config + 0=0x15;1=0x15;3=0x15;4=0x15 + + * To change the mbm_total_bytes to count only reads on domain 0, + the bits 0, 1, 4 and 5 need to be set, which is 110011b in binary + (in hexadecimal 0x33): + :: + + # echo "0=0x33" > /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config + + # cat /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config + 0=0x33;1=0x7f;2=0x7f;3=0x7f + + * To change the mbm_local_bytes to count all the slow memory reads on + domain 0 and 1, the bits 4 and 5 need to be set, which is 110000b + in binary (in hexadecimal 0x30): + :: + + # echo "0=0x30;1=0x30" > /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config + + # cat /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config + 0=0x30;1=0x30;3=0x15;4=0x15 + +"max_threshold_occupancy": + Read/write file that provides the largest value (in + bytes) at which a previously used LLC_occupancy + counter can be considered for re-use. + +Finally, in the top level of the "info" directory there is a file +named "last_cmd_status".
This is reset with every "command" issued +via the file system (making new directories or writing to any of the +control files). If the command was successful, it will read as "ok". +If the command failed, it will provide more information that can be +conveyed in the error returns from file operations. E.g. +:: + + # echo L3:0=f7 > schemata + bash: echo: write error: Invalid argument + # cat info/last_cmd_status + mask f7 has non-consecutive 1-bits + +Resource alloc and monitor groups +================================= + +Resource groups are represented as directories in the resctrl file +system. The default group is the root directory which, immediately +after mounting, owns all the tasks and cpus in the system and can make +full use of all resources. + +On a system with RDT control features additional directories can be +created in the root directory that specify different amounts of each +resource (see "schemata" below). The root and these additional top level +directories are referred to as "CTRL_MON" groups below. + +On a system with RDT monitoring the root directory and other top level +directories contain a directory named "mon_groups" in which additional +directories can be created to monitor subsets of tasks in the CTRL_MON +group that is their ancestor. These are called "MON" groups in the rest +of this document. + +Removing a directory will move all tasks and cpus owned by the group it +represents to the parent. Removing one of the created CTRL_MON groups +will automatically remove all MON groups below it. + +Moving MON group directories to a new parent CTRL_MON group is supported +for the purpose of changing the resource allocations of a MON group +without impacting its monitoring data or assigned tasks. This operation +is not allowed for MON groups which monitor CPUs. No other move +operation is currently allowed other than simply renaming a CTRL_MON or +MON group. 
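The MON group move described above can be sketched as a plain rename. This is a minimal sketch under assumptions: the helper name ``move_mon_group`` and its parameters are hypothetical, and it presumes that both CTRL_MON groups already exist under the same resctrl mount.

```shell
# Move a MON group to a different parent CTRL_MON group.
# move_mon_group, root, from_ctrl, mon and to_ctrl are illustrative names;
# root is the resctrl mount point (for example /sys/fs/resctrl).
move_mon_group() {
    root=$1 from_ctrl=$2 mon=$3 to_ctrl=$4
    # A rename within the same file system; the group keeps its name.
    mv "${root}/${from_ctrl}/mon_groups/${mon}" \
       "${root}/${to_ctrl}/mon_groups/${mon}"
}
```

For example, ``move_mon_group /sys/fs/resctrl p0 m1 p1`` would place MON group "m1" under CTRL_MON group "p1". If the move is rejected, info/last_cmd_status is the place to look for the reason.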
+ +All groups contain the following files: + +"tasks": + Reading this file shows the list of all tasks that belong to + this group. Writing a task id to the file will add a task to the + group. Multiple tasks can be added by separating the task ids + with commas. Tasks will be assigned sequentially. Multiple + failures are not supported. A single failure encountered while + attempting to assign a task will cause the operation to abort and + already added tasks before the failure will remain in the group. + Failures will be logged to /sys/fs/resctrl/info/last_cmd_status. + + If the group is a CTRL_MON group the task is removed from + whichever previous CTRL_MON group owned the task and also from + any MON group that owned the task. If the group is a MON group, + then the task must already belong to the CTRL_MON parent of this + group. The task is removed from any previous MON group. + + +"cpus": + Reading this file shows a bitmask of the logical CPUs owned by + this group. Writing a mask to this file will add and remove + CPUs to/from this group. As with the tasks file a hierarchy is + maintained where MON groups may only include CPUs owned by the + parent CTRL_MON group. + When the resource group is in pseudo-locked mode this file will + only be readable, reflecting the CPUs associated with the + pseudo-locked region. + + +"cpus_list": + Just like "cpus", only using ranges of CPUs instead of bitmasks. + + +When control is enabled all CTRL_MON groups will also contain: + +"schemata": + A list of all the resources available to this group. + Each resource has its own line and format - see below for details. + +"size": + Mirrors the display of the "schemata" file to display the size in + bytes of each allocation instead of the bits representing the + allocation. + +"mode": + The "mode" of the resource group dictates the sharing of its + allocations. A "shareable" resource group allows sharing of its + allocations while an "exclusive" resource group does not. 
A + cache pseudo-locked region is created by first writing + "pseudo-locksetup" to the "mode" file before writing the cache + pseudo-locked region's schemata to the resource group's "schemata" + file. On successful pseudo-locked region creation the mode will + automatically change to "pseudo-locked". + +"ctrl_hw_id": + Available only with debug option. The identifier used by hardware + for the control group. On x86 this is the CLOSID. + +When monitoring is enabled all MON groups will also contain: + +"mon_data": + This contains a set of files organized by L3 domain and by + RDT event. E.g. on a system with two L3 domains there will + be subdirectories "mon_L3_00" and "mon_L3_01". Each of these + directories have one file per event (e.g. "llc_occupancy", + "mbm_total_bytes", and "mbm_local_bytes"). In a MON group these + files provide a read out of the current value of the event for + all tasks in the group. In CTRL_MON groups these files provide + the sum for all tasks in the CTRL_MON group and all tasks in + MON groups. Please see example section for more details on usage. + On systems with Sub-NUMA Cluster (SNC) enabled there are extra + directories for each node (located within the "mon_L3_XX" directory + for the L3 cache they occupy). These are named "mon_sub_L3_YY" + where "YY" is the node number. + +"mon_hw_id": + Available only with debug option. The identifier used by hardware + for the monitor group. On x86 this is the RMID. + +When the "mba_MBps" mount option is used all CTRL_MON groups will also contain: + +"mba_MBps_event": + Reading this file shows which memory bandwidth event is used + as input to the software feedback loop that keeps memory bandwidth + below the value specified in the schemata file. Writing the + name of one of the supported memory bandwidth events found in + /sys/fs/resctrl/info/L3_MON/mon_features changes the input + event. 
+ +Resource allocation rules +------------------------- + +When a task is running the following rules define which resources are +available to it: + +1) If the task is a member of a non-default group, then the schemata + for that group is used. + +2) Else if the task belongs to the default group, but is running on a + CPU that is assigned to some specific group, then the schemata for the + CPU's group is used. + +3) Otherwise the schemata for the default group is used. + +Resource monitoring rules +------------------------- +1) If a task is a member of a MON group, or non-default CTRL_MON group + then RDT events for the task will be reported in that group. + +2) If a task is a member of the default CTRL_MON group, but is running + on a CPU that is assigned to some specific group, then the RDT events + for the task will be reported in that group. + +3) Otherwise RDT events for the task will be reported in the root level + "mon_data" group. + + +Notes on cache occupancy monitoring and control +=============================================== +When moving a task from one group to another you should remember that +this only affects *new* cache allocations by the task. E.g. you may have +a task in a monitor group showing 3 MB of cache occupancy. If you move +to a new group and immediately check the occupancy of the old and new +groups you will likely see that the old group is still showing 3 MB and +the new group zero. When the task accesses locations still in cache from +before the move, the h/w does not update any counters. On a busy system +you will likely see the occupancy in the old group go down as cache lines +are evicted and re-used while the occupancy in the new group rises as +the task accesses memory and loads into the cache are counted based on +membership in the new group. + +The same applies to cache allocation control. Moving a task to a group +with a smaller cache partition will not evict any cache lines. 
The +process may continue to use them from the old partition. + +Hardware uses a CLOSID (Class of Service ID) and an RMID (Resource Monitoring ID) +to identify a control group and a monitoring group respectively. Each +resource group is mapped to these IDs based on the kind of group. The +number of CLOSIDs and RMIDs is limited by the hardware, hence the creation of +a "CTRL_MON" directory may fail if we run out of either CLOSIDs or RMIDs +and creation of a "MON" group may fail if we run out of RMIDs. + +max_threshold_occupancy - generic concepts +------------------------------------------ + +Note that an RMID once freed may not be immediately available for use as +the RMID is still tagged in the cache lines of the previous user of the RMID. +Hence such RMIDs are placed on a limbo list and checked again once the cache +occupancy has gone down. If the system has a lot of limbo RMIDs that are +not yet ready to be used, the user may see an -EBUSY +during mkdir. + +max_threshold_occupancy is a user configurable value to determine the +occupancy at which an RMID can be freed. + +The mon_llc_occupancy_limbo tracepoint gives the precise occupancy in bytes +for a subset of RMIDs that are not immediately available for allocation. +This can't be relied on to produce output every second; it may be necessary +to attempt to create an empty monitor group to force an update. Output may +only be produced if creation of a control or monitor group fails. + +Schemata files - general concepts +--------------------------------- +Each line in the file describes one resource. The line starts with +the name of the resource, followed by specific values to be applied +in each of the instances of that resource on the system. + +Cache IDs +--------- +On current generation systems there is one L3 cache per socket and L2 +caches are generally just shared by the hyperthreads on a core, but this +isn't an architectural requirement.
We could have multiple separate L3 +caches on a socket, multiple cores could share an L2 cache. So instead +of using "socket" or "core" to define the set of logical cpus sharing +a resource we use a "Cache ID". At a given cache level this will be a +unique number across the whole system (but it isn't guaranteed to be a +contiguous sequence, there may be gaps). To find the ID for each logical +CPU look in /sys/devices/system/cpu/cpu*/cache/index*/id + +Cache Bit Masks (CBM) +--------------------- +For cache resources we describe the portion of the cache that is available +for allocation using a bitmask. The maximum value of the mask is defined +by each cpu model (and may be different for different cache levels). It +is found using CPUID, but is also provided in the "info" directory of +the resctrl file system in "info/{resource}/cbm_mask". Some Intel hardware +requires that these masks have all the '1' bits in a contiguous block. So +0x3, 0x6 and 0xC are legal 4-bit masks with two bits set, but 0x5, 0x9 +and 0xA are not. Check /sys/fs/resctrl/info/{resource}/sparse_masks +if non-contiguous 1s value is supported. On a system with a 20-bit mask +each bit represents 5% of the capacity of the cache. You could partition +the cache into four equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000. + +Notes on Sub-NUMA Cluster mode +============================== +When SNC mode is enabled, Linux may load balance tasks between Sub-NUMA +nodes much more readily than between regular NUMA nodes since the CPUs +on Sub-NUMA nodes share the same L3 cache and the system may report +the NUMA distance between Sub-NUMA nodes with a lower value than used +for regular NUMA nodes. + +The top-level monitoring files in each "mon_L3_XX" directory provide +the sum of data across all SNC nodes sharing an L3 cache instance. 
+Users who bind tasks to the CPUs of a specific Sub-NUMA node can read +the "llc_occupancy", "mbm_total_bytes", and "mbm_local_bytes" in the +"mon_sub_L3_YY" directories to get node local data. + +Memory bandwidth allocation is still performed at the L3 cache +level. I.e. throttling controls are applied to all SNC nodes. + +L3 cache allocation bitmaps also apply to all SNC nodes. But note that +the amount of L3 cache represented by each bit is divided by the number +of SNC nodes per L3 cache. E.g. with a 100MB cache on a system with 10-bit +allocation masks each bit normally represents 10MB. With SNC mode enabled +with two SNC nodes per L3 cache, each bit only represents 5MB. + +Memory bandwidth Allocation and monitoring +========================================== + +For the memory bandwidth resource, by default the user controls the resource +by indicating the percentage of total memory bandwidth. + +The minimum bandwidth percentage value for each cpu model is predefined +and can be looked up through "info/MB/min_bandwidth". The bandwidth +granularity that is allocated is also dependent on the cpu model and can +be looked up at "info/MB/bandwidth_gran". The available bandwidth +control steps are: min_bw + N * bw_gran. Intermediate values are rounded +to the next control step available on the hardware. + +The bandwidth throttling is a core specific mechanism on some Intel +SKUs. Using a high bandwidth and a low bandwidth setting on two threads +sharing a core may result in both threads being throttled to use the +low bandwidth (see "thread_throttle_mode"). + +The fact that memory bandwidth allocation (MBA) may be a core +specific mechanism whereas memory bandwidth monitoring (MBM) is done at +the package level may lead to confusion when users try to apply control +via the MBA and then monitor the bandwidth to see if the controls are +effective. Below are such scenarios: + +1.
The user may *not* see an increase in actual bandwidth when percentage + values are increased: + +This can occur when aggregate L2 external bandwidth is more than L3 +external bandwidth. Consider an SKL SKU with 24 cores on a package and +where L2 external is 10GBps (hence aggregate L2 external bandwidth is +240GBps) and L3 external bandwidth is 100GBps. Now a workload with '20 +threads, having 50% bandwidth, each consuming 5GBps' consumes the max L3 +bandwidth of 100GBps although the percentage value specified is only 50% +<< 100%. Hence increasing the bandwidth percentage will not yield any +more bandwidth. This is because although the L2 external bandwidth still +has capacity, the L3 external bandwidth is fully used. Also note that +this would be dependent on the number of cores the benchmark is run on. + +2. The same bandwidth percentage may mean different actual bandwidth + depending on # of threads: + +For the same SKU in #1, a 'single thread, with 10% bandwidth' and '4 +threads, with 10% bandwidth' can consume up to 10GBps and 40GBps although +they have the same percentage bandwidth of 10%. This is simply because as +threads start using more cores in an rdtgroup, the actual bandwidth may +increase or vary although the user specified bandwidth percentage is the same. + +In order to mitigate this and make the interface more user friendly, +resctrl added support for specifying the bandwidth in MiBps as well. The +kernel underneath would use a software feedback mechanism or a "Software +Controller(mba_sc)" which reads the actual bandwidth using MBM counters +and adjusts the memory bandwidth percentages to ensure:: + + "actual bandwidth < user specified bandwidth". + +By default, the schemata would take the bandwidth percentage values +whereas the user can switch to the "MBA software controller" mode using +a mount option 'mba_MBps'. The schemata format is specified in the below +sections.
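As a minimal sketch of the MiBps mode described above, assuming resctrl was mounted with the 'mba_MBps' option, a limit can be written to a group's schemata file in the same "MB:<id>=<value>" form used for percentages. The helper name ``set_mb_limit`` and its parameters are hypothetical and only serve the illustration.

```shell
# Write a memory bandwidth limit for one domain to a group's schemata file.
# set_mb_limit, group_dir, domain and mibps are illustrative names.
# With the mba_MBps mount option the value is interpreted as MiBps;
# without it the same field is a bandwidth percentage.
set_mb_limit() {
    group_dir=$1 domain=$2 mibps=$3
    echo "MB:${domain}=${mibps}" > "${group_dir}/schemata"
}
```

For example, after ``mount -t resctrl resctrl -o mba_MBps /sys/fs/resctrl`` and ``mkdir /sys/fs/resctrl/p0``, calling ``set_mb_limit /sys/fs/resctrl/p0 0 1024`` would request a 1024 MiBps limit on domain 0 for group "p0".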
+ +L3 schemata file details (code and data prioritization disabled) +---------------------------------------------------------------- +With CDP disabled the L3 schemata format is:: + + L3:<cache_id0>=<cbm>;<cache_id1>=<cbm>;... + +L3 schemata file details (CDP enabled via mount option to resctrl) +------------------------------------------------------------------ +When CDP is enabled L3 control is split into two separate resources +so you can specify independent masks for code and data like this:: + + L3DATA:<cache_id0>=<cbm>;<cache_id1>=<cbm>;... + L3CODE:<cache_id0>=<cbm>;<cache_id1>=<cbm>;... + +L2 schemata file details +------------------------ +CDP is supported at L2 using the 'cdpl2' mount option. The schemata +format is either:: + + L2:<cache_id0>=<cbm>;<cache_id1>=<cbm>;... + +or + + L2DATA:<cache_id0>=<cbm>;<cache_id1>=<cbm>;... + L2CODE:<cache_id0>=<cbm>;<cache_id1>=<cbm>;... + + +Memory bandwidth Allocation (default mode) +------------------------------------------ + +Memory b/w domain is L3 cache. +:: + + MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;... + +Memory bandwidth Allocation specified in MiBps +---------------------------------------------- + +Memory bandwidth domain is L3 cache. +:: + + MB:<cache_id0>=bw_MiBps0;<cache_id1>=bw_MiBps1;... + +Slow Memory Bandwidth Allocation (SMBA) +--------------------------------------- +AMD hardware supports Slow Memory Bandwidth Allocation (SMBA). +CXL.memory is the only supported "slow" memory device. With the +support of SMBA, the hardware enables bandwidth allocation on +the slow memory devices. If there are multiple such devices in +the system, the throttling logic groups all the slow sources +together and applies the limit on them as a whole. + +The presence of SMBA (with CXL.memory) is independent of slow memory +devices presence. If there are no such devices on the system, then +configuring SMBA will have no impact on the performance of the system. 
+ +The bandwidth domain for slow memory is L3 cache. Its schemata file +is formatted as: +:: + + SMBA:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;... + +Reading/writing the schemata file +--------------------------------- +Reading the schemata file will show the state of all resources +on all domains. When writing you only need to specify those values +which you wish to change. E.g. +:: + + # cat schemata + L3DATA:0=fffff;1=fffff;2=fffff;3=fffff + L3CODE:0=fffff;1=fffff;2=fffff;3=fffff + # echo "L3DATA:2=3c0;" > schemata + # cat schemata + L3DATA:0=fffff;1=fffff;2=3c0;3=fffff + L3CODE:0=fffff;1=fffff;2=fffff;3=fffff + +Reading/writing the schemata file (on AMD systems) +-------------------------------------------------- +Reading the schemata file will show the current bandwidth limit on all +domains. The allocated resources are in multiples of one eighth GB/s. +When writing to the file, you need to specify what cache id you wish to +configure the bandwidth limit. + +For example, to allocate 2GB/s limit on the first cache id: + +:: + + # cat schemata + MB:0=2048;1=2048;2=2048;3=2048 + L3:0=ffff;1=ffff;2=ffff;3=ffff + + # echo "MB:1=16" > schemata + # cat schemata + MB:0=2048;1= 16;2=2048;3=2048 + L3:0=ffff;1=ffff;2=ffff;3=ffff + +Reading/writing the schemata file (on AMD systems) with SMBA feature +-------------------------------------------------------------------- +Reading and writing the schemata file is the same as without SMBA in +above section. + +For example, to allocate 8GB/s limit on the first cache id: + +:: + + # cat schemata + SMBA:0=2048;1=2048;2=2048;3=2048 + MB:0=2048;1=2048;2=2048;3=2048 + L3:0=ffff;1=ffff;2=ffff;3=ffff + + # echo "SMBA:1=64" > schemata + # cat schemata + SMBA:0=2048;1= 64;2=2048;3=2048 + MB:0=2048;1=2048;2=2048;3=2048 + L3:0=ffff;1=ffff;2=ffff;3=ffff + +Cache Pseudo-Locking +==================== +CAT enables a user to specify the amount of cache space that an +application can fill. 
Cache pseudo-locking builds on the fact that a +CPU can still read and write data pre-allocated outside its current +allocated area on a cache hit. With cache pseudo-locking, data can be +preloaded into a reserved portion of cache that no application can +fill, and from that point on will only serve cache hits. The cache +pseudo-locked memory is made accessible to user space where an +application can map it into its virtual address space and thus have +a region of memory with reduced average read latency. + +The creation of a cache pseudo-locked region is triggered by a request +from the user to do so that is accompanied by a schemata of the region +to be pseudo-locked. The cache pseudo-locked region is created as follows: + +- Create a CAT allocation CLOSNEW with a CBM matching the schemata + from the user of the cache region that will contain the pseudo-locked + memory. This region must not overlap with any current CAT allocation/CLOS + on the system and no future overlap with this cache region is allowed + while the pseudo-locked region exists. +- Create a contiguous region of memory of the same size as the cache + region. +- Flush the cache, disable hardware prefetchers, disable preemption. +- Make CLOSNEW the active CLOS and touch the allocated memory to load + it into the cache. +- Set the previous CLOS as active. +- At this point the closid CLOSNEW can be released - the cache + pseudo-locked region is protected as long as its CBM does not appear in + any CAT allocation. Even though the cache pseudo-locked region will from + this point on not appear in any CBM of any CLOS an application running with + any CLOS will be able to access the memory in the pseudo-locked region since + the region continues to serve cache hits. +- The contiguous region of memory loaded into the cache is exposed to + user-space as a character device. 
+ +Cache pseudo-locking increases the probability that data will remain +in the cache by carefully configuring the CAT feature and controlling +application behavior. There is no guarantee that data is placed in +cache. Instructions like INVD, WBINVD, CLFLUSH, etc. can still evict +“locked” data from cache. Power management C-states may shrink or +power off cache. Deeper C-states will automatically be restricted on +pseudo-locked region creation. + +It is required that an application using a pseudo-locked region runs +with affinity to the cores (or a subset of the cores) associated +with the cache on which the pseudo-locked region resides. A sanity check +within the code will not allow an application to map pseudo-locked memory +unless it runs with affinity to cores associated with the cache on which the +pseudo-locked region resides. The sanity check is only done during the +initial mmap() handling; there is no enforcement afterwards and the +application itself needs to ensure it remains affine to the correct cores. + +Pseudo-locking is accomplished in two stages: + +1) During the first stage the system administrator allocates a portion + of cache that should be dedicated to pseudo-locking. At this time an + equivalent portion of memory is allocated, loaded into the allocated + cache portion, and exposed as a character device. +2) During the second stage a user-space application maps (mmap()) the + pseudo-locked memory into its address space. + +Cache Pseudo-Locking Interface +------------------------------ +A pseudo-locked region is created using the resctrl interface as follows: + +1) Create a new resource group by creating a new directory in /sys/fs/resctrl. +2) Change the new resource group's mode to "pseudo-locksetup" by writing + "pseudo-locksetup" to the "mode" file. +3) Write the schemata of the pseudo-locked region to the "schemata" file. All + bits within the schemata should be "unused" according to the "bit_usage" + file.
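The three steps above can be sketched as a small shell helper. This is only an illustration: the name ``create_pseudo_lock`` and its parameters are hypothetical, and the schemata value in the usage note is an example that must be checked against "bit_usage" on the target system.

```shell
# Create a pseudo-locked region following the three steps listed above.
# create_pseudo_lock, root, name and schemata are illustrative names;
# root is the resctrl mount point (for example /sys/fs/resctrl).
create_pseudo_lock() {
    root=$1 name=$2 schemata=$3
    # 1) Create the new resource group.
    mkdir "${root}/${name}" &&
    # 2) Switch the group to pseudo-locksetup mode.
    echo pseudo-locksetup > "${root}/${name}/mode" &&
    # 3) Write the schemata of the region to be pseudo-locked.
    echo "${schemata}" > "${root}/${name}/schemata" &&
    # Print the resulting mode so the caller can check the outcome.
    cat "${root}/${name}/mode"
}
```

For example, ``create_pseudo_lock /sys/fs/resctrl newlock "L2:1=0x3"`` would be expected to print "pseudo-locked" when the region was created successfully. If any step fails, info/last_cmd_status should describe the reason.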
+ +On successful pseudo-locked region creation the "mode" file will contain +"pseudo-locked" and a new character device with the same name as the resource +group will exist in /dev/pseudo_lock. This character device can be mmap()'ed +by user space in order to obtain access to the pseudo-locked memory region. + +An example of cache pseudo-locked region creation and usage can be found below. + +Cache Pseudo-Locking Debugging Interface +---------------------------------------- +The pseudo-locking debugging interface is enabled by default (if +CONFIG_DEBUG_FS is enabled) and can be found in /sys/kernel/debug/resctrl. + +There is no explicit way for the kernel to test if a provided memory +location is present in the cache. The pseudo-locking debugging interface uses +the tracing infrastructure to provide two ways to measure cache residency of +the pseudo-locked region: + +1) Memory access latency using the pseudo_lock_mem_latency tracepoint. Data + from these measurements are best visualized using a hist trigger (see + example below). In this test the pseudo-locked region is traversed at + a stride of 32 bytes while hardware prefetchers and preemption + are disabled. This also provides a substitute visualization of cache + hits and misses. +2) Cache hit and miss measurements using model specific precision counters if + available. Depending on the levels of cache on the system the pseudo_lock_l2 + and pseudo_lock_l3 tracepoints are available. + +When a pseudo-locked region is created a new debugfs directory is created for +it in debugfs as /sys/kernel/debug/resctrl/<newdir>. A single +write-only file, pseudo_lock_measure, is present in this directory. The +measurement of the pseudo-locked region depends on the number written to this +debugfs file: + +1: + writing "1" to the pseudo_lock_measure file will trigger the latency + measurement captured in the pseudo_lock_mem_latency tracepoint. See + example below. 
+2: + writing "2" to the pseudo_lock_measure file will trigger the L2 cache + residency (cache hits and misses) measurement captured in the + pseudo_lock_l2 tracepoint. See example below. +3: + writing "3" to the pseudo_lock_measure file will trigger the L3 cache + residency (cache hits and misses) measurement captured in the + pseudo_lock_l3 tracepoint. + +All measurements are recorded with the tracing infrastructure. This requires +the relevant tracepoints to be enabled before the measurement is triggered. + +Example of latency debugging interface +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +In this example a pseudo-locked region named "newlock" was created. Here is +how we can measure the latency in cycles of reading from this region and +visualize this data with a histogram that is available if CONFIG_HIST_TRIGGERS +is set:: + + # :> /sys/kernel/tracing/trace + # echo 'hist:keys=latency' > /sys/kernel/tracing/events/resctrl/pseudo_lock_mem_latency/trigger + # echo 1 > /sys/kernel/tracing/events/resctrl/pseudo_lock_mem_latency/enable + # echo 1 > /sys/kernel/debug/resctrl/newlock/pseudo_lock_measure + # echo 0 > /sys/kernel/tracing/events/resctrl/pseudo_lock_mem_latency/enable + # cat /sys/kernel/tracing/events/resctrl/pseudo_lock_mem_latency/hist + + # event histogram + # + # trigger info: hist:keys=latency:vals=hitcount:sort=hitcount:size=2048 [active] + # + + { latency: 456 } hitcount: 1 + { latency: 50 } hitcount: 83 + { latency: 36 } hitcount: 96 + { latency: 44 } hitcount: 174 + { latency: 48 } hitcount: 195 + { latency: 46 } hitcount: 262 + { latency: 42 } hitcount: 693 + { latency: 40 } hitcount: 3204 + { latency: 38 } hitcount: 3484 + + Totals: + Hits: 8192 + Entries: 9 + Dropped: 0 + +Example of cache hits/misses debugging +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +In this example a pseudo-locked region named "newlock" was created on the L2 +cache of a platform. 
Here is how we can obtain details of the cache hits +and misses using the platform's precision counters. +:: + + # :> /sys/kernel/tracing/trace + # echo 1 > /sys/kernel/tracing/events/resctrl/pseudo_lock_l2/enable + # echo 2 > /sys/kernel/debug/resctrl/newlock/pseudo_lock_measure + # echo 0 > /sys/kernel/tracing/events/resctrl/pseudo_lock_l2/enable + # cat /sys/kernel/tracing/trace + + # tracer: nop + # + # _-----=> irqs-off + # / _----=> need-resched + # | / _---=> hardirq/softirq + # || / _--=> preempt-depth + # ||| / delay + # TASK-PID CPU# |||| TIMESTAMP FUNCTION + # | | | |||| | | + pseudo_lock_mea-1672 [002] .... 3132.860500: pseudo_lock_l2: hits=4097 miss=0 + + +Examples for RDT allocation usage +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +1) Example 1 + +On a two socket machine (one L3 cache per socket) with just four bits +for cache bit masks, minimum b/w of 10% with a memory bandwidth +granularity of 10%. +:: + + # mount -t resctrl resctrl /sys/fs/resctrl + # cd /sys/fs/resctrl + # mkdir p0 p1 + # echo "L3:0=3;1=c\nMB:0=50;1=50" > /sys/fs/resctrl/p0/schemata + # echo "L3:0=3;1=3\nMB:0=50;1=50" > /sys/fs/resctrl/p1/schemata + +The default resource group is unmodified, so we have access to all parts +of all caches (its schemata file reads "L3:0=f;1=f"). + +Tasks that are under the control of group "p0" may only allocate from the +"lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1. +Tasks in group "p1" use the "lower" 50% of cache on both sockets. + +Similarly, tasks that are under the control of group "p0" may use a +maximum memory b/w of 50% on socket0 and 50% on socket 1. +Tasks in group "p1" may also use 50% memory b/w on both sockets. +Note that unlike cache masks, memory b/w cannot specify whether these +allocations can overlap or not. The allocations specifies the maximum +b/w that the group may be able to use and the system admin can configure +the b/w accordingly. 
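
The "lower 50%" and "upper 50%" readings of the masks in Example 1 follow from counting the bits set in each CBM and comparing that count with the width of the full mask. The sketch below shows that arithmetic; the helper names `cbm_weight()` and `cbm_percent()` are hypothetical and only serve the illustration.

```c
#include <stdint.h>

/* Count the bits set in a capacity bitmask. */
unsigned int cbm_weight(uint32_t cbm)
{
	unsigned int bits = 0;

	while (cbm) {
		cbm &= cbm - 1;	/* clear the lowest set bit */
		bits++;
	}
	return bits;
}

/*
 * Hypothetical helper: percentage of the cache covered by cbm when the
 * full mask is cbm_len bits wide (four bits in Example 1).
 */
unsigned int cbm_percent(uint32_t cbm, unsigned int cbm_len)
{
	if (cbm_len == 0)
		return 0;
	return cbm_weight(cbm) * 100 / cbm_len;
}
```

For the four-bit masks of Example 1, both 0x3 and 0xc cover two of four bits, which is 50% of the cache; they differ only in which half they select.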
+ +If resctrl is using the software controller (mba_sc) then the user can enter the +max b/w in MB rather than the percentage values. +:: + + # echo "L3:0=3;1=c\nMB:0=1024;1=500" > /sys/fs/resctrl/p0/schemata + # echo "L3:0=3;1=3\nMB:0=1024;1=500" > /sys/fs/resctrl/p1/schemata + +In the above example the tasks in "p1" and "p0" on socket 0 would use a max b/w +of 1024MB whereas on socket 1 they would use 500MB. + +2) Example 2 + +Again two sockets, but this time with a more realistic 20-bit mask. + +Two real time tasks pid=1234 running on processor 0 and pid=5678 running on +processor 1 on socket 0 on a 2-socket and dual core machine. To avoid noisy +neighbors, each of the two real-time tasks exclusively occupies one quarter +of L3 cache on socket 0. +:: + + # mount -t resctrl resctrl /sys/fs/resctrl + # cd /sys/fs/resctrl + +First we reset the schemata for the default group so that the "upper" +50% of the L3 cache on socket 0 and 50% of memory b/w cannot be used by +ordinary tasks:: + + # echo "L3:0=3ff;1=fffff\nMB:0=50;1=100" > schemata + +Next we make a resource group for our first real time task and give +it access to the "top" 25% of the cache on socket 0. +:: + + # mkdir p0 + # echo "L3:0=f8000;1=fffff" > p0/schemata + +Finally we move our first real time task into this resource group. We +also use taskset(1) to ensure the task always runs on a dedicated CPU +on socket 0. Most uses of resource groups will also constrain which +processors tasks run on. +:: + + # echo 1234 > p0/tasks + # taskset -cp 1 1234 + +Ditto for the second real time task (with the remaining 25% of cache):: + + # mkdir p1 + # echo "L3:0=7c00;1=fffff" > p1/schemata + # echo 5678 > p1/tasks + # taskset -cp 2 5678 + +For the same 2 socket system with memory b/w resource and CAT L3 the +schemata would look like this (assuming min_bandwidth is 10 and +bandwidth_gran is 10): + +For our first real time task this would request 20% memory b/w on socket 0.
+:: + + # echo -e "L3:0=f8000;1=fffff\nMB:0=20;1=100" > p0/schemata + +For our second real time task this would request another 20% memory b/w +on socket 0. +:: + + # echo -e "L3:0=7c00;1=fffff\nMB:0=20;1=100" > p1/schemata + +3) Example 3 + +A single socket system which has real-time tasks running on cores 4-7 and +non real-time workload assigned to cores 0-3. The real-time tasks share text +and data, so a per task association is not required and due to interaction +with the kernel it's desired that the kernel on these cores shares L3 with +the tasks. +:: + + # mount -t resctrl resctrl /sys/fs/resctrl + # cd /sys/fs/resctrl + +First we reset the schemata for the default group so that the "upper" +50% of the L3 cache on socket 0, and 50% of memory bandwidth on socket 0 +cannot be used by ordinary tasks:: + + # echo "L3:0=3ff\nMB:0=50" > schemata + +Next we make a resource group for our real time cores and give it access +to the "top" 50% of the cache on socket 0 and 50% of memory bandwidth on +socket 0. +:: + + # mkdir p0 + # echo "L3:0=ffc00\nMB:0=50" > p0/schemata + +Finally we move cores 4-7 over to the new group and make sure that the +kernel and the tasks running there get 50% of the cache. They should +also get 50% of memory bandwidth assuming that the cores 4-7 are SMT +siblings and only the real time threads are scheduled on the cores 4-7. +:: + + # echo F0 > p0/cpus + +4) Example 4 + +The resource groups in previous examples were all in the default "shareable" +mode allowing sharing of their cache allocations. If one resource group +configures a cache allocation then nothing prevents another resource group +from overlapping with that allocation. + +In this example a new exclusive resource group will be created on an L2 CAT +system with two L2 cache instances that can be configured with an 8-bit +capacity bitmask. The new exclusive resource group will be configured to use +25% of each cache instance.
+:: + + # mount -t resctrl resctrl /sys/fs/resctrl/ + # cd /sys/fs/resctrl + +First, we observe that the default group is configured to allocate to all L2 +cache:: + + # cat schemata + L2:0=ff;1=ff + +We could attempt to create the new resource group at this point, but it will +fail because of the overlap with the schemata of the default group:: + + # mkdir p0 + # echo 'L2:0=0x3;1=0x3' > p0/schemata + # cat p0/mode + shareable + # echo exclusive > p0/mode + -sh: echo: write error: Invalid argument + # cat info/last_cmd_status + schemata overlaps + +To ensure that there is no overlap with another resource group the default +resource group's schemata has to change, making it possible for the new +resource group to become exclusive. +:: + + # echo 'L2:0=0xfc;1=0xfc' > schemata + # echo exclusive > p0/mode + # grep . p0/* + p0/cpus:0 + p0/mode:exclusive + p0/schemata:L2:0=03;1=03 + p0/size:L2:0=262144;1=262144 + +A new resource group will on creation not overlap with an exclusive resource +group:: + + # mkdir p1 + # grep . p1/* + p1/cpus:0 + p1/mode:shareable + p1/schemata:L2:0=fc;1=fc + p1/size:L2:0=786432;1=786432 + +The bit_usage will reflect how the cache is used:: + + # cat info/L2/bit_usage + 0=SSSSSSEE;1=SSSSSSEE + +A resource group cannot be forced to overlap with an exclusive resource group:: + + # echo 'L2:0=0x1;1=0x1' > p1/schemata + -sh: echo: write error: Invalid argument + # cat info/last_cmd_status + overlaps with exclusive group + +Example of Cache Pseudo-Locking +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Lock portion of L2 cache from cache id 1 using CBM 0x3. Pseudo-locked +region is exposed at /dev/pseudo_lock/newlock that can be provided to +application for argument to mmap(). 
+:: + + # mount -t resctrl resctrl /sys/fs/resctrl/ + # cd /sys/fs/resctrl + +Ensure that there are bits available that can be pseudo-locked, since only +unused bits can be pseudo-locked the bits to be pseudo-locked needs to be +removed from the default resource group's schemata:: + + # cat info/L2/bit_usage + 0=SSSSSSSS;1=SSSSSSSS + # echo 'L2:1=0xfc' > schemata + # cat info/L2/bit_usage + 0=SSSSSSSS;1=SSSSSS00 + +Create a new resource group that will be associated with the pseudo-locked +region, indicate that it will be used for a pseudo-locked region, and +configure the requested pseudo-locked region capacity bitmask:: + + # mkdir newlock + # echo pseudo-locksetup > newlock/mode + # echo 'L2:1=0x3' > newlock/schemata + +On success the resource group's mode will change to pseudo-locked, the +bit_usage will reflect the pseudo-locked region, and the character device +exposing the pseudo-locked region will exist:: + + # cat newlock/mode + pseudo-locked + # cat info/L2/bit_usage + 0=SSSSSSSS;1=SSSSSSPP + # ls -l /dev/pseudo_lock/newlock + crw------- 1 root root 243, 0 Apr 3 05:01 /dev/pseudo_lock/newlock + +:: + + /* + * Example code to access one page of pseudo-locked cache region + * from user space. + */ + #define _GNU_SOURCE + #include <fcntl.h> + #include <sched.h> + #include <stdio.h> + #include <stdlib.h> + #include <unistd.h> + #include <sys/mman.h> + + /* + * It is required that the application runs with affinity to only + * cores associated with the pseudo-locked region. Here the cpu + * is hardcoded for convenience of example. 
+ */ + static int cpuid = 2; + + int main(int argc, char *argv[]) + { + cpu_set_t cpuset; + long page_size; + void *mapping; + int dev_fd; + int ret; + + page_size = sysconf(_SC_PAGESIZE); + + CPU_ZERO(&cpuset); + CPU_SET(cpuid, &cpuset); + ret = sched_setaffinity(0, sizeof(cpuset), &cpuset); + if (ret < 0) { + perror("sched_setaffinity"); + exit(EXIT_FAILURE); + } + + dev_fd = open("/dev/pseudo_lock/newlock", O_RDWR); + if (dev_fd < 0) { + perror("open"); + exit(EXIT_FAILURE); + } + + mapping = mmap(0, page_size, PROT_READ | PROT_WRITE, MAP_SHARED, + dev_fd, 0); + if (mapping == MAP_FAILED) { + perror("mmap"); + close(dev_fd); + exit(EXIT_FAILURE); + } + + /* Application interacts with pseudo-locked memory @mapping */ + + ret = munmap(mapping, page_size); + if (ret < 0) { + perror("munmap"); + close(dev_fd); + exit(EXIT_FAILURE); + } + + close(dev_fd); + exit(EXIT_SUCCESS); + } + +Locking between applications +---------------------------- + +Certain operations on the resctrl filesystem, composed of read/writes +to/from multiple files, must be atomic. + +As an example, the allocation of an exclusive reservation of L3 cache +involves: + + 1. Read the cbmmasks from each directory or the per-resource "bit_usage" + 2. Find a contiguous set of bits in the global CBM bitmask that is clear + in any of the directory cbmmasks + 3. Create a new directory + 4. Set the bits found in step 2 to the new directory "schemata" file + +If two applications attempt to allocate space concurrently then they can +end up allocating the same bits so the reservations are shared instead of +exclusive. + +To coordinate atomic operations on the resctrlfs and to avoid the problem +above, the following locking procedure is recommended: + +Locking is based on flock, which is available in libc and also as a shell +script command + +Write lock: + + A) Take flock(LOCK_EX) on /sys/fs/resctrl + B) Read/write the directory structure. 
+ C) Release the lock with flock(LOCK_UN) + +Read lock: + + A) Take flock(LOCK_SH) on /sys/fs/resctrl + B) On success, read the directory structure. + C) Release the lock with flock(LOCK_UN) + +Example with bash:: + + # Atomically read directory structure + $ flock -s /sys/fs/resctrl/ find /sys/fs/resctrl + + # Read directory contents and create new subdirectory + + $ cat create-dir.sh + find /sys/fs/resctrl/ > output.txt + mask = function-of(output.txt) + mkdir /sys/fs/resctrl/newres/ + echo mask > /sys/fs/resctrl/newres/schemata + + $ flock /sys/fs/resctrl/ ./create-dir.sh + +Example with C:: + + /* + * Example code to take advisory locks + * before accessing resctrl filesystem + */ + #include <fcntl.h> + #include <stdio.h> + #include <stdlib.h> + #include <sys/file.h> + #include <unistd.h> + + void resctrl_take_shared_lock(int fd) + { + int ret; + + /* take shared lock on resctrl filesystem */ + ret = flock(fd, LOCK_SH); + if (ret) { + perror("flock"); + exit(-1); + } + } + + void resctrl_take_exclusive_lock(int fd) + { + int ret; + + /* take exclusive lock on resctrl filesystem */ + ret = flock(fd, LOCK_EX); + if (ret) { + perror("flock"); + exit(-1); + } + } + + void resctrl_release_lock(int fd) + { + int ret; + + /* release lock on resctrl filesystem */ + ret = flock(fd, LOCK_UN); + if (ret) { + perror("flock"); + exit(-1); + } + } + + int main(void) + { + int fd; + + fd = open("/sys/fs/resctrl", O_RDONLY | O_DIRECTORY); + if (fd == -1) { + perror("open"); + exit(-1); + } + resctrl_take_shared_lock(fd); + /* code to read directory contents */ + resctrl_release_lock(fd); + + resctrl_take_exclusive_lock(fd); + /* code to read and write directory contents */ + resctrl_release_lock(fd); + + close(fd); + return 0; + } + +Examples for RDT Monitoring along with allocation usage +======================================================= +Reading monitored data +---------------------- +Reading an event file (for example, mon_data/mon_L3_00/llc_occupancy) would +show the current snapshot of LLC occupancy of the corresponding MON +group or CTRL_MON group.
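
Reading an event file from a program amounts to opening it and parsing a single number. The following is a minimal sketch of that step; the helper name `read_event_file()` is hypothetical, and the sketch assumes the file holds one unsigned decimal value, treating any other content as an error.

```c
#include <stdio.h>

/*
 * Hypothetical helper: read one unsigned counter value from a resctrl
 * event file. Returns 0 on success, -1 if the file cannot be opened or
 * does not start with a number.
 */
int read_event_file(const char *path, unsigned long long *value)
{
	FILE *f = fopen(path, "r");
	int matched;

	if (!f)
		return -1;
	matched = fscanf(f, "%llu", value);
	fclose(f);
	return matched == 1 ? 0 : -1;
}
```

A caller could, for instance, pass "/sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy" as the path and interpret the returned value as the occupancy in bytes.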
+ + +Example 1 (Monitor CTRL_MON group and subset of tasks in CTRL_MON group) +------------------------------------------------------------------------ +On a two socket machine (one L3 cache per socket) with just four bits +for cache bit masks:: + + # mount -t resctrl resctrl /sys/fs/resctrl + # cd /sys/fs/resctrl + # mkdir p0 p1 + # echo "L3:0=3;1=c" > /sys/fs/resctrl/p0/schemata + # echo "L3:0=3;1=3" > /sys/fs/resctrl/p1/schemata + # echo 5678 > p1/tasks + # echo 5679 > p1/tasks + +The default resource group is unmodified, so we have access to all parts +of all caches (its schemata file reads "L3:0=f;1=f"). + +Tasks that are under the control of group "p0" may only allocate from the +"lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1. +Tasks in group "p1" use the "lower" 50% of cache on both sockets. + +Create monitor groups and assign a subset of tasks to each monitor group. +:: + + # cd /sys/fs/resctrl/p1/mon_groups + # mkdir m11 m12 + # echo 5678 > m11/tasks + # echo 5679 > m12/tasks + +Fetch data (data shown in bytes) +:: + + # cat m11/mon_data/mon_L3_00/llc_occupancy + 16234000 + # cat m11/mon_data/mon_L3_01/llc_occupancy + 14789000 + # cat m12/mon_data/mon_L3_00/llc_occupancy + 16789000 + +The parent CTRL_MON group shows the aggregated data. +:: + + # cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy + 31234000 + +Example 2 (Monitor a task from its creation) +-------------------------------------------- +On a two socket machine (one L3 cache per socket):: + + # mount -t resctrl resctrl /sys/fs/resctrl + # cd /sys/fs/resctrl + # mkdir p0 p1 + +An RMID is allocated to the group once it is created and hence the <cmd> +below is monitored from its creation.
+:: + + # echo $$ > /sys/fs/resctrl/p1/tasks + # <cmd> + +Fetch the data:: + + # cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy + 31789000 + +Example 3 (Monitor without CAT support or before creating CAT groups) +--------------------------------------------------------------------- + +Assume a system like HSW has only CQM and no CAT support. In this case +resctrl will still mount but cannot create CTRL_MON directories. +However, the user can create different MON groups within the root group and +thereby monitor all tasks, including kernel threads. + +This can also be used to profile a job's cache size footprint before +allocating it to a different allocation group. +:: + + # mount -t resctrl resctrl /sys/fs/resctrl + # cd /sys/fs/resctrl + # mkdir mon_groups/m01 + # mkdir mon_groups/m02 + + # echo 3478 > /sys/fs/resctrl/mon_groups/m01/tasks + # echo 2467 > /sys/fs/resctrl/mon_groups/m02/tasks + +Monitor the groups separately and also get per domain data. From the +output below it is apparent that the tasks are mostly doing work on +domain (socket) 0. +:: + + # cat /sys/fs/resctrl/mon_groups/m01/mon_data/mon_L3_00/llc_occupancy + 31234000 + # cat /sys/fs/resctrl/mon_groups/m01/mon_data/mon_L3_01/llc_occupancy + 34555 + # cat /sys/fs/resctrl/mon_groups/m02/mon_data/mon_L3_00/llc_occupancy + 31234000 + # cat /sys/fs/resctrl/mon_groups/m02/mon_data/mon_L3_01/llc_occupancy + 32789 + + +Example 4 (Monitor real time tasks) +----------------------------------- + +A single socket system which has real time tasks running on cores 4-7 +and non real time tasks on other cpus. We want to monitor the cache +occupancy of the real time threads on these cores.
+:: + + # mount -t resctrl resctrl /sys/fs/resctrl + # cd /sys/fs/resctrl + # mkdir p1 + +Move the cpus 4-7 over to p1:: + + # echo f0 > p1/cpus + +View the llc occupancy snapshot:: + + # cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy + 11234000 + +Intel RDT Errata +================ + +Intel MBM Counters May Report System Memory Bandwidth Incorrectly +----------------------------------------------------------------- + +Errata SKX99 for Skylake server and BDF102 for Broadwell server. + +Problem: Intel Memory Bandwidth Monitoring (MBM) counters track metrics +according to the assigned Resource Monitor ID (RMID) for that logical +core. The IA32_QM_CTR register (MSR 0xC8E), used to report these +metrics, may report incorrect system bandwidth for certain RMID values. + +Implication: Due to the errata, system memory bandwidth may not match +what is reported. + +Workaround: MBM total and local readings are corrected according to the +following correction factor table: + ++---------------+---------------+---------------+-----------------+ +|core count |rmid count |rmid threshold |correction factor| ++---------------+---------------+---------------+-----------------+ +|1 |8 |0 |1.000000 | ++---------------+---------------+---------------+-----------------+ +|2 |16 |0 |1.000000 | ++---------------+---------------+---------------+-----------------+ +|3 |24 |15 |0.969650 | ++---------------+---------------+---------------+-----------------+ +|4 |32 |0 |1.000000 | ++---------------+---------------+---------------+-----------------+ +|6 |48 |31 |0.969650 | ++---------------+---------------+---------------+-----------------+ +|7 |56 |47 |1.142857 | ++---------------+---------------+---------------+-----------------+ +|8 |64 |0 |1.000000 | ++---------------+---------------+---------------+-----------------+ +|9 |72 |63 |1.185115 | ++---------------+---------------+---------------+-----------------+ +|10 |80 |63 |1.066553 | 
++---------------+---------------+---------------+-----------------+ +|11 |88 |79 |1.454545 | ++---------------+---------------+---------------+-----------------+ +|12 |96 |0 |1.000000 | ++---------------+---------------+---------------+-----------------+ +|13 |104 |95 |1.230769 | ++---------------+---------------+---------------+-----------------+ +|14 |112 |95 |1.142857 | ++---------------+---------------+---------------+-----------------+ +|15 |120 |95 |1.066667 | ++---------------+---------------+---------------+-----------------+ +|16 |128 |0 |1.000000 | ++---------------+---------------+---------------+-----------------+ +|17 |136 |127 |1.254863 | ++---------------+---------------+---------------+-----------------+ +|18 |144 |127 |1.185255 | ++---------------+---------------+---------------+-----------------+ +|19 |152 |0 |1.000000 | ++---------------+---------------+---------------+-----------------+ +|20 |160 |127 |1.066667 | ++---------------+---------------+---------------+-----------------+ +|21 |168 |0 |1.000000 | ++---------------+---------------+---------------+-----------------+ +|22 |176 |159 |1.454334 | ++---------------+---------------+---------------+-----------------+ +|23 |184 |0 |1.000000 | ++---------------+---------------+---------------+-----------------+ +|24 |192 |127 |0.969744 | ++---------------+---------------+---------------+-----------------+ +|25 |200 |191 |1.280246 | ++---------------+---------------+---------------+-----------------+ +|26 |208 |191 |1.230921 | ++---------------+---------------+---------------+-----------------+ +|27 |216 |0 |1.000000 | ++---------------+---------------+---------------+-----------------+ +|28 |224 |191 |1.143118 | ++---------------+---------------+---------------+-----------------+ + +If rmid > rmid threshold, MBM total and local values should be multiplied +by the correction factor. + +See: + +1. 
Erratum SKX99 in Intel Xeon Processor Scalable Family Specification Update: +http://web.archive.org/web/20200716124958/https://www.intel.com/content/www/us/en/processors/xeon/scalable/xeon-scalable-spec-update.html + +2. Erratum BDF102 in Intel Xeon E5-2600 v4 Processor Product Family Specification Update: +http://web.archive.org/web/20191125200531/https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-e5-v4-spec-update.pdf + +3. The errata in Intel Resource Director Technology (Intel RDT) on 2nd Generation Intel Xeon Scalable Processors Reference Manual: +https://software.intel.com/content/www/us/en/develop/articles/intel-resource-director-technology-rdt-reference-manual.html + +for further information. diff --git a/Documentation/filesystems/smb/index.rst b/Documentation/filesystems/smb/index.rst index 1c8597a679ab..6df23b0e45c8 100644 --- a/Documentation/filesystems/smb/index.rst +++ b/Documentation/filesystems/smb/index.rst @@ -8,3 +8,4 @@ CIFS ksmbd cifsroot + smbdirect diff --git a/Documentation/filesystems/smb/smbdirect.rst b/Documentation/filesystems/smb/smbdirect.rst new file mode 100644 index 000000000000..ca6927c0b2c0 --- /dev/null +++ b/Documentation/filesystems/smb/smbdirect.rst @@ -0,0 +1,103 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=========================== +SMB Direct - SMB3 over RDMA +=========================== + +This document describes how to set up the Linux SMB client and server to +use RDMA. + +Overview +======== +The Linux SMB kernel client supports SMB Direct, which is a transport +scheme for SMB3 that uses RDMA (Remote Direct Memory Access) to provide +high throughput and low latencies by bypassing the traditional TCP/IP +stack. +SMB Direct on the Linux SMB client can be tested against KSMBD - a +kernel-space SMB server. + +Installation +============= +- Install an RDMA device. As long as the RDMA device driver is supported + by the kernel, it should work. 
This includes both software emulators (soft + RoCE, soft iWARP) and hardware devices (InfiniBand, RoCE, iWARP). + +- Install a kernel with SMB Direct support. The first kernel release to + support SMB Direct on both the client and server side is 5.15. Therefore, + a distribution compatible with kernel 5.15 or later is required. + +- Install cifs-utils, which provides the `mount.cifs` command to mount SMB + shares. + +- Configure the RDMA stack + + Make sure that your kernel configuration has RDMA support enabled. Under + Device Drivers -> Infiniband support, update the kernel configuration to + enable Infiniband support. + + Enable the appropriate IB HCA support or iWARP adapter support, + depending on your hardware. + + If you are using InfiniBand, enable IP-over-InfiniBand support. + + For soft RDMA, enable either the soft iWARP (`RDMA_SIW`) or soft RoCE + (`RDMA_RXE`) module. Install the `iproute2` package and use the + `rdma link add` command to load the module and create an + RDMA interface. + + For example, if your local ethernet interface is `eth0`, you can use: + + .. code-block:: bash + + sudo rdma link add siw0 type siw netdev eth0 + +- Enable SMB Direct support for both the server and the client in the kernel + configuration. + + Server Setup + + .. code-block:: text + + Network File Systems ---> + <M> SMB3 server support + [*] Support for SMB Direct protocol + + Client Setup + + .. code-block:: text + + Network File Systems ---> + <M> SMB3 and CIFS support (advanced network filesystem) + [*] SMB Direct support + +- Build and install the kernel. SMB Direct support will be enabled in the + cifs.ko and ksmbd.ko modules. + +Setup and Usage +================ + +- Set up and start a KSMBD server as described in the `KSMBD documentation + <https://www.kernel.org/doc/Documentation/filesystems/smb/ksmbd.rst>`_. + Also add the "server multi channel support = yes" parameter to ksmbd.conf.
+ +- On the client, mount the share with `rdma` mount option to use SMB Direct + (specify a SMB version 3.0 or higher using `vers`). + + For example: + + .. code-block:: bash + + mount -t cifs //server/share /mnt/point -o vers=3.1.1,rdma + +- To verify that the mount is using SMB Direct, you can check dmesg for the + following log line after mounting: + + .. code-block:: text + + CIFS: VFS: RDMA transport established + + Or, verify `rdma` mount option for the share in `/proc/mounts`: + + .. code-block:: bash + + cat /proc/mounts | grep cifs diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst index bf051c7da6b8..fd32a9a17bfb 100644 --- a/Documentation/filesystems/vfs.rst +++ b/Documentation/filesystems/vfs.rst @@ -1390,9 +1390,7 @@ defined: If a vfsmount is returned, the caller will attempt to mount it on the mountpoint and will remove the vfsmount from its - expiration list in the case of failure. The vfsmount should be - returned with 2 refs on it to prevent automatic expiration - the - caller will clean up the additional ref. + expiration list in the case of failure. This function is only used if DCACHE_NEED_AUTOMOUNT is set on the dentry. This is set by __d_instantiate() if S_AUTOMOUNT is |