Diffstat (limited to 'Documentation/filesystems')
21 files changed, 670 insertions, 140 deletions
diff --git a/Documentation/filesystems/dax.rst b/Documentation/filesystems/dax.rst index 08dd5e254cc5..5b283f3d1eb1 100644 --- a/Documentation/filesystems/dax.rst +++ b/Documentation/filesystems/dax.rst @@ -206,7 +206,6 @@ stall the CPU for an extended period, you should also not attempt to implement direct_access. These block devices may be used for inspiration: -- brd: RAM backed block device driver - pmem: NVDIMM persistent memory driver diff --git a/Documentation/filesystems/ext4/atomic_writes.rst b/Documentation/filesystems/ext4/atomic_writes.rst index f65767df3620..aeb47ace738d 100644 --- a/Documentation/filesystems/ext4/atomic_writes.rst +++ b/Documentation/filesystems/ext4/atomic_writes.rst @@ -148,10 +148,10 @@ reserved during: only required to handle a split extent across leaf blocks. How to ------- +~~~~~~ Creating Filesystems with Atomic Write Support -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ First check the atomic write units supported by the block device. See :ref:`atomic_write_bdev_support` for more details. @@ -176,7 +176,7 @@ Where ``-b`` specifies the block size, ``-C`` specifies the cluster size in bytes and ``-O bigalloc`` enables the bigalloc feature. Application Interface -~~~~~~~~~~~~~~~~~~~~~ +^^^^^^^^^^^^^^^^^^^^^ Applications can use the ``pwritev2()`` system call with the ``RWF_ATOMIC`` flag to perform atomic writes: @@ -204,7 +204,7 @@ writes are supported. .. _atomic_write_bdev_support: Hardware Support ----------------- +~~~~~~~~~~~~~~~~ The underlying storage device must support atomic write operations. Modern NVMe and SCSI devices often provide this capability. @@ -217,7 +217,7 @@ Nonzero values for these attributes indicate that the device supports atomic writes. See Also --------- +~~~~~~~~ * :doc:`bigalloc` - Documentation on the bigalloc feature * :doc:`allocators` - Documentation on block allocation in ext4 diff --git a/Documentation/filesystems/ext4/bitmaps.rst b/Documentation/filesystems/ext4/bitmaps.rst index 91c45d86e9bb..9d7d7b083a25 100644 --- a/Documentation/filesystems/ext4/bitmaps.rst +++ b/Documentation/filesystems/ext4/bitmaps.rst @@ -19,10 +19,3 @@ necessarily the case that no blocks are in use -- if ``meta_bg`` is set, the bitmaps and group descriptor live inside the group. Unfortunately, ext2fs_test_block_bitmap2() will return '0' for those locations, which produces confusing debugfs output. - -Inode Table ------------ -Inode tables are statically allocated at mkfs time. Each block group -descriptor points to the start of the table, and the superblock records -the number of inodes per group. See the section on inodes for more -information. diff --git a/Documentation/filesystems/ext4/blockgroup.rst b/Documentation/filesystems/ext4/blockgroup.rst index ed5a5cac6d40..7cbf0b2b778e 100644 --- a/Documentation/filesystems/ext4/blockgroup.rst +++ b/Documentation/filesystems/ext4/blockgroup.rst @@ -1,7 +1,10 @@ .. SPDX-License-Identifier: GPL-2.0 +Block Groups +------------ + Layout ------- +~~~~~~ The layout of a standard block group is approximately as follows (each of these fields is discussed in a separate section below): @@ -60,7 +63,7 @@ groups (flex_bg). Leftover space is used for file data blocks, indirect block maps, extent tree blocks, and extended attributes. Flexible Block Groups ---------------------- +~~~~~~~~~~~~~~~~~~~~~ Starting in ext4, there is a new feature called flexible block groups (flex_bg).
In a flex_bg, several block groups are tied together as one @@ -78,7 +81,7 @@ if flex_bg is enabled. The number of block groups that make up a flex_bg is given by 2 ^ ``sb.s_log_groups_per_flex``. Meta Block Groups ------------------ +~~~~~~~~~~~~~~~~~ Without the option META_BG, for safety concerns, all block group descriptor copies are kept in the first block group. Given the default @@ -117,7 +120,7 @@ Please see an important note about ``BLOCK_UNINIT`` in the section about block and inode bitmaps. Lazy Block Group Initialization -------------------------------- +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ A new feature for ext4 is three block group descriptor flags that enable mkfs to skip initializing other parts of the block group diff --git a/Documentation/filesystems/ext4/dynamic.rst b/Documentation/filesystems/ext4/dynamic.rst index bb0c84333341..bbad439aada2 100644 --- a/Documentation/filesystems/ext4/dynamic.rst +++ b/Documentation/filesystems/ext4/dynamic.rst @@ -6,7 +6,9 @@ Dynamic Structures Dynamic metadata are created on the fly when files and blocks are allocated to files. -.. include:: inodes.rst -.. include:: ifork.rst -.. include:: directory.rst -.. include:: attributes.rst +.. toctree:: + + inodes + ifork + directory + attributes diff --git a/Documentation/filesystems/ext4/globals.rst b/Documentation/filesystems/ext4/globals.rst index b17418974fd3..c6a6abce818a 100644 --- a/Documentation/filesystems/ext4/globals.rst +++ b/Documentation/filesystems/ext4/globals.rst @@ -6,9 +6,12 @@ Global Structures The filesystem is sharded into a number of block groups, each of which has static metadata at fixed locations. -.. include:: super.rst -.. include:: group_descr.rst -.. include:: bitmaps.rst -.. include:: mmp.rst -.. include:: journal.rst -.. include:: orphan.rst +.. toctree:: + + super + group_descr + bitmaps + inode_table + mmp + journal + orphan diff --git a/Documentation/filesystems/ext4/index.rst b/Documentation/filesystems/ext4/index.rst index 705d813d558f..1ff8150c50e9 100644 --- a/Documentation/filesystems/ext4/index.rst +++ b/Documentation/filesystems/ext4/index.rst @@ -5,7 +5,7 @@ ext4 Data Structures and Algorithms =================================== .. toctree:: - :maxdepth: 6 + :maxdepth: 2 :numbered: about diff --git a/Documentation/filesystems/ext4/inode_table.rst b/Documentation/filesystems/ext4/inode_table.rst new file mode 100644 index 000000000000..f7900a52c0d5 --- /dev/null +++ b/Documentation/filesystems/ext4/inode_table.rst @@ -0,0 +1,9 @@ +.. SPDX-License-Identifier: GPL-2.0 + +Inode Table +----------- + +Inode tables are statically allocated at mkfs time. Each block group +descriptor points to the start of the table, and the superblock records +the number of inodes per group. See :doc:`inode documentation <inodes>` +for more information on inode table layout. diff --git a/Documentation/filesystems/ext4/overview.rst b/Documentation/filesystems/ext4/overview.rst index 9d4054c17ecb..171c3963d7f6 100644 --- a/Documentation/filesystems/ext4/overview.rst +++ b/Documentation/filesystems/ext4/overview.rst @@ -16,13 +16,15 @@ All fields in ext4 are written to disk in little-endian order. HOWEVER, all fields in jbd2 (the journal) are written to disk in big-endian order. -.. include:: blocks.rst -.. include:: blockgroup.rst -.. include:: special_inodes.rst -.. include:: allocators.rst -.. include:: checksums.rst -.. include:: bigalloc.rst -.. include:: inlinedata.rst -.. include:: eainode.rst -.. include:: verity.rst -.. include:: atomic_writes.rst +..
toctree:: + + blocks + blockgroup + special_inodes + allocators + checksums + bigalloc + inlinedata + eainode + verity + atomic_writes diff --git a/Documentation/filesystems/f2fs.rst b/Documentation/filesystems/f2fs.rst index 440e4ae74e44..8eeb7ea14f61 100644 --- a/Documentation/filesystems/f2fs.rst +++ b/Documentation/filesystems/f2fs.rst @@ -218,7 +218,7 @@ mode=%s Control block allocation mode which supports "adaptive" fragmentation/after-GC situation itself. The developers use these modes to understand filesystem fragmentation/after-GC condition well, and eventually get some insights to handle them better. - In "fragment:segment", f2fs allocates a new segment in ramdom + In "fragment:segment", f2fs allocates a new segment in random position. With this, we can simulate the after-GC condition. In "fragment:block", we can scatter block allocation with "max_fragment_chunk" and "max_fragment_hole" sysfs nodes. @@ -261,7 +261,7 @@ test_dummy_encryption=%s The argument may be either "v1" or "v2", in order to select the corresponding fscrypt policy version. checkpoint=%s[:%u[%]] Set to "disable" to turn off checkpointing. Set to "enable" - to reenable checkpointing. Is enabled by default. While + to re-enable checkpointing. Is enabled by default. While disabled, any unmounting or unexpected shutdowns will cause the filesystem contents to appear as they did when the filesystem was mounted with that option. diff --git a/Documentation/filesystems/fscrypt.rst b/Documentation/filesystems/fscrypt.rst index 29e84d125e02..696a5844bfa3 100644 --- a/Documentation/filesystems/fscrypt.rst +++ b/Documentation/filesystems/fscrypt.rst @@ -147,9 +147,8 @@ However, these ioctls have some limitations: were wiped. To partially solve this, you can add init_on_free=1 to your kernel command line. However, this has a performance cost. -- Secret keys might still exist in CPU registers, in crypto - accelerator hardware (if used by the crypto API to implement any of - the algorithms), or in other places not explicitly considered here. +- Secret keys might still exist in CPU registers or in other places + not explicitly considered here. Full system compromise ~~~~~~~~~~~~~~~~~~~~~~ @@ -406,9 +405,12 @@ the work is done by XChaCha12, which is much faster than AES when AES acceleration is unavailable. For more information about Adiantum, see `the Adiantum paper <https://eprint.iacr.org/2018/720.pdf>`_. -The (AES-128-CBC-ESSIV, AES-128-CBC-CTS) pair exists only to support -systems whose only form of AES acceleration is an off-CPU crypto -accelerator such as CAAM or CESA that does not support XTS. +The (AES-128-CBC-ESSIV, AES-128-CBC-CTS) pair was added to try to +provide a more efficient option for systems that lack AES instructions +in the CPU but do have a non-inline crypto engine such as CAAM or CESA +that supports AES-CBC (and not AES-XTS). This is deprecated. It has +been shown that just doing AES on the CPU is actually faster. +Moreover, Adiantum is faster still and is recommended on such systems. The remaining mode pairs are the "national pride ciphers": @@ -468,14 +470,6 @@ API, but the filenames mode still does. - Recommended: - AES-CBC acceleration -fscrypt also uses HMAC-SHA512 for key derivation, so enabling SHA-512 -acceleration is recommended: - -- SHA-512 - - Recommended: - - arm64: CONFIG_CRYPTO_SHA512_ARM64_CE - - x86: CONFIG_CRYPTO_SHA512_SSSE3 - Contents encryption ------------------- @@ -1326,22 +1320,13 @@ this by validating all top-level encryption policies prior to access. 
Inline encryption support ========================= -By default, fscrypt uses the kernel crypto API for all cryptographic -operations (other than HKDF, which fscrypt partially implements -itself). The kernel crypto API supports hardware crypto accelerators, -but only ones that work in the traditional way where all inputs and -outputs (e.g. plaintexts and ciphertexts) are in memory. fscrypt can -take advantage of such hardware, but the traditional acceleration -model isn't particularly efficient and fscrypt hasn't been optimized -for it. - -Instead, many newer systems (especially mobile SoCs) have *inline -encryption hardware* that can encrypt/decrypt data while it is on its -way to/from the storage device. Linux supports inline encryption -through a set of extensions to the block layer called *blk-crypto*. -blk-crypto allows filesystems to attach encryption contexts to bios -(I/O requests) to specify how the data will be encrypted or decrypted -in-line. For more information about blk-crypto, see +Many newer systems (especially mobile SoCs) have *inline encryption +hardware* that can encrypt/decrypt data while it is on its way to/from +the storage device. Linux supports inline encryption through a set of +extensions to the block layer called *blk-crypto*. blk-crypto allows +filesystems to attach encryption contexts to bios (I/O requests) to +specify how the data will be encrypted or decrypted in-line. For more +information about blk-crypto, see :ref:`Documentation/block/inline-encryption.rst <inline_encryption>`. On supported filesystems (currently ext4 and f2fs), fscrypt can use diff --git a/Documentation/filesystems/fsverity.rst b/Documentation/filesystems/fsverity.rst index dacdbc1149e6..412cf11e3298 100644 --- a/Documentation/filesystems/fsverity.rst +++ b/Documentation/filesystems/fsverity.rst @@ -185,8 +185,7 @@ FS_IOC_ENABLE_VERITY can fail with the following errors: - ``ENOKEY``: the ".fs-verity" keyring doesn't contain the certificate needed to verify the builtin signature - ``ENOPKG``: fs-verity recognizes the hash algorithm, but it's not - available in the kernel's crypto API as currently configured (e.g. - for SHA-512, missing CONFIG_CRYPTO_SHA512). + available in the kernel as currently configured - ``ENOTTY``: this type of filesystem does not implement fs-verity - ``EOPNOTSUPP``: the kernel was not configured with fs-verity support; or the filesystem superblock has not had the 'verity' diff --git a/Documentation/filesystems/iomap/design.rst b/Documentation/filesystems/iomap/design.rst index f2df9b6df988..0f7672676c0b 100644 --- a/Documentation/filesystems/iomap/design.rst +++ b/Documentation/filesystems/iomap/design.rst @@ -167,7 +167,6 @@ structure below: struct dax_device *dax_dev; void *inline_data; void *private; - const struct iomap_folio_ops *folio_ops; u64 validity_cookie; }; @@ -292,8 +291,6 @@ The fields are as follows: <https://lore.kernel.org/all/20180619164137.13720-7-hch@lst.de/>`_. This value will be passed unchanged to ``->iomap_end``. - * ``folio_ops`` will be covered in the section on pagecache operations. - * ``validity_cookie`` is a magic freshness value set by the filesystem that should be used to detect stale mappings. 
For pagecache operations this is critical for correct operation diff --git a/Documentation/filesystems/iomap/operations.rst b/Documentation/filesystems/iomap/operations.rst index 3b628e370d88..067ed8e14ef3 100644 --- a/Documentation/filesystems/iomap/operations.rst +++ b/Documentation/filesystems/iomap/operations.rst @@ -57,21 +57,19 @@ The following address space operations can be wrapped easily: * ``bmap`` * ``swap_activate`` -``struct iomap_folio_ops`` +``struct iomap_write_ops`` -------------------------- -The ``->iomap_begin`` function for pagecache operations may set the -``struct iomap::folio_ops`` field to an ops structure to override -default behaviors of iomap: - .. code-block:: c - struct iomap_folio_ops { + struct iomap_write_ops { struct folio *(*get_folio)(struct iomap_iter *iter, loff_t pos, unsigned len); void (*put_folio)(struct inode *inode, loff_t pos, unsigned copied, struct folio *folio); bool (*iomap_valid)(struct inode *inode, const struct iomap *iomap); + int (*read_folio_range)(const struct iomap_iter *iter, + struct folio *folio, loff_t pos, size_t len); }; iomap calls these functions: @@ -127,6 +125,10 @@ iomap calls these functions: ``->iomap_valid``, then the iomap should be considered stale and the validation failed. + - ``read_folio_range``: Called to synchronously read in the range that will + be written to. If this function is not provided, iomap will default to + submitting a bio read request. + These ``struct kiocb`` flags are significant for buffered I/O with iomap: * ``IOCB_NOWAIT``: Turns on ``IOMAP_NOWAIT``. @@ -271,7 +273,7 @@ writeback. It does not lock ``i_rwsem`` or ``invalidate_lock``. The dirty bit will be cleared for all folios run through the -``->map_blocks`` machinery described below even if the writeback fails. +``->writeback_range`` machinery described below even if the writeback fails. This is to prevent dirty folio clots when storage devices fail; an ``-EIO`` is recorded for userspace to collect via ``fsync``. @@ -283,15 +285,14 @@ The ``ops`` structure must be specified and is as follows: .. code-block:: c struct iomap_writeback_ops { - int (*map_blocks)(struct iomap_writepage_ctx *wpc, struct inode *inode, - loff_t offset, unsigned len); - int (*submit_ioend)(struct iomap_writepage_ctx *wpc, int status); - void (*discard_folio)(struct folio *folio, loff_t pos); + int (*writeback_range)(struct iomap_writepage_ctx *wpc, + struct folio *folio, u64 pos, unsigned int len, u64 end_pos); + int (*writeback_submit)(struct iomap_writepage_ctx *wpc, int error); }; The fields are as follows: - - ``map_blocks``: Sets ``wpc->iomap`` to the space mapping of the file + - ``writeback_range``: Sets ``wpc->iomap`` to the space mapping of the file range (in bytes) given by ``pos`` and ``len``. iomap calls this function for each dirty fs block in each dirty folio, though it will `reuse mappings @@ -306,27 +307,26 @@ The fields are as follows: This revalidation must be open-coded by the filesystem; it is unclear if ``iomap::validity_cookie`` can be reused for this purpose. - This function must be supplied by the filesystem. - - - ``submit_ioend``: Allows the file systems to hook into writeback bio - submission. - This might include pre-write space accounting updates, or installing - a custom ``->bi_end_io`` function for internal purposes, such as - deferring the ioend completion to a workqueue to run metadata update - transactions from process context before submitting the bio. - This function is optional.
- - ``discard_folio``: iomap calls this function after ``->map_blocks`` - fails to schedule I/O for any part of a dirty folio. - The function should throw away any reservations that may have been - made for the write. + If this method fails to schedule I/O for any part of a dirty folio, it + should throw away any reservations that may have been made for the write. The folio will be marked clean and an ``-EIO`` recorded in the pagecache. Filesystems can use this callback to `remove <https://lore.kernel.org/all/20201029163313.1766967-1-bfoster@redhat.com/>`_ delalloc reservations to avoid having delalloc reservations for clean pagecache. - This function is optional. + This function must be supplied by the filesystem. + + - ``writeback_submit``: Submit the previously built writeback context. + Block based file systems should use the iomap_ioend_writeback_submit + helper; other file systems can implement their own. + File systems can optionally hook into writeback bio submission. + This might include pre-write space accounting updates, or installing + a custom ``->bi_end_io`` function for internal purposes, such as + deferring the ioend completion to a workqueue to run metadata update + transactions from process context before submitting the bio. + This function must be supplied by the filesystem. Pagecache Writeback Completion ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -340,10 +340,9 @@ If the write failed, it will also set the error bits on the folios and the address space. This can happen in interrupt or process context, depending on the storage device. - Filesystems that need to update internal bookkeeping (e.g. unwritten -extent conversions) should provide a ``->submit_ioend`` function to -set ``struct iomap_end::bio::bi_end_io`` to its own function. +extent conversions) should set their own bi_end_io on the bios +submitted by ``->writeback_submit``. This function should call ``iomap_finish_ioends`` after finishing its own work (e.g. unwritten extent conversion).
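The reworked writeback ops above lend themselves to a short illustration. Below is a minimal sketch of how a block-based filesystem might wire up the two callbacks, following the signatures quoted in the hunk above. Every ``myfs_*`` name is hypothetical, and the use of ``iomap_add_to_ioend()`` to queue the mapped range into the ioend being built is an assumption about the helper block-based filesystems use, not a statement of this patch's API:

.. code-block:: c

    /*
     * Sketch only: myfs_map_range() and myfs_cancel_reservations() stand
     * in for the filesystem's own block mapping and delalloc bookkeeping.
     */
    static int myfs_writeback_range(struct iomap_writepage_ctx *wpc,
                                    struct folio *folio, u64 pos,
                                    unsigned int len, u64 end_pos)
    {
        int error;

        /* Fill wpc->iomap with a mapping covering [pos, pos + len). */
        error = myfs_map_range(wpc, pos, len);
        if (!error)
            /* Queue the now-mapped range into the ioend being built. */
            error = iomap_add_to_ioend(wpc, folio, pos, end_pos, len);
        if (error)
            /* Could not schedule I/O: drop reservations made for it. */
            myfs_cancel_reservations(wpc, pos, len);
        return error;
    }

    static const struct iomap_writeback_ops myfs_writeback_ops = {
        .writeback_range  = myfs_writeback_range,
        /* Block based, no completion hooks: the generic helper suffices. */
        .writeback_submit = iomap_ioend_writeback_submit,
    };

A filesystem that needs completion-time work (such as unwritten extent conversion) would instead provide its own ``writeback_submit`` that installs a custom ``bi_end_io`` before submitting, as the completion section above describes.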
diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst index 2e567e341c3b..aa287ccdac2f 100644 --- a/Documentation/filesystems/locking.rst +++ b/Documentation/filesystems/locking.rst @@ -87,8 +87,8 @@ prototypes:: int (*tmpfile) (struct mnt_idmap *, struct inode *, struct file *, umode_t); int (*fileattr_set)(struct mnt_idmap *idmap, - struct dentry *dentry, struct fileattr *fa); - int (*fileattr_get)(struct dentry *dentry, struct fileattr *fa); + struct dentry *dentry, struct file_kattr *fa); + int (*fileattr_get)(struct dentry *dentry, struct file_kattr *fa); struct posix_acl * (*get_acl)(struct mnt_idmap *, struct dentry *, int); struct offset_ctx *(*get_offset_ctx)(struct inode *inode); @@ -253,10 +253,10 @@ prototypes:: int (*writepages)(struct address_space *, struct writeback_control *); bool (*dirty_folio)(struct address_space *, struct folio *folio); void (*readahead)(struct readahead_control *); - int (*write_begin)(struct file *, struct address_space *mapping, + int (*write_begin)(const struct kiocb *, struct address_space *mapping, loff_t pos, unsigned len, struct folio **foliop, void **fsdata); - int (*write_end)(struct file *, struct address_space *mapping, + int (*write_end)(const struct kiocb *, struct address_space *mapping, loff_t pos, unsigned len, unsigned copied, struct folio *folio, void *fsdata); sector_t (*bmap)(struct address_space *, sector_t); diff --git a/Documentation/filesystems/overlayfs.rst b/Documentation/filesystems/overlayfs.rst index 4133a336486d..ab989807a2cb 100644 --- a/Documentation/filesystems/overlayfs.rst +++ b/Documentation/filesystems/overlayfs.rst @@ -9,7 +9,7 @@ Overlay Filesystem This document describes a prototype for a new approach to providing overlay-filesystem functionality in Linux (sometimes referred to as union-filesystems). An overlay-filesystem tries to present a -filesystem which is the result over overlaying one filesystem on top +filesystem which is the result of overlaying one filesystem on top of the other. @@ -61,7 +61,7 @@ Inode properties |Configuration | Persistent | Uniform | st_ino == d_ino | d_ino == i_ino | | | st_ino | st_dev | | [*] | +==============+=====+======+=====+======+========+========+========+=======+ -| | dir | !dir | dir | !dir | dir + !dir | dir | !dir | +| | dir | !dir | dir | !dir | dir | !dir | dir | !dir | +--------------+-----+------+-----+------+--------+--------+--------+-------+ | All layers | Y | Y | Y | Y | Y | Y | Y | Y | | on same fs | | | | | | | | | @@ -425,7 +425,7 @@ of information from up to three different layers: The "lower data" file can be on any lower layer, except from the top most lower layer. -Below the top most lower layer, any number of lower most layers may be defined +Below the topmost lower layer, any number of lowermost layers may be defined as "data-only" lower layers, using double colon ("::") separators. A normal lower layer is not allowed to be below a data-only layer, so single colon separators are not allowed to the right of double colon ("::") separators. @@ -445,8 +445,8 @@ to the absolute path of the "lower data" file in the "data-only" lower layer. Instead of explicitly enabling "metacopy=on" it is sufficient to specify at least one data-only layer to enable redirection of data to a data-only layer. -In this case other forms of metacopy are rejected. Note: this way data-only -layers may be used toghether with "userxattr", in which case careful attention +In this case other forms of metacopy are rejected. 
Note: this way, data-only +layers may be used together with "userxattr", in which case careful attention must be given to privileges needed to change the "user.overlay.redirect" xattr to prevent misuse. @@ -515,7 +515,7 @@ supports these values: The metacopy digest is never generated or used. This is the default if verity option is not specified. - "on": - Whenever a metacopy files specifies an expected digest, the + Whenever a metacopy file specifies an expected digest, the corresponding data file must match the specified digest. When generating a metacopy file the verity digest will be set in it based on the source file (if it has one). @@ -537,7 +537,7 @@ Using an upper layer path and/or a workdir path that are already used by another overlay mount is not allowed and may fail with EBUSY. Using partially overlapping paths is not allowed and may fail with EBUSY. If files are accessed from two overlayfs mounts which share or overlap the -upper layer and/or workdir path the behavior of the overlay is undefined, +upper layer and/or workdir path, the behavior of the overlay is undefined, though it will not result in a crash or deadlock. Mounting an overlay using an upper layer path, where the upper layer path @@ -778,7 +778,7 @@ controlled by the "uuid" mount option, which supports these values: - "auto": (default) UUID is taken from xattr "trusted.overlay.uuid" if it exists. Upgrade to "uuid=on" on first time mount of new overlay filesystem that - meets the prerequites. + meets the prerequisites. Downgrade to "uuid=null" for existing overlay filesystems that were never mounted with "uuid=on". @@ -794,20 +794,20 @@ without significant effort. The advantage of mounting with the "volatile" option is that all forms of sync calls to the upper filesystem are omitted. -In order to avoid a giving a false sense of safety, the syncfs (and fsync) +In order to avoid giving a false sense of safety, the syncfs (and fsync) semantics of volatile mounts are slightly different than that of the rest of VFS. If any writeback error occurs on the upperdir's filesystem after a volatile mount takes place, all sync functions will return an error. Once this condition is reached, the filesystem will not recover, and every subsequent sync -call will return an error, even if the upperdir has not experience a new error +call will return an error, even if the upperdir has not experienced a new error since the last sync call. When overlay is mounted with "volatile" option, the directory "$workdir/work/incompat/volatile" is created. During next mount, overlay checks for this directory and refuses to mount if present. This is a strong -indicator that user should throw away upper and work directories and create -fresh one. In very limited cases where the user knows that the system has -not crashed and contents of upperdir are intact, The "volatile" directory +indicator that the user should discard upper and work directories and create +fresh ones. In very limited cases where the user knows that the system has +not crashed and contents of upperdir are intact, the "volatile" directory can be removed. diff --git a/Documentation/filesystems/porting.rst b/Documentation/filesystems/porting.rst index 3616d7161dab..85f590254f07 100644 --- a/Documentation/filesystems/porting.rst +++ b/Documentation/filesystems/porting.rst @@ -1224,9 +1224,6 @@ lookup_noperm_unlocked(), lookup_noperm_positive_unlocked(). They now take a qstr instead of separate name and length. QSTR() can be used when strlen() is needed for the length. 
-For try_lookup_noperm() a reference to the qstr is passed in case the -hash might subsequently be needed. - These functions no longer do any permission checking - they previously checked that the caller has 'X' permission on the parent. They must ONLY be used internally by a filesystem on itself when it knows that @@ -1249,3 +1246,42 @@ Using try_lookup_noperm() will require linux/namei.h to be included. Calling conventions for ->d_automount() have changed; we should *not* grab an extra reference to new mount - it should be returned with refcount 1. + +--- + +collect_mounts()/drop_collected_mounts()/iterate_mounts() are gone now. +Replacement is collect_paths()/drop_collected_paths(), with no special +iterator needed. Instead of a cloned mount tree, the new interface returns +an array of struct path, one for each mount collect_mounts() would've +created. These struct path point to locations in the caller's namespace +that would be roots of the cloned mounts. + +--- + +**mandatory** + +If your filesystem sets the default dentry_operations, use set_default_d_op() +rather than manually setting sb->s_d_op. + +--- + +**mandatory** + +d_set_d_op() is no longer exported (or public, for that matter); _if_ +your filesystem really needed that, make use of d_splice_alias_ops() +to have them set. Better yet, think hard whether you need different +->d_op for different dentries - if not, just use set_default_d_op() +at mount time and be done with that. Currently procfs is the only +thing that really needs ->d_op varying between dentries. + +--- + +**highly recommended** + +The file operations mmap() callback is deprecated in favour of +mmap_prepare(). This passes a pointer to a vm_area_desc to the callback +rather than a VMA, as the VMA at this stage is not yet valid. + +The vm_area_desc provides the minimum required information for a filesystem +to initialise state upon memory mapping of a file-backed region, and output +parameters for the file system to set this state. diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst index 2a17865dfe39..2971551b7235 100644 --- a/Documentation/filesystems/proc.rst +++ b/Documentation/filesystems/proc.rst @@ -584,7 +584,6 @@ encoded manner. The codes are the following: ms may share gd stack segment grows down pf pure PFN range - dw disabled write to the mapped file lo pages are locked in memory io memory mapped I/O area sr sequential read advise provided @@ -607,8 +606,11 @@ encoded manner. The codes are the following: mt arm64 MTE allocation tags are enabled um userfaultfd missing tracking uw userfaultfd wr-protect tracking + ui userfaultfd minor fault ss shadow/guarded control stack page sl sealed + lf lock on fault pages + dp always lazily freeable mapping == ======================================= Note that there is no guarantee that every flag and associated mnemonic will @@ -1194,12 +1196,14 @@ SecPageTables Memory consumed by secondary page tables, this currently includes KVM mmu and IOMMU allocations on x86 and arm64. NFS_Unstable - Always zero. Previous counted pages which had been written to + Always zero. Previously counted pages which had been written to the server, but have not been committed to stable storage. Bounce - Memory used for block device "bounce buffers" + Always zero. Previously memory used for block device + "bounce buffers". WritebackTmp - Memory used by FUSE for temporary writeback buffers + Always zero. Previously memory used by FUSE for temporary + writeback buffers.
CommitLimit Based on the overcommit ratio ('vm.overcommit_ratio'), this is the total amount of memory currently available to diff --git a/Documentation/filesystems/propagate_umount.txt b/Documentation/filesystems/propagate_umount.txt new file mode 100644 index 000000000000..c90349e5b889 --- /dev/null +++ b/Documentation/filesystems/propagate_umount.txt @@ -0,0 +1,484 @@ + Notes on propagate_umount() + +Umount propagation starts with a set of mounts we are already going to +take out. Ideally, we would like to add all downstream cognates to +that set - anything with the same mountpoint as one of the removed +mounts and with a parent that would receive events from the parent of that +mount. However, there are some constraints the resulting set must +satisfy. + +It is convenient to define several properties of sets of mounts: + +1) A set S of mounts is non-shifting if for any mount X belonging +to S all subtrees mounted strictly inside of X (i.e. not overmounting +the root of X) contain only elements of S. + +2) A set S is non-revealing if all locked mounts that belong to S have +parents that also belong to S. + +3) A set S is closed if it contains all children of its elements. + +The set of mounts taken out by umount(2) must be non-shifting and +non-revealing; the first constraint is what allows us to reparent +any remaining mounts and the second is what prevents the exposure +of any concealed mountpoints. + +propagate_umount() takes the original set as an argument and tries to +extend that set. The original set is a full subtree and its root is +unlocked; what matters is that it's closed and non-revealing. +The resulting set may not be closed; there might still be mounts outside +of that set, but only on top of stacks of root-overmounting elements +of the set. They can be reparented to the place where the bottom of +the stack is attached to a mount that will survive. NOTE: doing that +will violate a constraint on having no more than one mount with +the same parent/mountpoint pair; however, the caller (umount_tree()) +will immediately remedy that - it may keep an unmounted element attached +to its parent, but only if the parent itself is unmounted. Since all +conflicts created by reparenting have a common parent *not* in the +set and one side of the conflict (bottom of the stack of overmounts) +is in the set, it will be resolved. However, we rely upon umount_tree() +doing that pretty much immediately after the call of propagate_umount(). + +The algorithm is based on two statements: + 1) for any set S, there is a maximal non-shifting subset of S +and it can be calculated in O(#S) time. + 2) for any non-shifting set S, there is a maximal non-revealing +subset of S. That subset is also non-shifting and it can be calculated +in O(#S) time. + + Finding candidates. + +We are given a closed set U and we want to find all mounts that have +the same mountpoint as some mount m in U *and* whose parent receives +propagation from the parent of the same mount m. A naive implementation +would be + S = {} + for each m in U + add m to S + p = parent(m) + for each q in Propagation(p) - {p} + child = look_up(q, mountpoint(m)) + if child + add child to S +but that can lead to excessive work - there might be propagation among the +subtrees of U, in which case we'd end up examining the same candidates +many times. Since propagation is transitive, the same will happen to +everything downstream of that candidate and it's not hard to construct +cases where the approach above leads to time quadratic in the actual +number of candidates.
+ +Note that if we run into a candidate we'd already seen, it must've been +added on an earlier iteration of the outer loop - all additions made +during one iteration of the outer loop have different parents. So +if we find a child already added to the set, we know that everything +in Propagation(parent(child)) with the same mountpoint has already been +added. + S = {} + for each m in U + if m in S + continue + add m to S + p = parent(m) + q = propagation_next(p, p) + while q + child = look_up(q, mountpoint(m)) + if child + if child in S + q = skip_them(q, p) + continue; + add child to S + q = propagation_next(q, p) +where +skip_them(q, p) + keep walking Propagation(p) from q until we find something + not in Propagation(q) + +would get rid of that problem, but we need a sane implementation of +skip_them(). That's not hard to do - split propagation_next() into +"down into mnt_slave_list" and "forward-and-up" parts, with the +skip_them() being "repeat the forward-and-up part until we get NULL +or something that isn't a peer of the one we are skipping". + +Note that there can be no absolute roots among the extra candidates - +they all come from mount lookups. Absolute root among the original +set is _currently_ impossible, but it might be worth protecting +against. + + Maximal non-shifting subsets. + +Let's call a mount m in a set S forbidden in that set if there is a +subtree mounted strictly inside m and containing mounts that do not +belong to S. + +The set is non-shifting when none of its elements are forbidden in it. + +If mount m is forbidden in a set S, it is forbidden in any subset S' it +belongs to. In other words, it can't belong to any of the non-shifting +subsets of S. If we had a way to find a forbidden mount or show that +there's none, we could use it to find the maximal non-shifting subset +simply by finding and removing them until none remain. + +Suppose mount m is forbidden in S; then any mounts forbidden in S - {m} +must have been forbidden in S itself. Indeed, since m has descendants +that do not belong to S, any subtree that fits into S will fit into +S - {m} as well. + +So in principle we could go through elements of S, checking if they +are forbidden in S and removing the ones that are. Removals will +not invalidate the checks done for earlier mounts - if they were not +forbidden at the time we checked, they won't become forbidden later. +It's too costly to be practical, but there is a similar approach that +is linear in the size of S. + +Let's say that mount x in a set S is forbidden by mount y, if + * both x and y belong to S. + * there is a chain of mounts starting at x and leaving S + immediately after passing through y, with the first + mountpoint strictly inside x. +Note 1: x may be equal to y - that's the case when something not +belonging to S is mounted strictly inside x. +Note 2: if y does not belong to S, it can't forbid anything in S. +Note 3: if y has no children outside of S, it can't forbid anything in S. + +It's easy to show that mount x is forbidden in S if and only if x is +forbidden in S by some mount y. And it's easy to find all mounts in S +forbidden by a given mount. + +Consider the following operation: + Trim(S, m) = S - {x : x is forbidden by m in S} + +Note that if m does not belong to S or has no children outside of S we +are guaranteed that Trim(S, m) is equal to S. + +The following is true: if x is forbidden by y in Trim(S, m), it was +already forbidden by y in S. + +Proof: Suppose x is forbidden by y in Trim(S, m).
Then there is a + chain of mounts (x_0 = x, ..., x_k = y, x_{k+1} = r), such that x_{k+1} +is the first element that doesn't belong to Trim(S, m) and the +mountpoint of x_1 is strictly inside x. If mount r belongs to S, it must +have been removed by Trim(S, m), i.e. it was forbidden in S by m. +Then there was a mount chain from r to some child of m that stayed in +S all the way until m, but that's impossible since x belongs to Trim(S, m) +and prepending (x_0, ..., x_k) to that chain demonstrates that x is also +forbidden in S by m, and thus can't belong to Trim(S, m). +Therefore r cannot belong to S and our chain demonstrates that +x is forbidden by y in S. QED. + +Corollary: no mount is forbidden by m in Trim(S, m). Indeed, any +such mount would have been forbidden by m in S and thus would have been +in the part of S removed in Trim(S, m). + +Corollary: no mount is forbidden by m in Trim(Trim(S, m), n). Indeed, +any such would have to have been forbidden by m in Trim(S, m), which +is impossible. + +Corollary: after + S = Trim(S, x_1) + S = Trim(S, x_2) + ... + S = Trim(S, x_k) +no mount remaining in S will be forbidden by either of x_1,...,x_k. + +The following will reduce S to its maximal non-shifting subset: + visited = {} + while S contains elements not belonging to visited + let m be an arbitrary such element of S + S = Trim(S, m) + add m to visited + +S never grows, so the number of elements of S not belonging to visited +decreases at least by one on each iteration. When the loop terminates, +all mounts remaining in S belong to visited. It's easy to see that at +the beginning of each iteration no mount remaining in S will be forbidden +by any element of visited. In other words, no mount remaining in S will +be forbidden, i.e. the final value of S will be non-shifting. It will be +the maximal non-shifting subset, since we were removing only forbidden +elements. + + There are two difficulties in implementing the above in linear +time, both due to the fact that Trim() might need to remove more than one +element. A naive implementation of Trim() is vulnerable to running into a +long chain of mounts, each mounted on top of its parent's root. Nothing in +that chain is forbidden, so nothing gets removed from it. We need to +recognize such chains and avoid walking them again on subsequent calls of +Trim(), otherwise we will end up with worst-case time being quadratic in +the number of elements in S. Another difficulty is in implementing the +outer loop - we need to iterate through all elements of a shrinking set. +That would be trivial if we never removed more than one element at a time +(a linked list, with list_for_each_entry_safe as the iterator), but we may +need to remove more than one entry, possibly including the ones we have +already visited. + + Let's start with a naive algorithm for Trim(): +
+Trim_one(m) + found = false + for each n in children(m) + if n not in S + found = true + if (mountpoint(n) != root(m)) + remove m from S + break + if found + Trim_ancestors(m) + +Trim_ancestors(m) + for (; parent(m) in S; m = parent(m)) { + if (mountpoint(m) != root(parent(m))) + remove parent(m) from S + } + +If m belongs to S, Trim_one(m) will replace S with Trim(S, m). +Proof: + Consider the chains excluding elements from Trim(S, m). The last +two elements in such a chain are m and some child of m that does not belong +to S. If m has no such children, Trim(S, m) is equal to S. + m itself is removed if and only if the chain has exactly two +elements, i.e.
when the last element does not overmount the root of m. +In other words, that happens when m has a child not in S that does not +overmount the root of m. + All other elements to remove will be ancestors of m, such that +the entire descent chain from them to m is contained in S. Let +(x_0, x_1, ..., x_k = m) be the longest such chain. x_i needs to be +removed if and only if x_{i+1} does not overmount its root. It's easy +to see that Trim_ancestors(m) will iterate through that chain from +x_k to x_1 and that it will remove exactly the elements that need to be +removed. + + Note that if the loop in Trim_ancestors() walks into an already +visited element, we are guaranteed that remaining iterations will see +only elements that had already been visited and remove none of them. +That's the weakness that makes it vulnerable to long chains of full +overmounts. + + It's easy to deal with if we can afford setting marks on +elements of S; we would mark all elements already visited by +Trim_ancestors() and have it bail out as soon as it sees an already +marked element. + + The problems with iterating through the set can be dealt with in +several ways, depending upon the representation we choose for our set. +One useful observation is that we are given a closed subset in S - the +original set passed to propagate_umount(). Its elements can neither +forbid anything nor be forbidden by anything - all their descendants +belong to S, so they cannot occur anywhere in any excluding chain. +In other words, the elements of that subset will remain in S until +the end and Trim_one(S, m) is a no-op for all m from that subset. + + That suggests keeping S as a disjoint union of a closed set U +('will be unmounted, no matter what') and the set of all elements of +S that do not belong to U. That set ('candidates') is all we need +to iterate through. Let's represent it as a subset in a cyclic list, +consisting of all list elements that are marked as candidates (initially - +all of them). Then we could have Trim_ancestors() only remove the mark, +leaving the elements on the list. Then Trim_one() would never remove +anything other than its argument from the containing list, allowing us to +use list_for_each_entry_safe() as the iterator. + + Assuming that representation, we get the following: + + list_for_each_entry_safe(m, ..., Candidates, ...) + Trim_one(m) +where +Trim_one(m) + if (m is not marked as a candidate) + strip the "seen by Trim_ancestors" mark from m + remove m from the Candidates list + return + + remove_this = false + found = false + for each n in children(m) + if n not in S + found = true + if (mountpoint(n) != root(m)) + remove_this = true + break + if found + Trim_ancestors(m) + if remove_this + strip the "seen by Trim_ancestors" mark from m + strip the "candidate" mark from m + remove m from the Candidates list + +Trim_ancestors(m) + for (p = parent(m); p is marked as candidate ; m = p, p = parent(p)) { + if m is marked as seen by Trim_ancestors + return + mark m as seen by Trim_ancestors + if (mountpoint(m) != root(p)) + strip the "candidate" mark from p + } + + The terminating condition in the loop in Trim_ancestors() is correct, +since that loop will never run into p belonging to U - p is always +an ancestor of the argument of Trim_one() and since U is closed, the argument +of Trim_one() would also have to belong to U. But Trim_one() is never +called for elements of U. In other words, p belongs to S if and only +if it belongs to candidates.
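As an aside, the trimming logic above is easy to experiment with outside the kernel. The following is a self-contained userspace toy model of Trim_one()/Trim_ancestors(): mounts are reduced to a parent pointer, a child array, and a flag recording whether a mount overmounts its parent's root, and plain boolean fields stand in for the mark bits and list membership that the real implementation in fs/pnode.c uses. It is an illustration of the pseudocode, not the kernel code itself.

    #include <stdbool.h>
    #include <stdio.h>

    struct mnt {
        const char *name;
        struct mnt *parent;
        struct mnt *children[8];
        int nr_children;
        bool overmounts_root;   /* mountpoint(m) == root(parent(m)) */
        bool in_S;              /* still a member of the set being trimmed */
        bool candidate;         /* the "candidate" mark */
        bool seen;              /* the "seen by Trim_ancestors" mark */
    };

    static void trim_ancestors(struct mnt *m)
    {
        for (struct mnt *p = m->parent; p && p->candidate; m = p, p = p->parent) {
            if (m->seen)        /* everything above was already walked */
                return;
            m->seen = true;
            if (!m->overmounts_root)    /* mountpoint(m) != root(p) */
                p->candidate = p->in_S = false;
        }
    }

    static void trim_one(struct mnt *m)
    {
        bool found = false, remove_this = false;

        for (int i = 0; i < m->nr_children; i++) {
            struct mnt *n = m->children[i];
            if (!n->in_S) {
                found = true;
                if (!n->overmounts_root) {      /* strictly inside m */
                    remove_this = true;
                    break;
                }
            }
        }
        if (found)
            trim_ancestors(m);
        if (remove_this)
            m->candidate = m->in_S = false;
    }

    static struct mnt pool[16];
    static int used;

    static struct mnt *mk(const char *name, struct mnt *parent, bool over)
    {
        struct mnt *m = &pool[used++];
        m->name = name;
        m->parent = parent;
        m->overmounts_root = over;
        m->in_S = m->candidate = true;
        if (parent)
            parent->children[parent->nr_children++] = m;
        return m;
    }

    int main(void)
    {
        /* b is mounted strictly inside a; c, not in S, strictly inside b */
        struct mnt *a = mk("a", NULL, false);
        struct mnt *b = mk("b", a, false);
        struct mnt *c = mk("c", b, false);

        c->in_S = c->candidate = false; /* c does not belong to the set */

        struct mnt *set[] = { a, b };
        for (int i = 0; i < 2; i++)
            if (set[i]->candidate)      /* Trim_one() skips non-candidates */
                trim_one(set[i]);
        for (int i = 0; i < 2; i++)
            printf("%s: %s\n", set[i]->name,
                   set[i]->in_S ? "stays" : "forbidden, removed");
        return 0;
    }

Running it reports that both a and b are removed: c sits strictly inside b, so b is forbidden, and the chain (a, b, c) shows that a is forbidden by b as well, matching the definition given earlier.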
+ + Time complexity: +* we get no more than O(#S) calls of Trim_one() +* the loop over children in Trim_one() never looks at the same child +twice through all the calls. +* iterations of that loop for children in S are no more than O(#S) +in the worst case +* at most two children that are not elements of S are considered per +call of Trim_one(). +* the loop in Trim_ancestors() sets its mark once per iteration and +no element of S has it set more than once. + + In the end we may have some elements excluded from S by +Trim_ancestors() still stuck on the list. We could do a separate +loop removing them from the list (also no worse than O(#S) time), +but it's easier to leave that until the next phase - there we will +iterate through the candidates anyway. + + The caller has already removed all elements of U from their parents' +lists of children, which means that checking if a child belongs to S is +equivalent to checking if it's marked as a candidate; we'll never see +the elements of U in the loop over children in Trim_one(). + + What's more, if we see that children(m) is empty and m is not +locked, we can immediately move m into the committed subset (remove +from the parent's list of children, etc.). That's one fewer mount we'll +have to look into when we check the list of children of its parent *and* +when we get to building the non-revealing subset. + + Maximal non-revealing subsets + +If S is not a non-revealing subset, there is a locked element x in S +such that the parent of x is not in S. + +Obviously, no non-revealing subset of S may contain x. Removing such +elements one by one will obviously end with the maximal non-revealing +subset (possibly an empty one). Note that removal of an element will +require removal of all its locked children, etc. + +If the set had been non-shifting, it will remain non-shifting after +such removals. +Proof: suppose S was non-shifting, x is a locked element of S, parent of x +is not in S and S - {x} is not non-shifting. Then there is an element m +in S - {x} and a subtree mounted strictly inside m, such that the subtree +contains an element not in S - {x}. Since S is non-shifting, everything in +that subtree must belong to S. But that means that this subtree must +contain x somewhere *and* that the parent of x either belongs to that subtree +or is equal to m. Either way it must belong to S. Contradiction. + +// same representation as for finding maximal non-shifting subsets: +// S is a disjoint union of a non-revealing set U (the ones we are committed +// to unmount) and a set of candidates, represented as a subset of list +// elements that have "is a candidate" mark on them. +// Elements of U are removed from their parents' lists of children. +// In the end candidates becomes empty and the maximal non-revealing +// non-shifting subset of S is now in U + while (Candidates list is non-empty) + handle_locked(first(Candidates)) + +handle_locked(m) + if m is not marked as a candidate + strip the "seen by Trim_ancestors" mark from m + remove m from the list + return + cutoff = m + for (p = m; p in candidates; p = parent(p)) { + strip the "seen by Trim_ancestors" mark from p + strip the "candidate" mark from p + remove p from the Candidates list + if (!locked(p)) + cutoff = parent(p) + } + if p in U + cutoff = p + while m != cutoff + remove m from children(parent(m)) + add m to U + m = parent(m) + +Let (x_0, ..., x_n = m) be the maximal chain of descent of m within S. +* If it contains some elements of U, let x_k be the last one of those.
+Then the union of U with {x_{k+1}, ..., x_n} is obviously non-revealing. +* otherwise if all its elements are locked, then none of {x_0, ..., x_n} +may be elements of a non-revealing subset of S. +* otherwise let x_k be the first unlocked element of the chain. Then none +of {x_0, ..., x_{k-1}} may be an element of a non-revealing subset of +S and the union of U and {x_k, ..., x_n} is non-revealing. + +handle_locked(m) finds which of these cases applies and adjusts Candidates +and U accordingly. U remains non-revealing, the union of Candidates and +U still contains any non-revealing subset of S and after the call of +handle_locked(m) m is guaranteed not to be in the Candidates list. So having +it called for each element of S would suffice to empty Candidates, +leaving U the maximal non-revealing subset of S. + +However, handle_locked(m) is a no-op when m belongs to U, so it's enough +to have it called for elements of the Candidates list until none remain. + +Time complexity: number of calls of handle_locked() is limited by +#Candidates, each iteration of the first loop in handle_locked() removes +an element from the list, so their total number of executions is also +limited by #Candidates; number of iterations in the second loop is no +greater than the number of iterations of the first loop. + + + Reparenting + +After we've calculated the final set, we still need to deal with +reparenting - if an element of the final set has a child not in it, +we need to reparent that child. + +Such children can only be root-overmounting (otherwise the set wouldn't +be non-shifting) and their parents cannot belong to the original set, +since the original is guaranteed to be closed. + + + Putting all of that together + +The plan is to + * find all candidates + * trim down to maximal non-shifting subset + * trim down to maximal non-revealing subset + * reparent anything that needs to be reparented + * return the resulting set to the caller + +For the 2nd and 3rd steps we want to separate the set into a growing +non-revealing subset, initially containing the original set ("U" in +terms of the pseudocode above) and everything we are still not sure about +("candidates"). It means that for the output of the 1st step we'd like +the extra candidates separated from the stuff already in the original set. +For the 4th step we would like the additions to U separate from the +original set. + +So let's go for + * original set ("set"). Linkage via mnt_list + * undecided candidates ("candidates"). Subset of a list, +consisting of all its elements marked with a new flag (T_UMOUNT_CANDIDATE). +Initially all elements of the list will be marked that way; in the +end the list will become empty and no mounts will remain marked with +that flag. + * Reuse T_MARKED for "has already been seen by trim_ancestors()". + * anything in U that hadn't been in the original set - elements of +candidates will gradually be either discarded or moved there. In other +words, it's the candidates we have already decided to unmount. Its role +is reasonably close to the old "to_umount", so let's use that name. +Linkage via mnt_list. + +For gather_candidates() we'll need to maintain both candidates (S - +set) and intersection of S with set. Use T_UMOUNT_CANDIDATE for +all elements we encounter, putting the ones not already in the original +set into the list of candidates. When we are done, strip that flag from +all elements of the original set. That gives a cheap way to check +if an element belongs to S (in gather_candidates) and to candidates +itself (at later stages).
Call that predicate is_candidate(); it would +be m->mnt_t_flags & T_UMOUNT_CANDIDATE. + +All elements of the original set are marked with MNT_UMOUNT and we'll +need the same for elements added when joining the contents of to_umount +to set in the end. Let's set MNT_UMOUNT at the time we add an element +to to_umount; that's close to what the old 'umount_one' is doing, so +let's keep that name. It also gives us another predicate we need - +"belongs to union of set and to_umount"; will_be_unmounted() for now. + +Removals from the candidates list should strip both T_MARKED and +T_UMOUNT_CANDIDATE; call it remove_from_candidates_list(). diff --git a/Documentation/filesystems/ubifs-authentication.rst b/Documentation/filesystems/ubifs-authentication.rst index 3d85ee88719a..106bb9c056f6 100644 --- a/Documentation/filesystems/ubifs-authentication.rst +++ b/Documentation/filesystems/ubifs-authentication.rst @@ -443,6 +443,6 @@ References [DM-VERITY] https://www.kernel.org/doc/Documentation/device-mapper/verity.rst -[FSCRYPT-POLICY2] https://www.spinics.net/lists/linux-ext4/msg58710.html +[FSCRYPT-POLICY2] https://lore.kernel.org/r/20171023214058.128121-1-ebiggers3@gmail.com/ [UBIFS-WP] http://www.linux-mtd.infradead.org/doc/ubifs_whitepaper.pdf diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst index fd32a9a17bfb..486a91633474 100644 --- a/Documentation/filesystems/vfs.rst +++ b/Documentation/filesystems/vfs.rst @@ -515,8 +515,8 @@ As of kernel 2.6.22, the following members are defined: struct posix_acl * (*get_acl)(struct mnt_idmap *, struct dentry *, int); int (*set_acl)(struct mnt_idmap *, struct dentry *, struct posix_acl *, int); int (*fileattr_set)(struct mnt_idmap *idmap, - struct dentry *dentry, struct fileattr *fa); - int (*fileattr_get)(struct dentry *dentry, struct fileattr *fa); + struct dentry *dentry, struct file_kattr *fa); + int (*fileattr_get)(struct dentry *dentry, struct file_kattr *fa); struct offset_ctx *(*get_offset_ctx)(struct inode *inode); }; @@ -758,8 +758,9 @@ process is more complicated and uses write_begin/write_end or dirty_folio to write data into the address_space, and writepages to writeback data to storage. -Adding and removing pages to/from an address_space is protected by the -inode's i_mutex. +Removing pages from an address_space requires holding the inode's i_rwsem +exclusively, while adding pages to the address_space requires holding the +inode's i_mapping->invalidate_lock exclusively. When data is written to a page, the PG_Dirty flag should be set. It typically remains set until writepages asks for it to be written. This @@ -822,10 +823,10 @@ cache in your filesystem. The following members are defined: int (*writepages)(struct address_space *, struct writeback_control *); bool (*dirty_folio)(struct address_space *, struct folio *); void (*readahead)(struct readahead_control *); - int (*write_begin)(struct file *, struct address_space *mapping, + int (*write_begin)(const struct kiocb *, struct address_space *mapping, loff_t pos, unsigned len, - struct page **pagep, void **fsdata); - int (*write_end)(struct file *, struct address_space *mapping, + struct page **pagep, void **fsdata); + int (*write_end)(const struct kiocb *, struct address_space *mapping, loff_t pos, unsigned len, unsigned copied, struct folio *folio, void *fsdata); sector_t (*bmap)(struct address_space *, sector_t); @@ -1071,12 +1072,14 @@ This describes how the VFS can manipulate an open file. 
As of kernel struct file_operations { struct module *owner; + fop_flags_t fop_flags; loff_t (*llseek) (struct file *, loff_t, int); ssize_t (*read) (struct file *, char __user *, size_t, loff_t *); ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *); ssize_t (*read_iter) (struct kiocb *, struct iov_iter *); ssize_t (*write_iter) (struct kiocb *, struct iov_iter *); - int (*iopoll)(struct kiocb *kiocb, bool spin); + int (*iopoll)(struct kiocb *kiocb, struct io_comp_batch *, + unsigned int flags); int (*iterate_shared) (struct file *, struct dir_context *); __poll_t (*poll) (struct file *, struct poll_table_struct *); long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long); @@ -1093,18 +1096,24 @@ This describes how the VFS can manipulate an open file. As of kernel int (*flock) (struct file *, int, struct file_lock *); ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int); ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int); - int (*setlease)(struct file *, long, struct file_lock **, void **); + void (*splice_eof)(struct file *file); + int (*setlease)(struct file *, int, struct file_lease **, void **); long (*fallocate)(struct file *file, int mode, loff_t offset, loff_t len); void (*show_fdinfo)(struct seq_file *m, struct file *f); #ifndef CONFIG_MMU unsigned (*mmap_capabilities)(struct file *); #endif - ssize_t (*copy_file_range)(struct file *, loff_t, struct file *, loff_t, size_t, unsigned int); + ssize_t (*copy_file_range)(struct file *, loff_t, struct file *, + loff_t, size_t, unsigned int); loff_t (*remap_file_range)(struct file *file_in, loff_t pos_in, struct file *file_out, loff_t pos_out, loff_t len, unsigned int remap_flags); int (*fadvise)(struct file *, loff_t, loff_t, int); + int (*uring_cmd)(struct io_uring_cmd *ioucmd, unsigned int issue_flags); + int (*uring_cmd_iopoll)(struct io_uring_cmd *, struct io_comp_batch *, + unsigned int poll_flags); + int (*mmap_prepare)(struct vm_area_desc *); }; Again, all methods are called without any locks being held, unless @@ -1144,7 +1153,8 @@ otherwise noted. used on 64 bit kernels. ``mmap`` - called by the mmap(2) system call + called by the mmap(2) system call. Deprecated in favour of + ``mmap_prepare``. ``open`` called by the VFS when an inode should be opened. When the VFS @@ -1221,6 +1231,11 @@ otherwise noted. ``fadvise`` possibly called by the fadvise64() system call. +``mmap_prepare`` + Called by the mmap(2) system call. Allows a VFS to set up a + file-backed memory mapping, most notably establishing relevant + private state and VMA callbacks. + Note that the file operations are implemented by the specific filesystem in which the inode resides. When opening a device node (character or block special) most filesystems will call special |