8 files changed, 269 insertions, 31 deletions
diff --git a/Documentation/filesystems/ext4/atomic_writes.rst b/Documentation/filesystems/ext4/atomic_writes.rst
new file mode 100644
index 000000000000..aeb47ace738d
--- /dev/null
+++ b/Documentation/filesystems/ext4/atomic_writes.rst
@@ -0,0 +1,225 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. _atomic_writes:
+
+Atomic Block Writes
+-------------------------
+
+Introduction
+~~~~~~~~~~~~
+
+Atomic (untorn) block writes ensure that either the entire write is committed
+to disk or none of it is. This prevents "torn writes" during power loss or
+system crashes. The ext4 filesystem supports atomic writes (only with Direct
+I/O) on regular files with extents, provided the underlying storage device
+supports hardware atomic writes. This is supported in the following two ways:
+
+1. **Single-fsblock Atomic Writes**:
+   EXT4's supports atomic write operations with a single filesystem block since
+   v6.13. In this the atomic write unit minimum and maximum sizes are both set
+   to filesystem blocksize.
+   e.g. doing atomic write of 16KB with 16KB filesystem blocksize on 64KB
+   pagesize system is possible.
+
+2. **Multi-fsblock Atomic Writes with Bigalloc**:
+   EXT4 now also supports atomic writes spanning multiple filesystem blocks
+   using a feature known as bigalloc. The atomic write unit's minimum and
+   maximum sizes are determined by the filesystem block size and cluster size,
+   based on the underlying device’s supported atomic write unit limits.
+
+Requirements
+~~~~~~~~~~~~
+
+Basic requirements for atomic writes in ext4:
+
+ 1. The extents feature must be enabled (default for ext4)
+ 2. The underlying block device must support atomic writes
+ 3. For single-fsblock atomic writes:
+
+    1. A filesystem with appropriate block size (up to the page size)
+ 4. For multi-fsblock atomic writes:
+
+    1. The bigalloc feature must be enabled
+    2. The cluster size must be appropriately configured
+
+NOTE: EXT4 does not support software or COW based atomic write, which means
+atomic writes on ext4 are only supported if underlying storage device supports
+it.
+
+Multi-fsblock Implementation Details
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The bigalloc feature changes ext4 to allocate in units of multiple filesystem
+blocks, also known as clusters. With bigalloc each bit within block bitmap
+represents cluster (power of 2 number of blocks) rather than individual
+filesystem blocks.
+EXT4 supports multi-fsblock atomic writes with bigalloc, subject to the
+following constraints. The minimum atomic write size is the larger of the fs
+block size and the minimum hardware atomic write unit; and the maximum atomic
+write size is smaller of the bigalloc cluster size and the maximum hardware
+atomic write unit.  Bigalloc ensures that all allocations are aligned to the
+cluster size, which satisfies the LBA alignment requirements of the hardware
+device if the start of the partition/logical volume is itself aligned correctly.
+
+Here is the block allocation strategy in bigalloc for atomic writes:
+
+ * For regions with fully mapped extents, no additional work is needed
+ * For append writes, a new mapped extent is allocated
+ * For regions that are entirely holes, unwritten extent is created
+ * For large unwritten extents, the extent gets split into two unwritten
+   extents of appropriate requested size
+ * For mixed mapping regions (combinations of holes, unwritten extents, or
+   mapped extents), ext4_map_blocks() is called in a loop with
+   EXT4_GET_BLOCKS_ZERO flag to convert the region into a single contiguous
+   mapped extent by writing zeroes to it and converting any unwritten extents to
+   written, if found within the range.
+
+Note: Writing on a single contiguous underlying extent, whether mapped or
+unwritten, is not inherently problematic. However, writing to a mixed mapping
+region (i.e. one containing a combination of mapped and unwritten extents)
+must be avoided when performing atomic writes.
+
+The reason is that, atomic writes when issued via pwritev2() with the RWF_ATOMIC
+flag, requires that either all data is written or none at all. In the event of
+a system crash or unexpected power loss during the write operation, the affected
+region (when later read) must reflect either the complete old data or the
+complete new data, but never a mix of both.
+
+To enforce this guarantee, we ensure that the write target is backed by
+a single, contiguous extent before any data is written. This is critical because
+ext4 defers the conversion of unwritten extents to written extents until the I/O
+completion path (typically in ->end_io()). If a write is allowed to proceed over
+a mixed mapping region (with mapped and unwritten extents) and a failure occurs
+mid-write, the system could observe partially updated regions after reboot, i.e.
+new data over mapped areas, and stale (old) data over unwritten extents that
+were never marked written. This violates the atomicity and/or torn write
+prevention guarantee.
+
+To prevent such torn writes, ext4 proactively allocates a single contiguous
+extent for the entire requested region in ``ext4_iomap_alloc`` via
+``ext4_map_blocks_atomic()``. EXT4 also force commits the current journalling
+transaction in case if allocation is done over mixed mapping. This ensures any
+pending metadata updates (like unwritten to written extents conversion) in this
+range are in consistent state with the file data blocks, before performing the
+actual write I/O. If the commit fails, the whole I/O must be aborted to prevent
+from any possible torn writes.
+Only after this step, the actual data write operation is performed by the iomap.
+
+Handling Split Extents Across Leaf Blocks
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+There can be a special edge case where we have logically and physically
+contiguous extents stored in separate leaf nodes of the on-disk extent tree.
+This occurs because on-disk extent tree merges only happens within the leaf
+blocks except for a case where we have 2-level tree which can get merged and
+collapsed entirely into the inode.
+If such a layout exists and, in the worst case, the extent status cache entries
+are reclaimed due to memory pressure, ``ext4_map_blocks()`` may never return
+a single contiguous extent for these split leaf extents.
+
+To address this edge case, a new get block flag
+``EXT4_GET_BLOCKS_QUERY_LEAF_BLOCKS flag`` is added to enhance the
+``ext4_map_query_blocks()`` lookup behavior.
+
+This new get block flag allows ``ext4_map_blocks()`` to first check if there is
+an entry in the extent status cache for the full range.
+If not present, it consults the on-disk extent tree using
+``ext4_map_query_blocks()``.
+If the located extent is at the end of a leaf node, it probes the next logical
+block (lblk) to detect a contiguous extent in the adjacent leaf.
+
+For now only one additional leaf block is queried to maintain efficiency, as
+atomic writes are typically constrained to small sizes
+(e.g. [blocksize, clustersize]).
+
+
+Handling Journal transactions
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+To support multi-fsblock atomic writes, we ensure enough journal credits are
+reserved during:
+
+ 1. Block allocation time in ``ext4_iomap_alloc()``. We first query if there
+    could be a mixed mapping for the underlying requested range. If yes, then we
+    reserve credits of up to ``m_len``, assuming every alternate block can be
+    an unwritten extent followed by a hole.
+
+ 2. During ``->end_io()`` call, we make sure a single transaction is started for
+    doing unwritten-to-written conversion. The loop for conversion is mainly
+    only required to handle a split extent across leaf blocks.
+
+How to
+~~~~~~
+
+Creating Filesystems with Atomic Write Support
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+First check the atomic write units supported by block device.
+See :ref:`atomic_write_bdev_support` for more details.
+
+For single-fsblock atomic writes with a larger block size
+(on systems with block size < page size):
+
+.. code-block:: bash
+
+    # Create an ext4 filesystem with a 16KB block size
+    # (requires page size >= 16KB)
+    mkfs.ext4 -b 16384 /dev/device
+
+For multi-fsblock atomic writes with bigalloc:
+
+.. code-block:: bash
+
+    # Create an ext4 filesystem with bigalloc and 64KB cluster size
+    mkfs.ext4 -F -O bigalloc -b 4096 -C 65536 /dev/device
+
+Where ``-b`` specifies the block size, ``-C`` specifies the cluster size in bytes,
+and ``-O bigalloc`` enables the bigalloc feature.
+
+Application Interface
+^^^^^^^^^^^^^^^^^^^^^
+
+Applications can use the ``pwritev2()`` system call with the ``RWF_ATOMIC`` flag
+to perform atomic writes:
+
+.. code-block:: c
+
+    pwritev2(fd, iov, iovcnt, offset, RWF_ATOMIC);
+
+The write must be aligned to the filesystem's block size and not exceed the
+filesystem's maximum atomic write unit size.
+See ``generic_atomic_write_valid()`` for more details.
+
+``statx()`` system call with ``STATX_WRITE_ATOMIC`` flag can provides following
+details:
+
+ * ``stx_atomic_write_unit_min``: Minimum size of an atomic write request.
+ * ``stx_atomic_write_unit_max``: Maximum size of an atomic write request.
+ * ``stx_atomic_write_segments_max``: Upper limit for segments. The number of
+   separate memory buffers that can be gathered into a write operation
+   (e.g., the iovcnt parameter for IOV_ITER). Currently, this is always set to one.
+
+The STATX_ATTR_WRITE_ATOMIC flag in ``statx->attributes`` is set if atomic
+writes are supported.
+
+.. _atomic_write_bdev_support:
+
+Hardware Support
+~~~~~~~~~~~~~~~~
+
+The underlying storage device must support atomic write operations.
+Modern NVMe and SCSI devices often provide this capability.
+The Linux kernel exposes this information through sysfs:
+
+* ``/sys/block/<device>/queue/atomic_write_unit_min`` - Minimum atomic write size
+* ``/sys/block/<device>/queue/atomic_write_unit_max`` - Maximum atomic write size
+
+Nonzero values for these attributes indicate that the device supports
+atomic writes.
+
+See Also
+~~~~~~~~
+
+* :doc:`bigalloc` - Documentation on the bigalloc feature
+* :doc:`allocators` - Documentation on block allocation in ext4
+* Support for atomic block writes in 6.13:
+  https://lwn.net/Articles/1009298/
diff --git a/Documentation/filesystems/ext4/bitmaps.rst b/Documentation/filesystems/ext4/bitmaps.rst
index 91c45d86e9bb..9d7d7b083a25 100644
--- a/Documentation/filesystems/ext4/bitmaps.rst
+++ b/Documentation/filesystems/ext4/bitmaps.rst
@@ -19,10 +19,3 @@ necessarily the case that no blocks are in use -- if ``meta_bg`` is set,
 the bitmaps and group descriptor live inside the group. Unfortunately,
 ext2fs_test_block_bitmap2() will return '0' for those locations,
 which produces confusing debugfs output.
-
-Inode Table
------------
-Inode tables are statically allocated at mkfs time.  Each block group
-descriptor points to the start of the table, and the superblock records
-the number of inodes per group.  See the section on inodes for more
-information.
diff --git a/Documentation/filesystems/ext4/blockgroup.rst b/Documentation/filesystems/ext4/blockgroup.rst
index ed5a5cac6d40..7cbf0b2b778e 100644
--- a/Documentation/filesystems/ext4/blockgroup.rst
+++ b/Documentation/filesystems/ext4/blockgroup.rst
@@ -1,7 +1,10 @@
 .. SPDX-License-Identifier: GPL-2.0
 
+Block Groups
+------------
+
 Layout
-------
+~~~~~~
 
 The layout of a standard block group is approximately as follows (each
 of these fields is discussed in a separate section below):
@@ -60,7 +63,7 @@ groups (flex_bg). Leftover space is used for file data blocks, indirect
 block maps, extent tree blocks, and extended attributes.
 
 Flexible Block Groups
----------------------
+~~~~~~~~~~~~~~~~~~~~~
 
 Starting in ext4, there is a new feature called flexible block groups
 (flex_bg). In a flex_bg, several block groups are tied together as one
@@ -78,7 +81,7 @@ if flex_bg is enabled. The number of block groups that make up a
 flex_bg is given by 2 ^ ``sb.s_log_groups_per_flex``.
 
 Meta Block Groups
------------------
+~~~~~~~~~~~~~~~~~
 
 Without the option META_BG, for safety concerns, all block group
 descriptors copies are kept in the first block group. Given the default
@@ -117,7 +120,7 @@ Please see an important note about ``BLOCK_UNINIT`` in the section about
 block and inode bitmaps.
 
 Lazy Block Group Initialization
--------------------------------
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 A new feature for ext4 are three block group descriptor flags that
 enable mkfs to skip initializing other parts of the block group
diff --git a/Documentation/filesystems/ext4/dynamic.rst b/Documentation/filesystems/ext4/dynamic.rst
index bb0c84333341..bbad439aada2 100644
--- a/Documentation/filesystems/ext4/dynamic.rst
+++ b/Documentation/filesystems/ext4/dynamic.rst
@@ -6,7 +6,9 @@ Dynamic Structures
 Dynamic metadata are created on the fly when files and blocks are
 allocated to files.
 
-.. include:: inodes.rst
-.. include:: ifork.rst
-.. include:: directory.rst
-.. include:: attributes.rst
+.. toctree::
+
+   inodes
+   ifork
+   directory
+   attributes
diff --git a/Documentation/filesystems/ext4/globals.rst b/Documentation/filesystems/ext4/globals.rst
index b17418974fd3..c6a6abce818a 100644
--- a/Documentation/filesystems/ext4/globals.rst
+++ b/Documentation/filesystems/ext4/globals.rst
@@ -6,9 +6,12 @@ Global Structures
 The filesystem is sharded into a number of block groups, each of which
 have static metadata at fixed locations.
 
-.. include:: super.rst
-.. include:: group_descr.rst
-.. include:: bitmaps.rst
-.. include:: mmp.rst
-.. include:: journal.rst
-.. include:: orphan.rst
+.. toctree::
+
+   super
+   group_descr
+   bitmaps
+   inode_table
+   mmp
+   journal
+   orphan
diff --git a/Documentation/filesystems/ext4/index.rst b/Documentation/filesystems/ext4/index.rst
index 705d813d558f..1ff8150c50e9 100644
--- a/Documentation/filesystems/ext4/index.rst
+++ b/Documentation/filesystems/ext4/index.rst
@@ -5,7 +5,7 @@ ext4 Data Structures and Algorithms
 ===================================
 
 .. toctree::
-   :maxdepth: 6
+   :maxdepth: 2
    :numbered:
 
    about
diff --git a/Documentation/filesystems/ext4/inode_table.rst b/Documentation/filesystems/ext4/inode_table.rst
new file mode 100644
index 000000000000..f7900a52c0d5
--- /dev/null
+++ b/Documentation/filesystems/ext4/inode_table.rst
@@ -0,0 +1,9 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Inode Table
+-----------
+
+Inode tables are statically allocated at mkfs time.  Each block group
+descriptor points to the start of the table, and the superblock records
+the number of inodes per group.  See :doc:`inode documentation <inodes>`
+for more information on inode table layout.
diff --git a/Documentation/filesystems/ext4/overview.rst b/Documentation/filesystems/ext4/overview.rst
index 0fad6eda6e15..171c3963d7f6 100644
--- a/Documentation/filesystems/ext4/overview.rst
+++ b/Documentation/filesystems/ext4/overview.rst
@@ -16,12 +16,15 @@ All fields in ext4 are written to disk in little-endian order. HOWEVER,
 all fields in jbd2 (the journal) are written to disk in big-endian
 order.
 
-.. include:: blocks.rst
-.. include:: blockgroup.rst
-.. include:: special_inodes.rst
-.. include:: allocators.rst
-.. include:: checksums.rst
-.. include:: bigalloc.rst
-.. include:: inlinedata.rst
-.. include:: eainode.rst
-.. include:: verity.rst
+.. toctree::
+
+   blocks
+   blockgroup
+   special_inodes
+   allocators
+   checksums
+   bigalloc
+   inlinedata
+   eainode
+   verity
+   atomic_writes