diff options
Diffstat (limited to 'Documentation/admin-guide/device-mapper')
5 files changed, 209 insertions, 5 deletions
diff --git a/Documentation/admin-guide/device-mapper/delay.rst b/Documentation/admin-guide/device-mapper/delay.rst index 4d667228e744..a1e673c0e782 100644 --- a/Documentation/admin-guide/device-mapper/delay.rst +++ b/Documentation/admin-guide/device-mapper/delay.rst @@ -3,7 +3,7 @@ dm-delay ======== Device-Mapper's "delay" target delays reads and/or writes -and/or flushs and optionally maps them to different devices. +and/or flushes and optionally maps them to different devices. Arguments:: @@ -18,7 +18,7 @@ Table line has to either have 3, 6 or 9 arguments: to write and flush operations on optionally different write_device with optionally different sector offset -9: same as 6 arguments plus define flush_offset and flush_delay explicitely +9: same as 6 arguments plus define flush_offset and flush_delay explicitly on/with optionally different flush_device/flush_offset. Offsets are specified in sectors. @@ -40,7 +40,7 @@ Example scripts #!/bin/sh # # Create mapped device delaying write and flush operations for 400ms and - # splitting reads to device $1 but writes and flushs to different device $2 + # splitting reads to device $1 but writes and flushes to different device $2 # to different offsets of 2048 and 4096 sectors respectively. # dmsetup create delayed --table "0 `blockdev --getsz $1` delay $1 2048 0 $2 4096 400" @@ -48,7 +48,7 @@ Example scripts :: #!/bin/sh # - # Create mapped device delaying reads for 50ms, writes for 100ms and flushs for 333ms + # Create mapped device delaying reads for 50ms, writes for 100ms and flushes for 333ms # onto the same backing device at offset 0 sectors. # dmsetup create delayed --table "0 `blockdev --getsz $1` delay $1 0 50 $2 0 100 $1 0 333" diff --git a/Documentation/admin-guide/device-mapper/dm-pcache.rst b/Documentation/admin-guide/device-mapper/dm-pcache.rst new file mode 100644 index 000000000000..09d327ef4b14 --- /dev/null +++ b/Documentation/admin-guide/device-mapper/dm-pcache.rst @@ -0,0 +1,202 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================================= +dm-pcache — Persistent Cache +================================= + +*Author: Dongsheng Yang <dongsheng.yang@linux.dev>* + +This document describes *dm-pcache*, a Device-Mapper target that lets a +byte-addressable *DAX* (persistent-memory, “pmem”) region act as a +high-performance, crash-persistent cache in front of a slower block +device. The code lives in `drivers/md/dm-pcache/`. + +Quick feature summary +===================== + +* *Write-back* caching (only mode currently supported). +* *16 MiB segments* allocated on the pmem device. +* *Data CRC32* verification (optional, per cache). +* Crash-safe: every metadata structure is duplicated (`PCACHE_META_INDEX_MAX + == 2`) and protected with CRC+sequence numbers. +* *Multi-tree indexing* (indexing trees sharded by logical address) for high PMem parallelism +* Pure *DAX path* I/O – no extra BIO round-trips +* *Log-structured write-back* that preserves backend crash-consistency + + +Constructor +=========== + +:: + + pcache <cache_dev> <backing_dev> [<number_of_optional_arguments> <cache_mode writeback> <data_crc true|false>] + +========================= ==================================================== +``cache_dev`` Any DAX-capable block device (``/dev/pmem0``…). + All metadata *and* cached blocks are stored here. + +``backing_dev`` The slow block device to be cached. + +``cache_mode`` Optional, Only ``writeback`` is accepted at the + moment. + +``data_crc`` Optional, default to ``false`` + + * ``true`` – store CRC32 for every cached entry + and verify on reads + * ``false`` – skip CRC (faster) +========================= ==================================================== + +Example +------- + +.. code-block:: shell + + dmsetup create pcache_sdb --table \ + "0 $(blockdev --getsz /dev/sdb) pcache /dev/pmem0 /dev/sdb 4 cache_mode writeback data_crc true" + +The first time a pmem device is used, dm-pcache formats it automatically +(super-block, cache_info, etc.). + + +Status line +=========== + +``dmsetup status <device>`` (``STATUSTYPE_INFO``) prints: + +:: + + <sb_flags> <seg_total> <cache_segs> <segs_used> \ + <gc_percent> <cache_flags> \ + <key_head_seg>:<key_head_off> \ + <dirty_tail_seg>:<dirty_tail_off> \ + <key_tail_seg>:<key_tail_off> + +Field meanings +-------------- + +=============================== ============================================= +``sb_flags`` Super-block flags (e.g. endian marker). + +``seg_total`` Number of physical *pmem* segments. + +``cache_segs`` Number of segments used for cache. + +``segs_used`` Segments currently allocated (bitmap weight). + +``gc_percent`` Current GC high-water mark (0-90). + +``cache_flags`` Bit 0 – DATA_CRC enabled + Bit 1 – INIT_DONE (cache initialised) + Bits 2-5 – cache mode (0 == WB). + +``key_head`` Where new key-sets are being written. + +``dirty_tail`` First dirty key-set that still needs + write-back to the backing device. + +``key_tail`` First key-set that may be reclaimed by GC. +=============================== ============================================= + + +Messages +======== + +*Change GC trigger* + +:: + + dmsetup message <dev> 0 gc_percent <0-90> + + +Theory of operation +=================== + +Sub-devices +----------- + +==================== ========================================================= +backing_dev Any block device (SSD/HDD/loop/LVM, etc.). +cache_dev DAX device; must expose direct-access memory. +==================== ========================================================= + +Segments and key-sets +--------------------- + +* The pmem space is divided into *16 MiB segments*. +* Each write allocates space from a per-CPU *data_head* inside a segment. +* A *cache-key* records a logical range on the origin and where it lives + inside pmem (segment + offset + generation). +* 128 keys form a *key-set* (kset); ksets are written sequentially in pmem + and are themselves crash-safe (CRC). +* The pair *(key_tail, dirty_tail)* delimit clean/dirty and live/dead ksets. + +Write-back +---------- + +Dirty keys are queued into a tree; a background worker copies data +back to the backing_dev and advances *dirty_tail*. A FLUSH/FUA bio from the +upper layers forces an immediate metadata commit. + +Garbage collection +------------------ + +GC starts when ``segs_used >= seg_total * gc_percent / 100``. It walks +from *key_tail*, frees segments whose every key has been invalidated, and +advances *key_tail*. + +CRC verification +---------------- + +If ``data_crc is enabled`` dm-pcache computes a CRC32 over every cached data +range when it is inserted and stores it in the on-media key. Reads +validate the CRC before copying to the caller. + + +Failure handling +================ + +* *pmem media errors* – all metadata copies are read with + ``copy_mc_to_kernel``; an uncorrectable error logs and aborts initialisation. +* *Cache full* – if no free segment can be found, writes return ``-EBUSY``; + dm-pcache retries internally (request deferral). +* *System crash* – on attach, the driver replays ksets from *key_tail* to + rebuild the in-core trees; every segment’s generation guards against + use-after-free keys. + + +Limitations & TODO +================== + +* Only *write-back* mode; other modes planned. +* Only FIFO cache invalidate; other (LRU, ARC...) planned. +* Table reload is not supported currently. +* Discard planned. + + +Example workflow +================ + +.. code-block:: shell + + # 1. Create devices + dmsetup create pcache_sdb --table \ + "0 $(blockdev --getsz /dev/sdb) pcache /dev/pmem0 /dev/sdb 4 cache_mode writeback data_crc true" + + # 2. Put a filesystem on top + mkfs.ext4 /dev/mapper/pcache_sdb + mount /dev/mapper/pcache_sdb /mnt + + # 3. Tune GC threshold to 80 % + dmsetup message pcache_sdb 0 gc_percent 80 + + # 4. Observe status + watch -n1 'dmsetup status pcache_sdb' + + # 5. Shutdown + umount /mnt + dmsetup remove pcache_sdb + + +``dm-pcache`` is under active development; feedback, bug reports and patches +are very welcome! diff --git a/Documentation/admin-guide/device-mapper/index.rst b/Documentation/admin-guide/device-mapper/index.rst index cc5aec861576..f1c1f4b824ba 100644 --- a/Documentation/admin-guide/device-mapper/index.rst +++ b/Documentation/admin-guide/device-mapper/index.rst @@ -18,6 +18,7 @@ Device Mapper dm-integrity dm-io dm-log + dm-pcache dm-queue-length dm-raid dm-service-time diff --git a/Documentation/admin-guide/device-mapper/vdo-design.rst b/Documentation/admin-guide/device-mapper/vdo-design.rst index 3cd59decbec0..faa0ecd4a5ae 100644 --- a/Documentation/admin-guide/device-mapper/vdo-design.rst +++ b/Documentation/admin-guide/device-mapper/vdo-design.rst @@ -600,7 +600,7 @@ lock and return itself to the pool. All storage within vdo is managed as 4KB blocks, but it can accept writes as small as 512 bytes. Processing a write that is smaller than 4K requires a read-modify-write operation that reads the relevant 4K block, copies the -new data over the approriate sectors of the block, and then launches a +new data over the appropriate sectors of the block, and then launches a write operation for the modified data block. The read and write stages of this operation are nearly identical to the normal read and write operations, and a single data_vio is used throughout this operation. diff --git a/Documentation/admin-guide/device-mapper/vdo.rst b/Documentation/admin-guide/device-mapper/vdo.rst index a14e6d3e787c..8a67b320a97b 100644 --- a/Documentation/admin-guide/device-mapper/vdo.rst +++ b/Documentation/admin-guide/device-mapper/vdo.rst @@ -1,5 +1,6 @@ .. SPDX-License-Identifier: GPL-2.0-only +====== dm-vdo ====== |