From b0a4aa950c68b5010831ecfc450510c64e4d80ba Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Thu, 18 Apr 2019 14:21:14 -0300 Subject: docs: nvdimm: convert to ReST Rename the nvdimm documentation files to ReST, add an index for them and adjust in order to produce a nice html output via the Sphinx build system. At its new index.rst, let's add a :orphan: while this is not linked to the main index.rst file, in order to avoid build warnings. Signed-off-by: Mauro Carvalho Chehab Acked-by: Dan Williams --- Documentation/nvdimm/btt.rst | 285 ++++++++++++ Documentation/nvdimm/btt.txt | 273 ------------ Documentation/nvdimm/index.rst | 12 + Documentation/nvdimm/nvdimm.rst | 887 ++++++++++++++++++++++++++++++++++++++ Documentation/nvdimm/nvdimm.txt | 815 ---------------------------------- Documentation/nvdimm/security.rst | 143 ++++++ Documentation/nvdimm/security.txt | 141 ------ 7 files changed, 1327 insertions(+), 1229 deletions(-) create mode 100644 Documentation/nvdimm/btt.rst delete mode 100644 Documentation/nvdimm/btt.txt create mode 100644 Documentation/nvdimm/index.rst create mode 100644 Documentation/nvdimm/nvdimm.rst delete mode 100644 Documentation/nvdimm/nvdimm.txt create mode 100644 Documentation/nvdimm/security.rst delete mode 100644 Documentation/nvdimm/security.txt (limited to 'Documentation/nvdimm') diff --git a/Documentation/nvdimm/btt.rst b/Documentation/nvdimm/btt.rst new file mode 100644 index 000000000000..2d8269f834bd --- /dev/null +++ b/Documentation/nvdimm/btt.rst @@ -0,0 +1,285 @@ +============================= +BTT - Block Translation Table +============================= + + +1. Introduction +=============== + +Persistent memory based storage is able to perform IO at byte (or more +accurately, cache line) granularity. However, we often want to expose such +storage as traditional block devices. The block drivers for persistent memory +will do exactly this. However, they do not provide any atomicity guarantees. +Traditional SSDs typically provide protection against torn sectors in hardware, +using stored energy in capacitors to complete in-flight block writes, or perhaps +in firmware. We don't have this luxury with persistent memory - if a write is in +progress, and we experience a power failure, the block will contain a mix of old +and new data. Applications may not be prepared to handle such a scenario. + +The Block Translation Table (BTT) provides atomic sector update semantics for +persistent memory devices, so that applications that rely on sector writes not +being torn can continue to do so. The BTT manifests itself as a stacked block +device, and reserves a portion of the underlying storage for its metadata. At +the heart of it, is an indirection table that re-maps all the blocks on the +volume. It can be thought of as an extremely simple file system that only +provides atomic sector updates. + + +2. Static Layout +================ + +The underlying storage on which a BTT can be laid out is not limited in any way. +The BTT, however, splits the available space into chunks of up to 512 GiB, +called "Arenas". + +Each arena follows the same layout for its metadata, and all references in an +arena are internal to it (with the exception of one field that points to the +next arena). 
The following depicts the "On-disk" metadata layout::
+
+
+  Backing Store       +------->  Arena
+ +---------------+    |   +------------------+
+ |               |    |   | Arena info block |
+ |    Arena 0    +----+   |       4K         |
+ |     512G      |        +------------------+
+ |               |        |                  |
+ +---------------+        |                  |
+ |               |        |                  |
+ |    Arena 1    |        |   Data Blocks    |
+ |     512G      |        |                  |
+ |               |        |                  |
+ +---------------+        |                  |
+ | .             |        |                  |
+ | .             |        |                  |
+ | .             |        |                  |
+ |               |        |                  |
+ |               |        |                  |
+ +---------------+        +------------------+
+                          |                  |
+                          |     BTT Map      |
+                          |                  |
+                          |                  |
+                          +------------------+
+                          |                  |
+                          |     BTT Flog     |
+                          |                  |
+                          +------------------+
+                          | Info block copy  |
+                          |       4K         |
+                          +------------------+
+
+
+3. Theory of Operation
+======================
+
+
+a. The BTT Map
+--------------
+
+The map is a simple lookup/indirection table that maps an LBA to an internal
+block. Each map entry is 32 bits. The two most significant bits are special
+flags, and the remaining form the internal block number.
+
+======== =============================================================
+Bit      Description
+======== =============================================================
+31 - 30  Error and Zero flags - Used in the following way:
+
+         == ==  ====================================================
+         31 30  Description
+         == ==  ====================================================
+         0  0   Initial state. Reads return zeroes; Premap = Postmap
+         0  1   Zero state: Reads return zeroes
+         1  0   Error state: Reads fail; Writes clear 'E' bit
+         1  1   Normal Block – has valid postmap
+         == ==  ====================================================
+
+29 - 0   Mappings to internal 'postmap' blocks
+======== =============================================================
+
+
+Some of the terminology that will be subsequently used:
+
+============    ================================================================
+External LBA    LBA as made visible to upper layers.
+ABA             Arena Block Address - Block offset/number within an arena
+Premap ABA      The block offset into an arena, which was decided upon by range
+                checking the External LBA
+Postmap ABA     The block number in the "Data Blocks" area obtained after
+                indirection from the map
+nfree           The number of free blocks that are maintained at any given time.
+                This is the number of concurrent writes that can happen to the
+                arena.
+============    ================================================================
+
+
+For example, after adding a BTT, we surface a disk of 1024G. We get a read for
+the external LBA at 768G. This falls into the second arena, and of the 512G
+worth of blocks that this arena contributes, this block is at 256G. Thus, the
+premap ABA is 256G. We now refer to the map, and find out the mapping for block
+'X' (256G) points to block 'Y', say '64'. Thus the postmap ABA is 64.
+
+
+b. The BTT Flog
+---------------
+
+The BTT provides sector atomicity by making every write an "allocating write",
+i.e. every write goes to a "free" block. A running list of free blocks is
+maintained in the form of the BTT flog. 'Flog' is a combination of the words
+"free list" and "log". The flog contains 'nfree' entries, and an entry contains:
+
+======== =====================================================================
+lba      The premap ABA that is being written to
+old_map  The old postmap ABA - after 'this' write completes, this will be a
+         free block.
+new_map  The new postmap ABA. The map will be updated to reflect this
+         lba->postmap_aba mapping, but we log it here in case we have to
+         recover.
+seq      Sequence number to mark which of the 2 sections of this flog entry is
+         valid/newest. It cycles between 01->10->11->01 (binary) under normal
+         operation, with 00 indicating an uninitialized state.
+lba'     alternate lba entry
+old_map' alternate old postmap entry
+new_map' alternate new postmap entry
+seq'     alternate sequence number.
+======== =====================================================================
+
+Each of the above fields is 32-bit, making one entry 32 bytes. Entries are also
+padded to 64 bytes to avoid cache line sharing or aliasing. Flog updates are
+done such that for any entry being written, it:
+
+a. overwrites the 'old' section in the entry based on sequence numbers
+b. writes the 'new' section such that the sequence number is written last.
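+
+To make the sequence-number cycle concrete, here is an illustrative helper
+(not taken from the kernel implementation) that picks the newer of an
+entry's two sections by comparing their 2-bit sequence numbers::
+
+        /*
+         * Returns 0 or 1 for the section whose sequence number is newer
+         * in the 01->10->11->01 cycle; 00 means "never written".
+         */
+        static int newer_flog_section(unsigned int seq0, unsigned int seq1)
+        {
+                if (seq0 == 0)          /* section 0 uninitialized */
+                        return 1;
+                if (seq1 == 0)          /* section 1 uninitialized */
+                        return 0;
+                /* 01 follows 11 in the cycle, so 01 beats 11 */
+                if (seq0 == 1 && seq1 == 3)
+                        return 0;
+                if (seq1 == 1 && seq0 == 3)
+                        return 1;
+                return seq1 > seq0 ? 1 : 0;
+        }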
+
+
+c. The concept of lanes
+-----------------------
+
+While 'nfree' describes the number of IOs an arena can process
+concurrently, 'nlanes' is the number of IOs the BTT device as a whole can
+process::
+
+        nlanes = min(nfree, num_cpus)
+
+A lane number is obtained at the start of any IO, and is used for indexing into
+all the on-disk and in-memory data structures for the duration of the IO. If
+there are more CPUs than the max number of available lanes, then lanes are
+protected by spinlocks.
+
+
+d. In-memory data structure: Read Tracking Table (RTT)
+------------------------------------------------------
+
+Consider a case where we have two threads, one doing reads and the other,
+writes. We can hit a condition where the writer thread grabs a free block to do
+a new IO, but the (slow) reader thread is still reading from it. In other words,
+the reader consulted a map entry, and started reading the corresponding block. A
+writer started writing to the same external LBA, and finished the write updating
+the map for that external LBA to point to its new postmap ABA. At this point the
+internal, postmap block that the reader is (still) reading has been inserted
+into the list of free blocks. If another write comes in for the same LBA, it can
+grab this free block, and start writing to it, causing the reader to read
+incorrect data. To prevent this, we introduce the RTT.
+
+The RTT is a simple, per arena table with 'nfree' entries. Every reader inserts
+into rtt[lane_number], the postmap ABA it is reading, and clears it after the
+read is complete. Every writer thread, after grabbing a free block, checks the
+RTT for its presence. If the postmap free block is in the RTT, it waits till the
+reader clears the RTT entry, and only then starts writing to it.
+
+
+e. In-memory data structure: map locks
+--------------------------------------
+
+Consider a case where two writer threads are writing to the same LBA. There can
+be a race in the following sequence of steps::
+
+        free[lane] = map[premap_aba]
+        map[premap_aba] = postmap_aba
+
+Both threads can update their respective free[lane] with the same old, freed
+postmap_aba. This has made the layout inconsistent by losing a free entry, and
+at the same time, duplicating another free entry for two lanes.
+
+To solve this, we could have a single map lock (per arena) that has to be taken
+before performing the above sequence, but we feel that could be too contentious.
+Instead we use an array of (nfree) map_locks that is indexed by
+(premap_aba modulo nfree).
+
+
+f. Reconstruction from the Flog
+-------------------------------
+
+On startup, we analyze the BTT flog to create our list of free blocks.
We walk +through all the entries, and for each lane, of the set of two possible +'sections', we always look at the most recent one only (based on the sequence +number). The reconstruction rules/steps are simple: + +- Read map[log_entry.lba]. +- If log_entry.new matches the map entry, then log_entry.old is free. +- If log_entry.new does not match the map entry, then log_entry.new is free. + (This case can only be caused by power-fails/unsafe shutdowns) + + +g. Summarizing - Read and Write flows +------------------------------------- + +Read: + +1. Convert external LBA to arena number + pre-map ABA +2. Get a lane (and take lane_lock) +3. Read map to get the entry for this pre-map ABA +4. Enter post-map ABA into RTT[lane] +5. If TRIM flag set in map, return zeroes, and end IO (go to step 8) +6. If ERROR flag set in map, end IO with EIO (go to step 8) +7. Read data from this block +8. Remove post-map ABA entry from RTT[lane] +9. Release lane (and lane_lock) + +Write: + +1. Convert external LBA to Arena number + pre-map ABA +2. Get a lane (and take lane_lock) +3. Use lane to index into in-memory free list and obtain a new block, next flog + index, next sequence number +4. Scan the RTT to check if free block is present, and spin/wait if it is. +5. Write data to this free block +6. Read map to get the existing post-map ABA entry for this pre-map ABA +7. Write flog entry: [premap_aba / old postmap_aba / new postmap_aba / seq_num] +8. Write new post-map ABA into map. +9. Write old post-map entry into the free list +10. Calculate next sequence number and write into the free list entry +11. Release lane (and lane_lock) + + +4. Error Handling +================= + +An arena would be in an error state if any of the metadata is corrupted +irrecoverably, either due to a bug or a media error. The following conditions +indicate an error: + +- Info block checksum does not match (and recovering from the copy also fails) +- All internal available blocks are not uniquely and entirely addressed by the + sum of mapped blocks and free blocks (from the BTT flog). +- Rebuilding free list from the flog reveals missing/duplicate/impossible + entries +- A map entry is out of bounds + +If any of these error conditions are encountered, the arena is put into a read +only state using a flag in the info block. + + +5. Usage +======== + +The BTT can be set up on any disk (namespace) exposed by the libnvdimm subsystem +(pmem, or blk mode). The easiest way to set up such a namespace is using the +'ndctl' utility [1]: + +For example, the ndctl command line to setup a btt with a 4k sector size is:: + + ndctl create-namespace -f -e namespace0.0 -m sector -l 4k + +See ndctl create-namespace --help for more options. + +[1]: https://github.com/pmem/ndctl diff --git a/Documentation/nvdimm/btt.txt b/Documentation/nvdimm/btt.txt deleted file mode 100644 index e293fb664924..000000000000 --- a/Documentation/nvdimm/btt.txt +++ /dev/null @@ -1,273 +0,0 @@ -BTT - Block Translation Table -============================= - - -1. Introduction ---------------- - -Persistent memory based storage is able to perform IO at byte (or more -accurately, cache line) granularity. However, we often want to expose such -storage as traditional block devices. The block drivers for persistent memory -will do exactly this. However, they do not provide any atomicity guarantees. -Traditional SSDs typically provide protection against torn sectors in hardware, -using stored energy in capacitors to complete in-flight block writes, or perhaps -in firmware. 
We don't have this luxury with persistent memory - if a write is in -progress, and we experience a power failure, the block will contain a mix of old -and new data. Applications may not be prepared to handle such a scenario. - -The Block Translation Table (BTT) provides atomic sector update semantics for -persistent memory devices, so that applications that rely on sector writes not -being torn can continue to do so. The BTT manifests itself as a stacked block -device, and reserves a portion of the underlying storage for its metadata. At -the heart of it, is an indirection table that re-maps all the blocks on the -volume. It can be thought of as an extremely simple file system that only -provides atomic sector updates. - - -2. Static Layout ----------------- - -The underlying storage on which a BTT can be laid out is not limited in any way. -The BTT, however, splits the available space into chunks of up to 512 GiB, -called "Arenas". - -Each arena follows the same layout for its metadata, and all references in an -arena are internal to it (with the exception of one field that points to the -next arena). The following depicts the "On-disk" metadata layout: - - - Backing Store +-------> Arena -+---------------+ | +------------------+ -| | | | Arena info block | -| Arena 0 +---+ | 4K | -| 512G | +------------------+ -| | | | -+---------------+ | | -| | | | -| Arena 1 | | Data Blocks | -| 512G | | | -| | | | -+---------------+ | | -| . | | | -| . | | | -| . | | | -| | | | -| | | | -+---------------+ +------------------+ - | | - | BTT Map | - | | - | | - +------------------+ - | | - | BTT Flog | - | | - +------------------+ - | Info block copy | - | 4K | - +------------------+ - - -3. Theory of Operation ----------------------- - - -a. The BTT Map --------------- - -The map is a simple lookup/indirection table that maps an LBA to an internal -block. Each map entry is 32 bits. The two most significant bits are special -flags, and the remaining form the internal block number. - -Bit Description -31 - 30 : Error and Zero flags - Used in the following way: - Bit Description - 31 30 - ----------------------------------------------------------------------- - 00 Initial state. Reads return zeroes; Premap = Postmap - 01 Zero state: Reads return zeroes - 10 Error state: Reads fail; Writes clear 'E' bit - 11 Normal Block – has valid postmap - - -29 - 0 : Mappings to internal 'postmap' blocks - - -Some of the terminology that will be subsequently used: - -External LBA : LBA as made visible to upper layers. -ABA : Arena Block Address - Block offset/number within an arena -Premap ABA : The block offset into an arena, which was decided upon by range - checking the External LBA -Postmap ABA : The block number in the "Data Blocks" area obtained after - indirection from the map -nfree : The number of free blocks that are maintained at any given time. - This is the number of concurrent writes that can happen to the - arena. - - -For example, after adding a BTT, we surface a disk of 1024G. We get a read for -the external LBA at 768G. This falls into the second arena, and of the 512G -worth of blocks that this arena contributes, this block is at 256G. Thus, the -premap ABA is 256G. We now refer to the map, and find out the mapping for block -'X' (256G) points to block 'Y', say '64'. Thus the postmap ABA is 64. - - -b. The BTT Flog ---------------- - -The BTT provides sector atomicity by making every write an "allocating write", -i.e. Every write goes to a "free" block. 
A running list of free blocks is -maintained in the form of the BTT flog. 'Flog' is a combination of the words -"free list" and "log". The flog contains 'nfree' entries, and an entry contains: - -lba : The premap ABA that is being written to -old_map : The old postmap ABA - after 'this' write completes, this will be a - free block. -new_map : The new postmap ABA. The map will up updated to reflect this - lba->postmap_aba mapping, but we log it here in case we have to - recover. -seq : Sequence number to mark which of the 2 sections of this flog entry is - valid/newest. It cycles between 01->10->11->01 (binary) under normal - operation, with 00 indicating an uninitialized state. -lba' : alternate lba entry -old_map': alternate old postmap entry -new_map': alternate new postmap entry -seq' : alternate sequence number. - -Each of the above fields is 32-bit, making one entry 32 bytes. Entries are also -padded to 64 bytes to avoid cache line sharing or aliasing. Flog updates are -done such that for any entry being written, it: -a. overwrites the 'old' section in the entry based on sequence numbers -b. writes the 'new' section such that the sequence number is written last. - - -c. The concept of lanes ------------------------ - -While 'nfree' describes the number of concurrent IOs an arena can process -concurrently, 'nlanes' is the number of IOs the BTT device as a whole can -process. - nlanes = min(nfree, num_cpus) -A lane number is obtained at the start of any IO, and is used for indexing into -all the on-disk and in-memory data structures for the duration of the IO. If -there are more CPUs than the max number of available lanes, than lanes are -protected by spinlocks. - - -d. In-memory data structure: Read Tracking Table (RTT) ------------------------------------------------------- - -Consider a case where we have two threads, one doing reads and the other, -writes. We can hit a condition where the writer thread grabs a free block to do -a new IO, but the (slow) reader thread is still reading from it. In other words, -the reader consulted a map entry, and started reading the corresponding block. A -writer started writing to the same external LBA, and finished the write updating -the map for that external LBA to point to its new postmap ABA. At this point the -internal, postmap block that the reader is (still) reading has been inserted -into the list of free blocks. If another write comes in for the same LBA, it can -grab this free block, and start writing to it, causing the reader to read -incorrect data. To prevent this, we introduce the RTT. - -The RTT is a simple, per arena table with 'nfree' entries. Every reader inserts -into rtt[lane_number], the postmap ABA it is reading, and clears it after the -read is complete. Every writer thread, after grabbing a free block, checks the -RTT for its presence. If the postmap free block is in the RTT, it waits till the -reader clears the RTT entry, and only then starts writing to it. - - -e. In-memory data structure: map locks --------------------------------------- - -Consider a case where two writer threads are writing to the same LBA. There can -be a race in the following sequence of steps: - -free[lane] = map[premap_aba] -map[premap_aba] = postmap_aba - -Both threads can update their respective free[lane] with the same old, freed -postmap_aba. This has made the layout inconsistent by losing a free entry, and -at the same time, duplicating another free entry for two lanes. 
- -To solve this, we could have a single map lock (per arena) that has to be taken -before performing the above sequence, but we feel that could be too contentious. -Instead we use an array of (nfree) map_locks that is indexed by -(premap_aba modulo nfree). - - -f. Reconstruction from the Flog -------------------------------- - -On startup, we analyze the BTT flog to create our list of free blocks. We walk -through all the entries, and for each lane, of the set of two possible -'sections', we always look at the most recent one only (based on the sequence -number). The reconstruction rules/steps are simple: -- Read map[log_entry.lba]. -- If log_entry.new matches the map entry, then log_entry.old is free. -- If log_entry.new does not match the map entry, then log_entry.new is free. - (This case can only be caused by power-fails/unsafe shutdowns) - - -g. Summarizing - Read and Write flows -------------------------------------- - -Read: - -1. Convert external LBA to arena number + pre-map ABA -2. Get a lane (and take lane_lock) -3. Read map to get the entry for this pre-map ABA -4. Enter post-map ABA into RTT[lane] -5. If TRIM flag set in map, return zeroes, and end IO (go to step 8) -6. If ERROR flag set in map, end IO with EIO (go to step 8) -7. Read data from this block -8. Remove post-map ABA entry from RTT[lane] -9. Release lane (and lane_lock) - -Write: - -1. Convert external LBA to Arena number + pre-map ABA -2. Get a lane (and take lane_lock) -3. Use lane to index into in-memory free list and obtain a new block, next flog - index, next sequence number -4. Scan the RTT to check if free block is present, and spin/wait if it is. -5. Write data to this free block -6. Read map to get the existing post-map ABA entry for this pre-map ABA -7. Write flog entry: [premap_aba / old postmap_aba / new postmap_aba / seq_num] -8. Write new post-map ABA into map. -9. Write old post-map entry into the free list -10. Calculate next sequence number and write into the free list entry -11. Release lane (and lane_lock) - - -4. Error Handling -================= - -An arena would be in an error state if any of the metadata is corrupted -irrecoverably, either due to a bug or a media error. The following conditions -indicate an error: -- Info block checksum does not match (and recovering from the copy also fails) -- All internal available blocks are not uniquely and entirely addressed by the - sum of mapped blocks and free blocks (from the BTT flog). -- Rebuilding free list from the flog reveals missing/duplicate/impossible - entries -- A map entry is out of bounds - -If any of these error conditions are encountered, the arena is put into a read -only state using a flag in the info block. - - -5. Usage -======== - -The BTT can be set up on any disk (namespace) exposed by the libnvdimm subsystem -(pmem, or blk mode). The easiest way to set up such a namespace is using the -'ndctl' utility [1]: - -For example, the ndctl command line to setup a btt with a 4k sector size is: - - ndctl create-namespace -f -e namespace0.0 -m sector -l 4k - -See ndctl create-namespace --help for more options. - -[1]: https://github.com/pmem/ndctl - diff --git a/Documentation/nvdimm/index.rst b/Documentation/nvdimm/index.rst new file mode 100644 index 000000000000..1a3402d3775e --- /dev/null +++ b/Documentation/nvdimm/index.rst @@ -0,0 +1,12 @@ +:orphan: + +=================================== +Non-Volatile Memory Device (NVDIMM) +=================================== + +.. 
toctree::
+   :maxdepth: 1
+
+   nvdimm
+   btt
+   security
diff --git a/Documentation/nvdimm/nvdimm.rst b/Documentation/nvdimm/nvdimm.rst
new file mode 100644
index 000000000000..08f855cbb4e6
--- /dev/null
+++ b/Documentation/nvdimm/nvdimm.rst
@@ -0,0 +1,887 @@
+===============================
+LIBNVDIMM: Non-Volatile Devices
+===============================
+
+libnvdimm - kernel / libndctl - userspace helper library
+
+linux-nvdimm@lists.01.org
+
+Version 13
+
+.. contents:
+
+        Glossary
+        Overview
+        Supporting Documents
+        Git Trees
+        LIBNVDIMM PMEM and BLK
+        Why BLK?
+        PMEM vs BLK
+        BLK-REGIONs, PMEM-REGIONs, Atomic Sectors, and DAX
+        Example NVDIMM Platform
+        LIBNVDIMM Kernel Device Model and LIBNDCTL Userspace API
+        LIBNDCTL: Context
+        libndctl: instantiate a new library context example
+        LIBNVDIMM/LIBNDCTL: Bus
+        libnvdimm: control class device in /sys/class
+        libnvdimm: bus
+        libndctl: bus enumeration example
+        LIBNVDIMM/LIBNDCTL: DIMM (NMEM)
+        libnvdimm: DIMM (NMEM)
+        libndctl: DIMM enumeration example
+        LIBNVDIMM/LIBNDCTL: Region
+        libnvdimm: region
+        libndctl: region enumeration example
+        Why Not Encode the Region Type into the Region Name?
+        How Do I Determine the Major Type of a Region?
+        LIBNVDIMM/LIBNDCTL: Namespace
+        libnvdimm: namespace
+        libndctl: namespace enumeration example
+        libndctl: namespace creation example
+        Why the Term "namespace"?
+        LIBNVDIMM/LIBNDCTL: Block Translation Table "btt"
+        libnvdimm: btt layout
+        libndctl: btt creation example
+        Summary LIBNDCTL Diagram
+
+
+Glossary
+========
+
+PMEM:
+  A system-physical-address range where writes are persistent. A
+  block device composed of PMEM is capable of DAX. A PMEM address range
+  may span an interleave of several DIMMs.
+
+BLK:
+  A set of one or more programmable memory mapped apertures provided
+  by a DIMM to access its media. This indirection precludes the
+  performance benefit of interleaving, but enables DIMM-bounded failure
+  modes.
+
+DPA:
+  DIMM Physical Address - a DIMM-relative offset. With one DIMM in
+  the system there would be a 1:1 system-physical-address:DPA association.
+  Once more DIMMs are added a memory controller interleave must be
+  decoded to determine the DPA associated with a given
+  system-physical-address. BLK capacity always has a 1:1 relationship
+  with a single-DIMM's DPA range.
+
+DAX:
+  File system extensions to bypass the page cache and block layer to
+  mmap persistent memory, from a PMEM block device, directly into a
+  process address space.
+
+DSM:
+  Device Specific Method: ACPI method to control a specific
+  device - in this case the firmware.
+
+DCR:
+  NVDIMM Control Region Structure defined in ACPI 6 Section 5.2.25.5.
+  It defines a vendor-id, device-id, and interface format for a given DIMM.
+
+BTT:
+  Block Translation Table: Persistent memory is byte addressable.
+  Existing software may have an expectation that the power-fail-atomicity
+  of writes is at least one sector, 512 bytes. The BTT is an indirection
+  table with atomic update semantics to front a PMEM/BLK block device
+  driver and present arbitrary atomic sector sizes.
+
+LABEL:
+  Metadata stored on a DIMM device that partitions and identifies
+  (persistently names) storage between PMEM and BLK. It also partitions
+  BLK storage to host BTTs with different parameters per BLK-partition.
+  Note that traditional partition tables, GPT/MBR, are layered on top of a
+  BLK or PMEM device.
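+
+As a toy illustration of the interleave decode mentioned in the DPA entry
+above (real decodes are platform specific and much richer), a 2-way,
+256-byte line interleave would map a system-physical-address offset to a
+DPA like this::
+
+        /*
+         * Toy model only: two DIMMs take alternating 256-byte lines of
+         * the interleaved range; real platforms define their own decode.
+         */
+        #define LINE 256
+        static unsigned long long spa_offset_to_dpa(unsigned long long spa_offset)
+        {
+                unsigned long long line = spa_offset / LINE;
+
+                /* each DIMM holds every other line of the range */
+                return (line / 2) * LINE + spa_offset % LINE;
+        }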
+
+
+Overview
+========
+
+The LIBNVDIMM subsystem provides support for three types of NVDIMMs, namely,
+PMEM, BLK, and NVDIMM devices that can simultaneously support both PMEM
+and BLK mode access. These three modes of operation are described by
+the "NVDIMM Firmware Interface Table" (NFIT) in ACPI 6. While the LIBNVDIMM
+implementation is generic and supports pre-NFIT platforms, it was guided
+by the superset of capabilities needed to support this ACPI 6 definition
+for NVDIMM resources. The bulk of the kernel implementation is in place
+to handle the case where DPA accessible via PMEM is aliased with DPA
+accessible via BLK. When that occurs a LABEL is needed to reserve DPA
+for exclusive access via one mode at a time.
+
+Supporting Documents
+--------------------
+
+ACPI 6:
+	http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
+NVDIMM Namespace:
+	http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf
+DSM Interface Example:
+	http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
+Driver Writer's Guide:
+	http://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdf
+
+Git Trees
+---------
+
+LIBNVDIMM:
+	https://git.kernel.org/cgit/linux/kernel/git/djbw/nvdimm.git
+LIBNDCTL:
+	https://github.com/pmem/ndctl.git
+PMEM:
+	https://github.com/01org/prd
+
+
+LIBNVDIMM PMEM and BLK
+======================
+
+Prior to the arrival of the NFIT, non-volatile memory was described to a
+system in various ad-hoc ways. Usually only the bare minimum was
+provided, namely, a single system-physical-address range where writes
+are expected to be durable after a system power loss. Now, the NFIT
+specification standardizes not only the description of PMEM, but also
+BLK and platform message-passing entry points for control and
+configuration.
+
+For each NVDIMM access method (PMEM, BLK), LIBNVDIMM provides a block
+device driver:
+
+    1. PMEM (nd_pmem.ko): Drives a system-physical-address range. This
+       range is contiguous in system memory and may be interleaved (hardware
+       memory controller striped) across multiple DIMMs. When interleaved the
+       platform may optionally provide details of which DIMMs are participating
+       in the interleave.
+
+       Note that while LIBNVDIMM describes system-physical-address ranges that may
+       alias with BLK access as ND_NAMESPACE_PMEM ranges and those without
+       alias as ND_NAMESPACE_IO ranges, to the nd_pmem driver there is no
+       distinction. The different device-types are an implementation detail
+       that userspace can exploit to implement policies like "only interface
+       with address ranges from certain DIMMs". It is worth noting that when
+       aliasing is present and a DIMM lacks a label, then no block device can
+       be created by default as userspace needs to do at least one allocation
+       of DPA to the PMEM range. In contrast ND_NAMESPACE_IO ranges, once
+       registered, can be immediately attached to nd_pmem.
+
+    2. BLK (nd_blk.ko): This driver performs I/O using a set of platform
+       defined apertures. A set of apertures will access just one DIMM.
+       Multiple windows (apertures) allow multiple concurrent accesses, much like
+       tagged-command-queuing, and would likely be used by different threads or
+       different CPUs.
+
+       The NFIT specification defines a standard format for a BLK-aperture, but
+       the spec also allows for vendor specific layouts, and non-NFIT BLK
+       implementations may have other designs for BLK I/O. For this reason
+       "nd_blk" calls back into platform-specific code to perform the I/O.
+
+       One such implementation is defined in the "Driver Writer's Guide" and "DSM
+       Interface Example".
+
+
+Why BLK?
+========
+
+While PMEM provides direct byte-addressable CPU-load/store access to
+NVDIMM storage, it does not provide the best system RAS (recovery,
+availability, and serviceability) model. An access to a corrupted
+system-physical-address causes a CPU exception while an access
+to a corrupted address through a BLK-aperture causes that block window
+to raise an error status in a register. The latter is more aligned with
+the standard error model that host-bus-adapter attached disks present.
+
+Also, if an administrator ever wants to replace a memory device it is
+easier to service a system at DIMM module boundaries. Compare this to
+PMEM where data could be interleaved in an opaque hardware specific
+manner across several DIMMs.
+
+PMEM vs BLK
+-----------
+
+BLK-apertures solve these RAS problems, but their presence is also the
+major contributing factor to the complexity of the ND subsystem. They
+complicate the implementation because PMEM and BLK alias in DPA space.
+Any given DIMM's DPA-range may contribute to one or more
+system-physical-address sets of interleaved DIMMs, *and* may also be
+accessed in its entirety through its BLK-aperture. Accessing a DPA
+through a system-physical-address while simultaneously accessing the
+same DPA through a BLK-aperture has undefined results. For this reason,
+DIMMs with this dual interface configuration include a DSM function to
+store/retrieve a LABEL. The LABEL effectively partitions the DPA-space
+into exclusive system-physical-address and BLK-aperture accessible
+regions. For simplicity a DIMM is allowed a PMEM "region" per each
+interleave set in which it is a member. The remaining DPA space can be
+carved into an arbitrary number of BLK devices with discontiguous
+extents.
+
+BLK-REGIONs, PMEM-REGIONs, Atomic Sectors, and DAX
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+One of the few reasons to allow multiple BLK namespaces per REGION is so
+that each BLK-namespace can be configured with a BTT with unique atomic
+sector sizes. While a PMEM device can host a BTT the LABEL specification
+does not provide for a sector size to be specified for a PMEM namespace.
+
+This is due to the expectation that the primary usage model for PMEM is
+via DAX, and the BTT is incompatible with DAX. However, for the cases
+where an application or filesystem still needs atomic sector update
+guarantees it can register a BTT on a PMEM device or partition. See
+LIBNVDIMM/LIBNDCTL: Block Translation Table "btt"
+
+
+Example NVDIMM Platform
+=======================
+
+For the remainder of this document the following diagram will be
+referenced for any example sysfs layouts::
+
+
+                              (a)                (b)          DIMM   BLK-REGION
+            +-------------------+--------+--------+--------+
+  +------+  |       pm0.0       | blk2.0 | pm1.0  | blk2.1 |    0      region2
+  | imc0 +--+- - - region0- - - +--------+        +--------+
+  +--+---+  |       pm0.0       | blk3.0 | pm1.0  | blk3.1 |    1      region3
+     |      +-------------------+--------v        v--------+
+  +--+---+                               |                 |
+  | cpu0 |                                     region1
+  +--+---+                               |                 |
+     |      +----------------------------^        ^--------+
+  +--+---+  |           blk4.0           | pm1.0  | blk4.0 |    2      region4
+  | imc1 +--+----------------------------|        +--------+
+  +------+  |           blk5.0           | pm1.0  | blk5.0 |    3      region5
+            +----------------------------+--------+--------+
+
+In this platform we have four DIMMs and two memory controllers in one
+socket.
Each unique interface (BLK or PMEM) to DPA space is identified
+by a region device with a dynamically assigned id (REGION0 - REGION5).
+
+    1. The first portion of DIMM0 and DIMM1 is interleaved as REGION0. A
+       single PMEM namespace is created in the REGION0-SPA-range that spans most
+       of DIMM0 and DIMM1 with a user-specified name of "pm0.0". Some of that
+       interleaved system-physical-address range is reclaimed as BLK-aperture
+       accessed space starting at DPA-offset (a) into each DIMM. In that
+       reclaimed space we create two BLK-aperture "namespaces" from REGION2 and
+       REGION3 where "blk2.0" and "blk3.0" are just human readable names that
+       could be set to any user-desired name in the LABEL.
+
+    2. In the last portion of DIMM0 and DIMM1 we have an interleaved
+       system-physical-address range, REGION1, that spans those two DIMMs as
+       well as DIMM2 and DIMM3. Some of REGION1 is allocated to a PMEM namespace
+       named "pm1.0", the rest is reclaimed in 4 BLK-aperture namespaces (for
+       each DIMM in the interleave set), "blk2.1", "blk3.1", "blk4.0", and
+       "blk5.0".
+
+    3. The portion of DIMM2 and DIMM3 that does not participate in the REGION1
+       interleaved system-physical-address range (i.e. the DPA address past
+       offset (b)) is also included in the "blk4.0" and "blk5.0" namespaces.
+       Note, that this example shows that BLK-aperture namespaces don't need to
+       be contiguous in DPA-space.
+
+    This bus is provided by the kernel under the device
+    /sys/devices/platform/nfit_test.0 when CONFIG_NFIT_TEST is enabled and
+    the nfit_test.ko module is loaded. This not only tests LIBNVDIMM but the
+    acpi_nfit.ko driver as well.
+
+
+LIBNVDIMM Kernel Device Model and LIBNDCTL Userspace API
+========================================================
+
+What follows is a description of the LIBNVDIMM sysfs layout and a
+corresponding object hierarchy diagram as viewed through the LIBNDCTL
+API. The example sysfs paths and diagrams are relative to the Example
+NVDIMM Platform which is also the LIBNVDIMM bus used in the LIBNDCTL unit
+test.
+
+LIBNDCTL: Context
+-----------------
+
+Every API call in the LIBNDCTL library requires a context that holds the
+logging parameters and other library instance state. The library is
+based on the libabc template:
+
+	https://git.kernel.org/cgit/linux/kernel/git/kay/libabc.git
+
+LIBNDCTL: instantiate a new library context example
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+::
+
+	struct ndctl_ctx *ctx;
+
+	if (ndctl_new(&ctx) == 0)
+		return ctx;
+	else
+		return NULL;
+
+LIBNVDIMM/LIBNDCTL: Bus
+-----------------------
+
+A bus has a 1:1 relationship with an NFIT. The current expectation for
+ACPI based systems is that there is only ever one platform-global NFIT.
+That said, it is trivial to register multiple NFITs, the specification
+does not preclude it. The infrastructure supports multiple busses and
+we use this capability to test multiple NFIT configurations in the unit
+test.
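+
+Since the specification allows multiple NFITs, a quick sanity check that
+the expected number of busses surfaced can be written with nothing more
+than the ndctl_bus_foreach() iterator used throughout this document (an
+illustrative sketch, not part of the unit test)::
+
+	static int count_busses(struct ndctl_ctx *ctx)
+	{
+		struct ndctl_bus *bus;
+		int count = 0;
+
+		/* one iteration per registered LIBNVDIMM bus */
+		ndctl_bus_foreach(ctx, bus)
+			count++;
+		return count;
+	}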
+ +LIBNVDIMM: control class device in /sys/class +--------------------------------------------- + +This character device accepts DSM messages to be passed to DIMM +identified by its NFIT handle:: + + /sys/class/nd/ndctl0 + |-- dev + |-- device -> ../../../ndbus0 + |-- subsystem -> ../../../../../../../class/nd + + + +LIBNVDIMM: bus +-------------- + +:: + + struct nvdimm_bus *nvdimm_bus_register(struct device *parent, + struct nvdimm_bus_descriptor *nfit_desc); + +:: + + /sys/devices/platform/nfit_test.0/ndbus0 + |-- commands + |-- nd + |-- nfit + |-- nmem0 + |-- nmem1 + |-- nmem2 + |-- nmem3 + |-- power + |-- provider + |-- region0 + |-- region1 + |-- region2 + |-- region3 + |-- region4 + |-- region5 + |-- uevent + `-- wait_probe + +LIBNDCTL: bus enumeration example +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Find the bus handle that describes the bus from Example NVDIMM Platform:: + + static struct ndctl_bus *get_bus_by_provider(struct ndctl_ctx *ctx, + const char *provider) + { + struct ndctl_bus *bus; + + ndctl_bus_foreach(ctx, bus) + if (strcmp(provider, ndctl_bus_get_provider(bus)) == 0) + return bus; + + return NULL; + } + + bus = get_bus_by_provider(ctx, "nfit_test.0"); + + +LIBNVDIMM/LIBNDCTL: DIMM (NMEM) +------------------------------- + +The DIMM device provides a character device for sending commands to +hardware, and it is a container for LABELs. If the DIMM is defined by +NFIT then an optional 'nfit' attribute sub-directory is available to add +NFIT-specifics. + +Note that the kernel device name for "DIMMs" is "nmemX". The NFIT +describes these devices via "Memory Device to System Physical Address +Range Mapping Structure", and there is no requirement that they actually +be physical DIMMs, so we use a more generic name. + +LIBNVDIMM: DIMM (NMEM) +^^^^^^^^^^^^^^^^^^^^^^ + +:: + + struct nvdimm *nvdimm_create(struct nvdimm_bus *nvdimm_bus, void *provider_data, + const struct attribute_group **groups, unsigned long flags, + unsigned long *dsm_mask); + +:: + + /sys/devices/platform/nfit_test.0/ndbus0 + |-- nmem0 + | |-- available_slots + | |-- commands + | |-- dev + | |-- devtype + | |-- driver -> ../../../../../bus/nd/drivers/nvdimm + | |-- modalias + | |-- nfit + | | |-- device + | | |-- format + | | |-- handle + | | |-- phys_id + | | |-- rev_id + | | |-- serial + | | `-- vendor + | |-- state + | |-- subsystem -> ../../../../../bus/nd + | `-- uevent + |-- nmem1 + [..] + + +LIBNDCTL: DIMM enumeration example +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Note, in this example we are assuming NFIT-defined DIMMs which are +identified by an "nfit_handle" a 32-bit value where: + + - Bit 3:0 DIMM number within the memory channel + - Bit 7:4 memory channel number + - Bit 11:8 memory controller ID + - Bit 15:12 socket ID (within scope of a Node controller if node + controller is present) + - Bit 27:16 Node Controller ID + - Bit 31:28 Reserved + +:: + + static struct ndctl_dimm *get_dimm_by_handle(struct ndctl_bus *bus, + unsigned int handle) + { + struct ndctl_dimm *dimm; + + ndctl_dimm_foreach(bus, dimm) + if (ndctl_dimm_get_handle(dimm) == handle) + return dimm; + + return NULL; + } + + #define DIMM_HANDLE(n, s, i, c, d) \ + (((n & 0xfff) << 16) | ((s & 0xf) << 12) | ((i & 0xf) << 8) \ + | ((c & 0xf) << 4) | (d & 0xf)) + + dimm = get_dimm_by_handle(bus, DIMM_HANDLE(0, 0, 0, 0, 0)); + +LIBNVDIMM/LIBNDCTL: Region +-------------------------- + +A generic REGION device is registered for each PMEM range or BLK-aperture +set. 
Per the example there are 6 regions: 2 PMEM and 4 BLK-aperture
+sets on the "nfit_test.0" bus. The primary role of regions is to be a
+container of "mappings". A mapping is a tuple of
+<dimm, dpa-start-offset, length>.
+
+LIBNVDIMM provides a built-in driver for these REGION devices. This driver
+is responsible for reconciling the aliased DPA mappings across all
+regions, parsing the LABEL, if present, and then emitting NAMESPACE
+devices with the resolved/exclusive DPA-boundaries for the nd_pmem or
+nd_blk device driver to consume.
+
+In addition to the generic attributes of "mapping"s, "interleave_ways"
+and "size" the REGION device also exports some convenience attributes.
+"nstype" indicates the integer type of namespace-device this region
+emits, "devtype" duplicates the DEVTYPE variable stored by udev at the
+'add' event, "modalias" duplicates the MODALIAS variable stored by udev
+at the 'add' event, and finally, the optional "spa_index" is provided in
+the case where the region is defined by a SPA.
+
+LIBNVDIMM: region::
+
+	struct nd_region *nvdimm_pmem_region_create(struct nvdimm_bus *nvdimm_bus,
+			struct nd_region_desc *ndr_desc);
+	struct nd_region *nvdimm_blk_region_create(struct nvdimm_bus *nvdimm_bus,
+			struct nd_region_desc *ndr_desc);
+
+::
+
+	/sys/devices/platform/nfit_test.0/ndbus0
+	|-- region0
+	|   |-- available_size
+	|   |-- btt0
+	|   |-- btt_seed
+	|   |-- devtype
+	|   |-- driver -> ../../../../../bus/nd/drivers/nd_region
+	|   |-- init_namespaces
+	|   |-- mapping0
+	|   |-- mapping1
+	|   |-- mappings
+	|   |-- modalias
+	|   |-- namespace0.0
+	|   |-- namespace_seed
+	|   |-- numa_node
+	|   |-- nfit
+	|   |   `-- spa_index
+	|   |-- nstype
+	|   |-- set_cookie
+	|   |-- size
+	|   |-- subsystem -> ../../../../../bus/nd
+	|   `-- uevent
+	|-- region1
+	[..]
+
+LIBNDCTL: region enumeration example
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Sample region retrieval routines based on NFIT-unique data like
+"spa_index" (interleave set id) for PMEM and "nfit_handle" (dimm id) for
+BLK::
+
+	static struct ndctl_region *get_pmem_region_by_spa_index(struct ndctl_bus *bus,
+			unsigned int spa_index)
+	{
+		struct ndctl_region *region;
+
+		ndctl_region_foreach(bus, region) {
+			if (ndctl_region_get_type(region) != ND_DEVICE_REGION_PMEM)
+				continue;
+			if (ndctl_region_get_spa_index(region) == spa_index)
+				return region;
+		}
+		return NULL;
+	}
+
+	static struct ndctl_region *get_blk_region_by_dimm_handle(struct ndctl_bus *bus,
+			unsigned int handle)
+	{
+		struct ndctl_region *region;
+
+		ndctl_region_foreach(bus, region) {
+			struct ndctl_mapping *map;
+
+			if (ndctl_region_get_type(region) != ND_DEVICE_REGION_BLOCK)
+				continue;
+			ndctl_mapping_foreach(region, map) {
+				struct ndctl_dimm *dimm = ndctl_mapping_get_dimm(map);
+
+				if (ndctl_dimm_get_handle(dimm) == handle)
+					return region;
+			}
+		}
+		return NULL;
+	}
+
+
+Why Not Encode the Region Type into the Region Name?
+----------------------------------------------------
+
+At first glance it seems since NFIT defines just PMEM and BLK interface
+types that we should simply name REGION devices with something derived
+from those type names. However, the ND subsystem explicitly keeps the
+REGION name generic and expects userspace to always consider the
+region-attributes for four reasons:
+
+    1. There are already more than two REGION and "namespace" types. For
+       PMEM there are two subtypes. As mentioned previously we have PMEM where
+       the constituent DIMM devices are known and anonymous PMEM. For BLK
+       regions the NFIT specification already anticipates vendor specific
+       implementations.
The exact distinction of what a region contains is in + the region-attributes not the region-name or the region-devtype. + + 2. A region with zero child-namespaces is a possible configuration. For + example, the NFIT allows for a DCR to be published without a + corresponding BLK-aperture. This equates to a DIMM that can only accept + control/configuration messages, but no i/o through a descendant block + device. Again, this "type" is advertised in the attributes ('mappings' + == 0) and the name does not tell you much. + + 3. What if a third major interface type arises in the future? Outside + of vendor specific implementations, it's not difficult to envision a + third class of interface type beyond BLK and PMEM. With a generic name + for the REGION level of the device-hierarchy old userspace + implementations can still make sense of new kernel advertised + region-types. Userspace can always rely on the generic region + attributes like "mappings", "size", etc and the expected child devices + named "namespace". This generic format of the device-model hierarchy + allows the LIBNVDIMM and LIBNDCTL implementations to be more uniform and + future-proof. + + 4. There are more robust mechanisms for determining the major type of a + region than a device name. See the next section, How Do I Determine the + Major Type of a Region? + +How Do I Determine the Major Type of a Region? +---------------------------------------------- + +Outside of the blanket recommendation of "use libndctl", or simply +looking at the kernel header (/usr/include/linux/ndctl.h) to decode the +"nstype" integer attribute, here are some other options. + +1. module alias lookup +^^^^^^^^^^^^^^^^^^^^^^ + + The whole point of region/namespace device type differentiation is to + decide which block-device driver will attach to a given LIBNVDIMM namespace. + One can simply use the modalias to lookup the resulting module. It's + important to note that this method is robust in the presence of a + vendor-specific driver down the road. If a vendor-specific + implementation wants to supplant the standard nd_blk driver it can with + minimal impact to the rest of LIBNVDIMM. + + In fact, a vendor may also want to have a vendor-specific region-driver + (outside of nd_region). For example, if a vendor defined its own LABEL + format it would need its own region driver to parse that LABEL and emit + the resulting namespaces. The output from module resolution is more + accurate than a region-name or region-devtype. + +2. udev +^^^^^^^ + + The kernel "devtype" is registered in the udev database:: + + # udevadm info --path=/devices/platform/nfit_test.0/ndbus0/region0 + P: /devices/platform/nfit_test.0/ndbus0/region0 + E: DEVPATH=/devices/platform/nfit_test.0/ndbus0/region0 + E: DEVTYPE=nd_pmem + E: MODALIAS=nd:t2 + E: SUBSYSTEM=nd + + # udevadm info --path=/devices/platform/nfit_test.0/ndbus0/region4 + P: /devices/platform/nfit_test.0/ndbus0/region4 + E: DEVPATH=/devices/platform/nfit_test.0/ndbus0/region4 + E: DEVTYPE=nd_blk + E: MODALIAS=nd:t3 + E: SUBSYSTEM=nd + + ...and is available as a region attribute, but keep in mind that the + "devtype" does not indicate sub-type variations and scripts should + really be understanding the other attributes. + +3. type specific attributes +^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + As it currently stands a BLK-aperture region will never have a + "nfit/spa_index" attribute, but neither will a non-NFIT PMEM region. A + BLK region with a "mappings" value of 0 is, as mentioned above, a DIMM + that does not allow I/O. 
A PMEM region with a "mappings" value of zero
+is a simple system-physical-address range.
+
+
+LIBNVDIMM/LIBNDCTL: Namespace
+-----------------------------
+
+A REGION, after resolving DPA aliasing and LABEL specified boundaries,
+surfaces one or more "namespace" devices. The arrival of a "namespace"
+device currently triggers either the nd_blk or nd_pmem driver to load
+and register a disk/block device.
+
+LIBNVDIMM: namespace
+^^^^^^^^^^^^^^^^^^^^
+
+Here is a sample layout from the three major types of NAMESPACE where
+namespace0.0 represents DIMM-info-backed PMEM (note that it has a 'uuid'
+attribute), namespace2.0 represents a BLK namespace (note that it has a
+'sector_size' attribute), and namespace6.0 represents an anonymous PMEM
+namespace (note that it has no 'uuid' attribute because anonymous PMEM
+does not support a LABEL)::
+
+	/sys/devices/platform/nfit_test.0/ndbus0/region0/namespace0.0
+	|-- alt_name
+	|-- devtype
+	|-- dpa_extents
+	|-- force_raw
+	|-- modalias
+	|-- numa_node
+	|-- resource
+	|-- size
+	|-- subsystem -> ../../../../../../bus/nd
+	|-- type
+	|-- uevent
+	`-- uuid
+	/sys/devices/platform/nfit_test.0/ndbus0/region2/namespace2.0
+	|-- alt_name
+	|-- devtype
+	|-- dpa_extents
+	|-- force_raw
+	|-- modalias
+	|-- numa_node
+	|-- sector_size
+	|-- size
+	|-- subsystem -> ../../../../../../bus/nd
+	|-- type
+	|-- uevent
+	`-- uuid
+	/sys/devices/platform/nfit_test.1/ndbus1/region6/namespace6.0
+	|-- block
+	|   `-- pmem0
+	|-- devtype
+	|-- driver -> ../../../../../../bus/nd/drivers/pmem
+	|-- force_raw
+	|-- modalias
+	|-- numa_node
+	|-- resource
+	|-- size
+	|-- subsystem -> ../../../../../../bus/nd
+	|-- type
+	`-- uevent
+
+LIBNDCTL: namespace enumeration example
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Namespaces are indexed relative to their parent region; see the example
+below. These indexes are mostly static from boot to boot, but the
+subsystem makes no guarantees in this regard. For a static namespace
+identifier use its 'uuid' attribute.
+
+::
+
+	static struct ndctl_namespace
+	*get_namespace_by_id(struct ndctl_region *region, unsigned int id)
+	{
+		struct ndctl_namespace *ndns;
+
+		ndctl_namespace_foreach(region, ndns)
+			if (ndctl_namespace_get_id(ndns) == id)
+				return ndns;
+
+		return NULL;
+	}
+
+LIBNDCTL: namespace creation example
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Idle namespaces are automatically created by the kernel if a given
+region has enough available capacity to create a new namespace.
+Namespace instantiation involves finding an idle namespace and
+configuring it. For the most part the setting of namespace attributes
+can occur in any order, the only constraint is that 'uuid' must be set
+before 'size'. This enables the kernel to track DPA allocations
+internally with a static identifier::
+
+	static int configure_namespace(struct ndctl_region *region,
+			struct ndctl_namespace *ndns,
+			struct namespace_parameters *parameters)
+	{
+		char devname[50];
+
+		snprintf(devname, sizeof(devname), "namespace%d.%d",
+				ndctl_region_get_id(region), parameters->id);
+
+		ndctl_namespace_set_alt_name(ndns, devname);
+		/* 'uuid' must be set prior to setting size! */
+		ndctl_namespace_set_uuid(ndns, parameters->uuid);
+		ndctl_namespace_set_size(ndns, parameters->size);
+		/* unlike pmem namespaces, blk namespaces have a sector size */
+		if (parameters->lbasize)
+			ndctl_namespace_set_sector_size(ndns, parameters->lbasize);
+		return ndctl_namespace_enable(ndns);
+	}
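+
+A hypothetical wrapper (illustrative only) then reduces namespace creation
+to finding an idle, zero-sized namespace in the region and configuring it
+with the routine above::
+
+	static struct ndctl_namespace *create_namespace(struct ndctl_region *region,
+			struct namespace_parameters *parameters)
+	{
+		struct ndctl_namespace *ndns;
+
+		/* an idle namespace reports a size of zero */
+		ndctl_namespace_foreach(region, ndns)
+			if (ndctl_namespace_get_size(ndns) == 0
+					&& configure_namespace(region, ndns,
+						parameters) == 0)
+				return ndns;
+		return NULL;
+	}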
"volume" ran the risk of confusing + ND (libnvdimm subsystem) to a volume manager like device-mapper. + + 2. The term originated to describe the sub-devices that can be created + within a NVME controller (see the nvme specification: + http://www.nvmexpress.org/specifications/), and NFIT namespaces are + meant to parallel the capabilities and configurability of + NVME-namespaces. + + +LIBNVDIMM/LIBNDCTL: Block Translation Table "btt" +------------------------------------------------- + +A BTT (design document: http://pmem.io/2014/09/23/btt.html) is a stacked +block device driver that fronts either the whole block device or a +partition of a block device emitted by either a PMEM or BLK NAMESPACE. + +LIBNVDIMM: btt layout +^^^^^^^^^^^^^^^^^^^^^ + +Every region will start out with at least one BTT device which is the +seed device. To activate it set the "namespace", "uuid", and +"sector_size" attributes and then bind the device to the nd_pmem or +nd_blk driver depending on the region type:: + + /sys/devices/platform/nfit_test.1/ndbus0/region0/btt0/ + |-- namespace + |-- delete + |-- devtype + |-- modalias + |-- numa_node + |-- sector_size + |-- subsystem -> ../../../../../bus/nd + |-- uevent + `-- uuid + +LIBNDCTL: btt creation example +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Similar to namespaces an idle BTT device is automatically created per +region. Each time this "seed" btt device is configured and enabled a new +seed is created. Creating a BTT configuration involves two steps of +finding and idle BTT and assigning it to consume a PMEM or BLK namespace:: + + static struct ndctl_btt *get_idle_btt(struct ndctl_region *region) + { + struct ndctl_btt *btt; + + ndctl_btt_foreach(region, btt) + if (!ndctl_btt_is_enabled(btt) + && !ndctl_btt_is_configured(btt)) + return btt; + + return NULL; + } + + static int configure_btt(struct ndctl_region *region, + struct btt_parameters *parameters) + { + btt = get_idle_btt(region); + + ndctl_btt_set_uuid(btt, parameters->uuid); + ndctl_btt_set_sector_size(btt, parameters->sector_size); + ndctl_btt_set_namespace(btt, parameters->ndns); + /* turn off raw mode device */ + ndctl_namespace_disable(parameters->ndns); + /* turn on btt access */ + ndctl_btt_enable(btt); + } + +Once instantiated a new inactive btt seed device will appear underneath +the region. + +Once a "namespace" is removed from a BTT that instance of the BTT device +will be deleted or otherwise reset to default values. This deletion is +only at the device model level. In order to destroy a BTT the "info +block" needs to be destroyed. Note, that to destroy a BTT the media +needs to be written in raw mode. By default, the kernel will autodetect +the presence of a BTT and disable raw mode. This autodetect behavior +can be suppressed by enabling raw mode for the namespace via the +ndctl_namespace_set_raw_mode() API. 
+ + +Summary LIBNDCTL Diagram +------------------------ + +For the given example above, here is the view of the objects as seen by the +LIBNDCTL API:: + + +---+ + |CTX| +---------+ +--------------+ +---------------+ + +-+-+ +-> REGION0 +---> NAMESPACE0.0 +--> PMEM8 "pm0.0" | + | | +---------+ +--------------+ +---------------+ + +-------+ | | +---------+ +--------------+ +---------------+ + | DIMM0 <-+ | +-> REGION1 +---> NAMESPACE1.0 +--> PMEM6 "pm1.0" | + +-------+ | | | +---------+ +--------------+ +---------------+ + | DIMM1 <-+ +-v--+ | +---------+ +--------------+ +---------------+ + +-------+ +-+BUS0+---> REGION2 +-+-> NAMESPACE2.0 +--> ND6 "blk2.0" | + | DIMM2 <-+ +----+ | +---------+ | +--------------+ +----------------------+ + +-------+ | | +-> NAMESPACE2.1 +--> ND5 "blk2.1" | BTT2 | + | DIMM3 <-+ | +--------------+ +----------------------+ + +-------+ | +---------+ +--------------+ +---------------+ + +-> REGION3 +-+-> NAMESPACE3.0 +--> ND4 "blk3.0" | + | +---------+ | +--------------+ +----------------------+ + | +-> NAMESPACE3.1 +--> ND3 "blk3.1" | BTT1 | + | +--------------+ +----------------------+ + | +---------+ +--------------+ +---------------+ + +-> REGION4 +---> NAMESPACE4.0 +--> ND2 "blk4.0" | + | +---------+ +--------------+ +---------------+ + | +---------+ +--------------+ +----------------------+ + +-> REGION5 +---> NAMESPACE5.0 +--> ND1 "blk5.0" | BTT0 | + +---------+ +--------------+ +---------------+------+ diff --git a/Documentation/nvdimm/nvdimm.txt b/Documentation/nvdimm/nvdimm.txt deleted file mode 100644 index 1669f626b037..000000000000 --- a/Documentation/nvdimm/nvdimm.txt +++ /dev/null @@ -1,815 +0,0 @@ - LIBNVDIMM: Non-Volatile Devices - libnvdimm - kernel / libndctl - userspace helper library - linux-nvdimm@lists.01.org - v13 - - - Glossary - Overview - Supporting Documents - Git Trees - LIBNVDIMM PMEM and BLK - Why BLK? - PMEM vs BLK - BLK-REGIONs, PMEM-REGIONs, Atomic Sectors, and DAX - Example NVDIMM Platform - LIBNVDIMM Kernel Device Model and LIBNDCTL Userspace API - LIBNDCTL: Context - libndctl: instantiate a new library context example - LIBNVDIMM/LIBNDCTL: Bus - libnvdimm: control class device in /sys/class - libnvdimm: bus - libndctl: bus enumeration example - LIBNVDIMM/LIBNDCTL: DIMM (NMEM) - libnvdimm: DIMM (NMEM) - libndctl: DIMM enumeration example - LIBNVDIMM/LIBNDCTL: Region - libnvdimm: region - libndctl: region enumeration example - Why Not Encode the Region Type into the Region Name? - How Do I Determine the Major Type of a Region? - LIBNVDIMM/LIBNDCTL: Namespace - libnvdimm: namespace - libndctl: namespace enumeration example - libndctl: namespace creation example - Why the Term "namespace"? - LIBNVDIMM/LIBNDCTL: Block Translation Table "btt" - libnvdimm: btt layout - libndctl: btt creation example - Summary LIBNDCTL Diagram - - -Glossary --------- - -PMEM: A system-physical-address range where writes are persistent. A -block device composed of PMEM is capable of DAX. A PMEM address range -may span an interleave of several DIMMs. - -BLK: A set of one or more programmable memory mapped apertures provided -by a DIMM to access its media. This indirection precludes the -performance benefit of interleaving, but enables DIMM-bounded failure -modes. - -DPA: DIMM Physical Address, is a DIMM-relative offset. With one DIMM in -the system there would be a 1:1 system-physical-address:DPA association. -Once more DIMMs are added a memory controller interleave must be -decoded to determine the DPA associated with a given -system-physical-address. 
-BLK capacity always has a 1:1 relationship
-with a single-DIMM's DPA range.
-
-DAX: File system extensions to bypass the page cache and block layer to
-mmap persistent memory, from a PMEM block device, directly into a
-process address space.
-
-DSM: Device Specific Method: ACPI method to control a specific
-device - in this case, the firmware.
-
-DCR: NVDIMM Control Region Structure defined in ACPI 6 Section 5.2.25.5.
-It defines a vendor-id, device-id, and interface format for a given DIMM.
-
-BTT: Block Translation Table: Persistent memory is byte addressable.
-Existing software may have an expectation that the power-fail-atomicity
-of writes is at least one sector, 512 bytes. The BTT is an indirection
-table with atomic update semantics to front a PMEM/BLK block device
-driver and present arbitrary atomic sector sizes.
-
-LABEL: Metadata stored on a DIMM device that partitions and identifies
-(persistently names) storage between PMEM and BLK. It also partitions
-BLK storage to host BTTs with different parameters per BLK-partition.
-Note that traditional partition tables, GPT/MBR, are layered on top of a
-BLK or PMEM device.
-
-
-Overview
---------
-
-The LIBNVDIMM subsystem provides support for three types of NVDIMMs, namely,
-PMEM, BLK, and NVDIMM devices that can simultaneously support both PMEM
-and BLK mode access. These three modes of operation are described by
-the "NVDIMM Firmware Interface Table" (NFIT) in ACPI 6. While the LIBNVDIMM
-implementation is generic and supports pre-NFIT platforms, it was guided
-by the superset of capabilities needed to support this ACPI 6 definition
-for NVDIMM resources. The bulk of the kernel implementation is in place
-to handle the case where DPA accessible via PMEM is aliased with DPA
-accessible via BLK. When that occurs a LABEL is needed to reserve DPA
-for exclusive access via one mode at a time.
-
-Supporting Documents
-ACPI 6: http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
-NVDIMM Namespace: http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf
-DSM Interface Example: http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
-Driver Writer's Guide: http://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdf
-
-Git Trees
-LIBNVDIMM: https://git.kernel.org/cgit/linux/kernel/git/djbw/nvdimm.git
-LIBNDCTL: https://github.com/pmem/ndctl.git
-PMEM: https://github.com/01org/prd
-
-
-LIBNVDIMM PMEM and BLK
-----------------------
-
-Prior to the arrival of the NFIT, non-volatile memory was described to a
-system in various ad-hoc ways. Usually only the bare minimum was
-provided, namely, a single system-physical-address range where writes
-are expected to be durable after a system power loss. Now, the NFIT
-specification standardizes not only the description of PMEM, but also
-BLK and platform message-passing entry points for control and
-configuration.
-
-For each NVDIMM access method (PMEM, BLK), LIBNVDIMM provides a block
-device driver:
-
-  1. PMEM (nd_pmem.ko): Drives a system-physical-address range. This
-  range is contiguous in system memory and may be interleaved (hardware
-  memory controller striped) across multiple DIMMs. When interleaved the
-  platform may optionally provide details of which DIMMs are participating
-  in the interleave.
-
-  Note that while LIBNVDIMM describes system-physical-address ranges that may
-  alias with BLK access as ND_NAMESPACE_PMEM ranges and those without
-  alias as ND_NAMESPACE_IO ranges, to the nd_pmem driver there is no
-  distinction.
-  The different device-types are an implementation detail
-  that userspace can exploit to implement policies like "only interface
-  with address ranges from certain DIMMs". It is worth noting that when
-  aliasing is present and a DIMM lacks a label, then no block device can
-  be created by default as userspace needs to do at least one allocation
-  of DPA to the PMEM range. In contrast ND_NAMESPACE_IO ranges, once
-  registered, can be immediately attached to nd_pmem.
-
-  2. BLK (nd_blk.ko): This driver performs I/O using a set of platform
-  defined apertures. A set of apertures will access just one DIMM.
-  Multiple windows (apertures) allow multiple concurrent accesses, much like
-  tagged-command-queuing, and would likely be used by different threads or
-  different CPUs.
-
-  The NFIT specification defines a standard format for a BLK-aperture, but
-  the spec also allows for vendor specific layouts, and non-NFIT BLK
-  implementations may have other designs for BLK I/O. For this reason
-  "nd_blk" calls back into platform-specific code to perform the I/O.
-  One such implementation is defined in the "Driver Writer's Guide" and "DSM
-  Interface Example".
-
-
-Why BLK?
---------
-
-While PMEM provides direct byte-addressable CPU-load/store access to
-NVDIMM storage, it does not provide the best system RAS (recovery,
-availability, and serviceability) model. An access to a corrupted
-system-physical-address causes a CPU exception while an access
-to a corrupted address through a BLK-aperture causes that block window
-to raise an error status in a register. The latter is more aligned with
-the standard error model that host-bus-adapter attached disks present.
-Also, if an administrator ever wants to replace a memory module it is
-easier to service a system at DIMM module boundaries. Compare this to
-PMEM where data could be interleaved in an opaque hardware specific
-manner across several DIMMs.
-
-PMEM vs BLK
------------
-
-BLK-apertures solve these RAS problems, but their presence is also the
-major contributing factor to the complexity of the ND subsystem. They
-complicate the implementation because PMEM and BLK alias in DPA space.
-Any given DIMM's DPA-range may contribute to one or more
-system-physical-address sets of interleaved DIMMs, *and* may also be
-accessed in its entirety through its BLK-aperture. Accessing a DPA
-through a system-physical-address while simultaneously accessing the
-same DPA through a BLK-aperture has undefined results. For this reason,
-DIMMs with this dual interface configuration include a DSM function to
-store/retrieve a LABEL. The LABEL effectively partitions the DPA-space
-into exclusive system-physical-address and BLK-aperture accessible
-regions. For simplicity a DIMM is allowed one PMEM "region" for each
-interleave set in which it is a member. The remaining DPA space can be
-carved into an arbitrary number of BLK devices with discontiguous
-extents.
-
-BLK-REGIONs, PMEM-REGIONs, Atomic Sectors, and DAX
---------------------------------------------------
-
-One of the few reasons to allow multiple BLK namespaces per REGION is so
-that each BLK-namespace can be configured with a BTT with unique atomic
-sector sizes. While a PMEM device can host a BTT, the LABEL specification
-does not provide for a sector size to be specified for a PMEM namespace.
-This is due to the expectation that the primary usage model for PMEM is
-via DAX, and the BTT is incompatible with DAX.
-However, for the cases
-where an application or filesystem still needs atomic sector update
-guarantees it can register a BTT on a PMEM device or partition. See
-LIBNVDIMM/LIBNDCTL: Block Translation Table "btt".
-
-
-Example NVDIMM Platform
------------------------
-
-For the remainder of this document the following diagram will be
-referenced for any example sysfs layouts.
-
-
-                             (a)               (b)          DIMM   BLK-REGION
-          +-------------------+--------+--------+--------+
-+------+  |       pm0.0       | blk2.0 | pm1.0  | blk2.1 |    0      region2
-| imc0 +--+- - - region0- - - +--------+        +--------+
-+--+---+  |       pm0.0       | blk3.0 | pm1.0  | blk3.1 |    1      region3
-   |      +-------------------+--------v        v--------+
-+--+---+                               |                 |
-| cpu0 |                                     region1
-+--+---+                               |                 |
-   |      +----------------------------^        ^--------+
-+--+---+  |           blk4.0           | pm1.0  | blk4.0 |    2      region4
-| imc1 +--+----------------------------|        +--------+
-+------+  |           blk5.0           | pm1.0  | blk5.0 |    3      region5
-          +----------------------------+--------+--------+
-
-In this platform we have four DIMMs and two memory controllers in one
-socket. Each unique interface (BLK or PMEM) to DPA space is identified
-by a region device with a dynamically assigned id (REGION0 - REGION5).
-
-  1. The first portion of DIMM0 and DIMM1 are interleaved as REGION0. A
-  single PMEM namespace is created in the REGION0-SPA-range that spans most
-  of DIMM0 and DIMM1 with a user-specified name of "pm0.0". Some of that
-  interleaved system-physical-address range is reclaimed as BLK-aperture
-  accessed space starting at DPA-offset (a) into each DIMM. In that
-  reclaimed space we create two BLK-aperture "namespaces" from REGION2 and
-  REGION3 where "blk2.0" and "blk3.0" are just human readable names that
-  could be set to any user-desired name in the LABEL.
-
-  2. In the last portion of DIMM0 and DIMM1 we have an interleaved
-  system-physical-address range, REGION1, that spans those two DIMMs as
-  well as DIMM2 and DIMM3. Some of REGION1 is allocated to a PMEM namespace
-  named "pm1.0", the rest is reclaimed in 4 BLK-aperture namespaces (for
-  each DIMM in the interleave set), "blk2.1", "blk3.1", "blk4.0", and
-  "blk5.0".
-
-  3. The portion of DIMM2 and DIMM3 that does not participate in the
-  REGION1 interleaved system-physical-address range (i.e. the DPA address
-  past offset (b)) is also included in the "blk4.0" and "blk5.0"
-  namespaces. Note that this example shows that BLK-aperture namespaces
-  don't need to be contiguous in DPA-space.
-
-  This bus is provided by the kernel under the device
-  /sys/devices/platform/nfit_test.0 when CONFIG_NFIT_TEST is enabled and
-  the nfit_test.ko module is loaded. This not only tests LIBNVDIMM but the
-  acpi_nfit.ko driver as well.
-
-
-LIBNVDIMM Kernel Device Model and LIBNDCTL Userspace API
---------------------------------------------------------
-
-What follows is a description of the LIBNVDIMM sysfs layout and a
-corresponding object hierarchy diagram as viewed through the LIBNDCTL
-API. The example sysfs paths and diagrams are relative to the Example
-NVDIMM Platform which is also the LIBNVDIMM bus used in the LIBNDCTL unit
-test.
-
-LIBNDCTL: Context
-
-Every API call in the LIBNDCTL library requires a context that holds the
-logging parameters and other library instance state. The library is
-based on the libabc template:
-https://git.kernel.org/cgit/linux/kernel/git/kay/libabc.git
-
-LIBNDCTL: instantiate a new library context example
-
-	struct ndctl_ctx *ctx;
-
-	if (ndctl_new(&ctx) == 0)
-		return ctx;
-	else
-		return NULL;
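For completeness, here is a minimal, self-contained sketch of acquiring and
releasing a context; it assumes only the documented ndctl_new() behavior plus
the standard libndctl ndctl_unref() release call::

	#include <ndctl/libndctl.h>

	int main(void)
	{
		struct ndctl_ctx *ctx;

		/* ndctl_new() returns 0 on success */
		if (ndctl_new(&ctx) != 0)
			return 1;

		/* ... enumerate buses, dimms, regions, namespaces ... */

		/* drop the reference taken by ndctl_new() */
		ndctl_unref(ctx);
		return 0;
	}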
-
-LIBNVDIMM/LIBNDCTL: Bus
------------------------
-
-A bus has a 1:1 relationship with an NFIT. The current expectation for
-ACPI based systems is that there is only ever one platform-global NFIT.
-That said, it is trivial to register multiple NFITs; the specification
-does not preclude it. The infrastructure supports multiple busses and
-we use this capability to test multiple NFIT configurations in the unit
-test.
-
-LIBNVDIMM: control class device in /sys/class
-
-This character device accepts DSM messages to be passed to the DIMM
-identified by its NFIT handle.
-
-	/sys/class/nd/ndctl0
-	|-- dev
-	|-- device -> ../../../ndbus0
-	|-- subsystem -> ../../../../../../../class/nd
-
-
-LIBNVDIMM: bus
-
-	struct nvdimm_bus *nvdimm_bus_register(struct device *parent,
-			struct nvdimm_bus_descriptor *nfit_desc);
-
-	/sys/devices/platform/nfit_test.0/ndbus0
-	|-- commands
-	|-- nd
-	|-- nfit
-	|-- nmem0
-	|-- nmem1
-	|-- nmem2
-	|-- nmem3
-	|-- power
-	|-- provider
-	|-- region0
-	|-- region1
-	|-- region2
-	|-- region3
-	|-- region4
-	|-- region5
-	|-- uevent
-	`-- wait_probe
-
-LIBNDCTL: bus enumeration example
-
-Find the bus handle that describes the bus from the Example NVDIMM Platform:
-
-	static struct ndctl_bus *get_bus_by_provider(struct ndctl_ctx *ctx,
-			const char *provider)
-	{
-		struct ndctl_bus *bus;
-
-		ndctl_bus_foreach(ctx, bus)
-			if (strcmp(provider, ndctl_bus_get_provider(bus)) == 0)
-				return bus;
-
-		return NULL;
-	}
-
-	bus = get_bus_by_provider(ctx, "nfit_test.0");
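If the provider name is not known in advance, the same iterator can simply
list every registered bus. A sketch, assuming the standard libndctl devname
and provider accessors::

	#include <stdio.h>
	#include <ndctl/libndctl.h>

	static void list_buses(struct ndctl_ctx *ctx)
	{
		struct ndctl_bus *bus;

		/* e.g. "ndbus0: provider 'nfit_test.0'" */
		ndctl_bus_foreach(ctx, bus)
			printf("%s: provider '%s'\n",
			       ndctl_bus_get_devname(bus),
			       ndctl_bus_get_provider(bus));
	}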
-
-LIBNVDIMM/LIBNDCTL: DIMM (NMEM)
--------------------------------
-
-The DIMM device provides a character device for sending commands to
-hardware, and it is a container for LABELs. If the DIMM is defined by
-NFIT then an optional 'nfit' attribute sub-directory is available to add
-NFIT-specifics.
-
-Note that the kernel device name for "DIMMs" is "nmemX". The NFIT
-describes these devices via "Memory Device to System Physical Address
-Range Mapping Structure", and there is no requirement that they actually
-be physical DIMMs, so we use a more generic name.
-
-LIBNVDIMM: DIMM (NMEM)
-
-	struct nvdimm *nvdimm_create(struct nvdimm_bus *nvdimm_bus, void *provider_data,
-			const struct attribute_group **groups, unsigned long flags,
-			unsigned long *dsm_mask);
-
-	/sys/devices/platform/nfit_test.0/ndbus0
-	|-- nmem0
-	|   |-- available_slots
-	|   |-- commands
-	|   |-- dev
-	|   |-- devtype
-	|   |-- driver -> ../../../../../bus/nd/drivers/nvdimm
-	|   |-- modalias
-	|   |-- nfit
-	|   |   |-- device
-	|   |   |-- format
-	|   |   |-- handle
-	|   |   |-- phys_id
-	|   |   |-- rev_id
-	|   |   |-- serial
-	|   |   `-- vendor
-	|   |-- state
-	|   |-- subsystem -> ../../../../../bus/nd
-	|   `-- uevent
-	|-- nmem1
-	[..]
-
-
-LIBNDCTL: DIMM enumeration example
-
-Note, in this example we are assuming NFIT-defined DIMMs which are
-identified by an "nfit_handle", a 32-bit value where:
-
-	Bit 3:0   DIMM number within the memory channel
-	Bit 7:4   memory channel number
-	Bit 11:8  memory controller ID
-	Bit 15:12 socket ID (within scope of a Node controller if node
-	          controller is present)
-	Bit 27:16 Node Controller ID
-	Bit 31:28 Reserved
-
-	static struct ndctl_dimm *get_dimm_by_handle(struct ndctl_bus *bus,
-			unsigned int handle)
-	{
-		struct ndctl_dimm *dimm;
-
-		ndctl_dimm_foreach(bus, dimm)
-			if (ndctl_dimm_get_handle(dimm) == handle)
-				return dimm;
-
-		return NULL;
-	}
-
-	#define DIMM_HANDLE(n, s, i, c, d) \
-		(((n & 0xfff) << 16) | ((s & 0xf) << 12) | ((i & 0xf) << 8) \
-		 | ((c & 0xf) << 4) | (d & 0xf))
-
-	dimm = get_dimm_by_handle(bus, DIMM_HANDLE(0, 0, 0, 0, 0));
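Going the other direction, a handle read back from a DIMM can be unpacked
into its fields. A sketch that inverts the DIMM_HANDLE() encoding above,
assuming the standard libndctl devname accessor::

	#include <stdio.h>
	#include <ndctl/libndctl.h>

	static void dump_dimm_handles(struct ndctl_bus *bus)
	{
		struct ndctl_dimm *dimm;

		ndctl_dimm_foreach(bus, dimm) {
			unsigned int h = ndctl_dimm_get_handle(dimm);

			/* fields per the bit layout documented above */
			printf("%s: node %u socket %u imc %u channel %u dimm %u\n",
			       ndctl_dimm_get_devname(dimm),
			       (h >> 16) & 0xfff, (h >> 12) & 0xf,
			       (h >> 8) & 0xf, (h >> 4) & 0xf, h & 0xf);
		}
	}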
-
-LIBNVDIMM/LIBNDCTL: Region
---------------------------
-
-A generic REGION device is registered for each PMEM range or BLK-aperture
-set. Per the example there are 6 regions: 2 PMEM and 4 BLK-aperture
-sets on the "nfit_test.0" bus. The primary role of a region is to be a
-container of "mappings". A mapping is a tuple of
-<dimm, dpa-start-offset, length>.
-
-LIBNVDIMM provides a built-in driver for these REGION devices. This driver
-is responsible for reconciling the aliased DPA mappings across all
-regions, parsing the LABEL, if present, and then emitting NAMESPACE
-devices with the resolved/exclusive DPA-boundaries for the nd_pmem or
-nd_blk device driver to consume.
-
-In addition to the generic attributes of "mapping"s, "interleave_ways",
-and "size", the REGION device also exports some convenience attributes.
-"nstype" indicates the integer type of namespace-device this region
-emits, "devtype" duplicates the DEVTYPE variable stored by udev at the
-'add' event, "modalias" duplicates the MODALIAS variable stored by udev
-at the 'add' event, and finally, the optional "spa_index" is provided in
-the case where the region is defined by a SPA.
-
-LIBNVDIMM: region
-
-	struct nd_region *nvdimm_pmem_region_create(struct nvdimm_bus *nvdimm_bus,
-			struct nd_region_desc *ndr_desc);
-	struct nd_region *nvdimm_blk_region_create(struct nvdimm_bus *nvdimm_bus,
-			struct nd_region_desc *ndr_desc);
-
-	/sys/devices/platform/nfit_test.0/ndbus0
-	|-- region0
-	|   |-- available_size
-	|   |-- btt0
-	|   |-- btt_seed
-	|   |-- devtype
-	|   |-- driver -> ../../../../../bus/nd/drivers/nd_region
-	|   |-- init_namespaces
-	|   |-- mapping0
-	|   |-- mapping1
-	|   |-- mappings
-	|   |-- modalias
-	|   |-- namespace0.0
-	|   |-- namespace_seed
-	|   |-- numa_node
-	|   |-- nfit
-	|   |   `-- spa_index
-	|   |-- nstype
-	|   |-- set_cookie
-	|   |-- size
-	|   |-- subsystem -> ../../../../../bus/nd
-	|   `-- uevent
-	|-- region1
-	[..]
-
-LIBNDCTL: region enumeration example
-
-Sample region retrieval routines based on NFIT-unique data like
-"spa_index" (interleave set id) for PMEM and "nfit_handle" (dimm id) for
-BLK:
-
-	static struct ndctl_region *get_pmem_region_by_spa_index(struct ndctl_bus *bus,
-			unsigned int spa_index)
-	{
-		struct ndctl_region *region;
-
-		ndctl_region_foreach(bus, region) {
-			if (ndctl_region_get_type(region) != ND_DEVICE_REGION_PMEM)
-				continue;
-			if (ndctl_region_get_spa_index(region) == spa_index)
-				return region;
-		}
-		return NULL;
-	}
-
-	static struct ndctl_region *get_blk_region_by_dimm_handle(struct ndctl_bus *bus,
-			unsigned int handle)
-	{
-		struct ndctl_region *region;
-
-		ndctl_region_foreach(bus, region) {
-			struct ndctl_mapping *map;
-
-			if (ndctl_region_get_type(region) != ND_DEVICE_REGION_BLOCK)
-				continue;
-			ndctl_mapping_foreach(region, map) {
-				struct ndctl_dimm *dimm = ndctl_mapping_get_dimm(map);
-
-				if (ndctl_dimm_get_handle(dimm) == handle)
-					return region;
-			}
-		}
-		return NULL;
-	}
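Tying these helpers back to the Example NVDIMM Platform, a hedged usage
sketch (the spa_index value 1 and the zero handle are illustrative, not
guaranteed by any platform)::

	struct ndctl_region *pmem_region, *blk_region;

	/* hypothetical ids, for the sake of the example */
	pmem_region = get_pmem_region_by_spa_index(bus, 1);
	blk_region = get_blk_region_by_dimm_handle(bus,
			DIMM_HANDLE(0, 0, 0, 0, 0));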
-
-Why Not Encode the Region Type into the Region Name?
------------------------------------------------------
-
-At first glance it seems, since the NFIT defines just PMEM and BLK
-interface types, that we should simply name REGION devices with something
-derived from those type names. However, the ND subsystem explicitly keeps
-the REGION name generic and expects userspace to always consider the
-region-attributes for four reasons:
-
-  1. There are already more than two REGION and "namespace" types. For
-  PMEM there are two subtypes. As mentioned previously we have PMEM where
-  the constituent DIMM devices are known and anonymous PMEM. For BLK
-  regions the NFIT specification already anticipates vendor specific
-  implementations. The exact distinction of what a region contains is in
-  the region-attributes not the region-name or the region-devtype.
-
-  2. A region with zero child-namespaces is a possible configuration. For
-  example, the NFIT allows for a DCR to be published without a
-  corresponding BLK-aperture. This equates to a DIMM that can only accept
-  control/configuration messages, but no i/o through a descendant block
-  device. Again, this "type" is advertised in the attributes ('mappings'
-  == 0) and the name does not tell you much.
-
-  3. What if a third major interface type arises in the future? Outside
-  of vendor specific implementations, it's not difficult to envision a
-  third class of interface type beyond BLK and PMEM. With a generic name
-  for the REGION level of the device-hierarchy old userspace
-  implementations can still make sense of new kernel advertised
-  region-types. Userspace can always rely on the generic region
-  attributes like "mappings", "size", etc., and the expected child devices
-  named "namespace". This generic format of the device-model hierarchy
-  allows the LIBNVDIMM and LIBNDCTL implementations to be more uniform and
-  future-proof.
-
-  4. There are more robust mechanisms for determining the major type of a
-  region than a device name. See the next section, How Do I Determine the
-  Major Type of a Region?
-
-How Do I Determine the Major Type of a Region?
-----------------------------------------------
-
-Outside of the blanket recommendation of "use libndctl", or simply
-looking at the kernel header (/usr/include/linux/ndctl.h) to decode the
-"nstype" integer attribute, here are some other options.
-
-  1. module alias lookup:
-
-  The whole point of region/namespace device type differentiation is to
-  decide which block-device driver will attach to a given LIBNVDIMM namespace.
-  One can simply use the modalias to look up the resulting module. It's
-  important to note that this method is robust in the presence of a
-  vendor-specific driver down the road. If a vendor-specific
-  implementation wants to supplant the standard nd_blk driver it can with
-  minimal impact to the rest of LIBNVDIMM.
-
-  In fact, a vendor may also want to have a vendor-specific region-driver
-  (outside of nd_region). For example, if a vendor defined its own LABEL
-  format it would need its own region driver to parse that LABEL and emit
-  the resulting namespaces. The output from module resolution is more
-  accurate than a region-name or region-devtype.
-
-  2. udev:
-
-  The kernel "devtype" is registered in the udev database:
-
-	# udevadm info --path=/devices/platform/nfit_test.0/ndbus0/region0
-	P: /devices/platform/nfit_test.0/ndbus0/region0
-	E: DEVPATH=/devices/platform/nfit_test.0/ndbus0/region0
-	E: DEVTYPE=nd_pmem
-	E: MODALIAS=nd:t2
-	E: SUBSYSTEM=nd
-
-	# udevadm info --path=/devices/platform/nfit_test.0/ndbus0/region4
-	P: /devices/platform/nfit_test.0/ndbus0/region4
-	E: DEVPATH=/devices/platform/nfit_test.0/ndbus0/region4
-	E: DEVTYPE=nd_blk
-	E: MODALIAS=nd:t3
-	E: SUBSYSTEM=nd
-
-  ...and is available as a region attribute, but keep in mind that the
-  "devtype" does not indicate sub-type variations and scripts should
-  really understand the other attributes.
-
-  3. type specific attributes:
-
-  As it currently stands a BLK-aperture region will never have a
-  "nfit/spa_index" attribute, but neither will a non-NFIT PMEM region. A
-  BLK region with a "mappings" value of 0 is, as mentioned above, a DIMM
-  that does not allow I/O. A PMEM region with a "mappings" value of zero
-  is a simple system-physical-address range.
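As a rough illustration of the "decode nstype" route, here is a sketch that
classifies a region without relying on its name; it assumes the libndctl
nstype getter and the ND_DEVICE_* constants from linux/ndctl.h::

	#include <linux/ndctl.h>
	#include <ndctl/libndctl.h>

	/* classify a region by the namespace type it emits */
	static const char *region_kind(struct ndctl_region *region)
	{
		switch (ndctl_region_get_nstype(region)) {
		case ND_DEVICE_NAMESPACE_IO:
		case ND_DEVICE_NAMESPACE_PMEM:
			return "pmem";
		case ND_DEVICE_NAMESPACE_BLK:
			return "blk";
		default:
			return "unknown";
		}
	}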
-
-
-LIBNVDIMM/LIBNDCTL: Namespace
------------------------------
-
-A REGION, after resolving DPA aliasing and LABEL specified boundaries,
-surfaces one or more "namespace" devices. The arrival of a "namespace"
-device currently triggers either the nd_blk or nd_pmem driver to load
-and register a disk/block device.
-
-LIBNVDIMM: namespace
-
-Here is a sample layout from the three major types of NAMESPACE where
-namespace0.0 represents DIMM-info-backed PMEM (note that it has a 'uuid'
-attribute), namespace2.0 represents a BLK namespace (note that it has a
-'sector_size' attribute), and namespace6.0 represents an anonymous PMEM
-namespace (note that it has no 'uuid' attribute because it does not
-support a LABEL):
-
-	/sys/devices/platform/nfit_test.0/ndbus0/region0/namespace0.0
-	|-- alt_name
-	|-- devtype
-	|-- dpa_extents
-	|-- force_raw
-	|-- modalias
-	|-- numa_node
-	|-- resource
-	|-- size
-	|-- subsystem -> ../../../../../../bus/nd
-	|-- type
-	|-- uevent
-	`-- uuid
-	/sys/devices/platform/nfit_test.0/ndbus0/region2/namespace2.0
-	|-- alt_name
-	|-- devtype
-	|-- dpa_extents
-	|-- force_raw
-	|-- modalias
-	|-- numa_node
-	|-- sector_size
-	|-- size
-	|-- subsystem -> ../../../../../../bus/nd
-	|-- type
-	|-- uevent
-	`-- uuid
-	/sys/devices/platform/nfit_test.1/ndbus1/region6/namespace6.0
-	|-- block
-	|   `-- pmem0
-	|-- devtype
-	|-- driver -> ../../../../../../bus/nd/drivers/pmem
-	|-- force_raw
-	|-- modalias
-	|-- numa_node
-	|-- resource
-	|-- size
-	|-- subsystem -> ../../../../../../bus/nd
-	|-- type
-	`-- uevent
-
-LIBNDCTL: namespace enumeration example
-
-Namespaces are indexed relative to their parent region, example below.
-These indexes are mostly static from boot to boot, but the subsystem makes
-no guarantees in this regard. For a static namespace identifier use its
-'uuid' attribute.
-
-	static struct ndctl_namespace *get_namespace_by_id(struct ndctl_region *region,
-			unsigned int id)
-	{
-		struct ndctl_namespace *ndns;
-
-		ndctl_namespace_foreach(region, ndns)
-			if (ndctl_namespace_get_id(ndns) == id)
-				return ndns;
-
-		return NULL;
-	}
-
-LIBNDCTL: namespace creation example
-
-Idle namespaces are automatically created by the kernel if a given
-region has enough available capacity to create a new namespace.
-Namespace instantiation involves finding an idle namespace and
-configuring it. For the most part the setting of namespace attributes
-can occur in any order; the only constraint is that 'uuid' must be set
-before 'size'. This enables the kernel to track DPA allocations
-internally with a static identifier.
-
-	static int configure_namespace(struct ndctl_region *region,
-			struct ndctl_namespace *ndns,
-			struct namespace_parameters *parameters)
-	{
-		char devname[50];
-
-		snprintf(devname, sizeof(devname), "namespace%d.%d",
-				ndctl_region_get_id(region), parameters->id);
-
-		ndctl_namespace_set_alt_name(ndns, devname);
-		/* 'uuid' must be set prior to setting size! */
-		ndctl_namespace_set_uuid(ndns, parameters->uuid);
-		ndctl_namespace_set_size(ndns, parameters->size);
-		/* unlike pmem namespaces, blk namespaces have a sector size */
-		if (parameters->lbasize)
-			ndctl_namespace_set_sector_size(ndns, parameters->lbasize);
-
-		return ndctl_namespace_enable(ndns);
-	}
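A sketch of how the pieces combine; struct namespace_parameters is the
document's illustrative type, and the id, uuid, and size values here are
hypothetical::

	struct namespace_parameters params = {
		.id = 0,		/* hypothetical idle namespace index */
		.uuid = uuid,		/* caller-generated, e.g. uuid_generate() */
		.size = 1ULL << 30,	/* 1 GiB; assumes the region has capacity */
		.lbasize = 0,		/* pmem: no sector size */
	};
	struct ndctl_namespace *ndns = get_namespace_by_id(region, params.id);

	if (ndns)
		configure_namespace(region, ndns, &params);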
-
-
-Why the Term "namespace"?
--------------------------
-
-  1. Why not "volume" for instance? "volume" ran the risk of confusing
-  ND (the libnvdimm subsystem) with a volume manager like device-mapper.
-
-  2. The term originated to describe the sub-devices that can be created
-  within a NVME controller (see the nvme specification:
-  http://www.nvmexpress.org/specifications/), and NFIT namespaces are
-  meant to parallel the capabilities and configurability of
-  NVME-namespaces.
-
-
-LIBNVDIMM/LIBNDCTL: Block Translation Table "btt"
--------------------------------------------------
-
-A BTT (design document: http://pmem.io/2014/09/23/btt.html) is a stacked
-block device driver that fronts either the whole block device or a
-partition of a block device emitted by either a PMEM or BLK NAMESPACE.
-
-LIBNVDIMM: btt layout
-
-Every region will start out with at least one BTT device which is the
-seed device. To activate it set the "namespace", "uuid", and
-"sector_size" attributes and then bind the device to the nd_pmem or
-nd_blk driver depending on the region type.
-
-	/sys/devices/platform/nfit_test.1/ndbus0/region0/btt0/
-	|-- namespace
-	|-- delete
-	|-- devtype
-	|-- modalias
-	|-- numa_node
-	|-- sector_size
-	|-- subsystem -> ../../../../../bus/nd
-	|-- uevent
-	`-- uuid
-
-LIBNDCTL: btt creation example
-
-Similar to namespaces an idle BTT device is automatically created per
-region. Each time this "seed" btt device is configured and enabled a new
-seed is created. Creating a BTT configuration involves two steps:
-finding an idle BTT and assigning it to consume a PMEM or BLK namespace.
-
-	static struct ndctl_btt *get_idle_btt(struct ndctl_region *region)
-	{
-		struct ndctl_btt *btt;
-
-		ndctl_btt_foreach(region, btt)
-			if (!ndctl_btt_is_enabled(btt)
-					&& !ndctl_btt_is_configured(btt))
-				return btt;
-
-		return NULL;
-	}
-
-	static int configure_btt(struct ndctl_region *region,
-			struct btt_parameters *parameters)
-	{
-		struct ndctl_btt *btt = get_idle_btt(region);
-
-		if (!btt)
-			return -ENXIO;
-
-		ndctl_btt_set_uuid(btt, parameters->uuid);
-		ndctl_btt_set_sector_size(btt, parameters->sector_size);
-		ndctl_btt_set_namespace(btt, parameters->ndns);
-		/* turn off raw mode device */
-		ndctl_namespace_disable(parameters->ndns);
-		/* turn on btt access */
-		return ndctl_btt_enable(btt);
-	}
-
-Once instantiated a new inactive btt seed device will appear underneath
-the region.
-
-Once a "namespace" is removed from a BTT that instance of the BTT device
-will be deleted or otherwise reset to default values. This deletion is
-only at the device model level. In order to destroy a BTT the "info
-block" needs to be destroyed. Note that to destroy a BTT the media
-needs to be written in raw mode. By default, the kernel will autodetect
-the presence of a BTT and disable raw mode. This autodetect behavior
-can be suppressed by enabling raw mode for the namespace via the
-ndctl_namespace_set_raw_mode() API.
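For instance, a hedged sketch of the raw-mode dance needed before a BTT info
block can be overwritten (the error handling is elided for brevity)::

	/* suppress BTT autodetect so the media can be rewritten directly */
	ndctl_namespace_disable(ndns);
	ndctl_namespace_set_raw_mode(ndns, 1);
	ndctl_namespace_enable(ndns);
	/* ... zero the info block through the raw block device ... */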
-
-
-Summary LIBNDCTL Diagram
-------------------------
-
-For the given example above, here is the view of the objects as seen by the
-LIBNDCTL API:
-
-              +---+
-              |CTX|    +---------+   +--------------+  +---------------+
-              +-+-+  +-> REGION0 +---> NAMESPACE0.0 +--> PMEM8 "pm0.0" |
-                |    | +---------+   +--------------+  +---------------+
-  +-------+     |    | +---------+   +--------------+  +---------------+
-  | DIMM0 <-+   |    +-> REGION1 +---> NAMESPACE1.0 +--> PMEM6 "pm1.0" |
-  +-------+ |   |    | +---------+   +--------------+  +---------------+
-  | DIMM1 <-+ +-v--+ | +---------+   +--------------+  +---------------+
-  +-------+ +-+BUS0+---> REGION2 +-+-> NAMESPACE2.0 +--> ND6  "blk2.0" |
-  | DIMM2 <-+ +----+ | +---------+ | +--------------+  +----------------------+
-  +-------+ |        |             +-> NAMESPACE2.1 +--> ND5  "blk2.1" | BTT2 |
-  | DIMM3 <-+        |               +--------------+  +----------------------+
-  +-------+          | +---------+   +--------------+  +---------------+
-                     +-> REGION3 +-+-> NAMESPACE3.0 +--> ND4  "blk3.0" |
-                     | +---------+ | +--------------+  +----------------------+
-                     |             +-> NAMESPACE3.1 +--> ND3  "blk3.1" | BTT1 |
-                     |               +--------------+  +----------------------+
-                     | +---------+   +--------------+  +---------------+
-                     +-> REGION4 +---> NAMESPACE4.0 +--> ND2  "blk4.0" |
-                     | +---------+   +--------------+  +---------------+
-                     | +---------+   +--------------+  +----------------------+
-                     +-> REGION5 +---> NAMESPACE5.0 +--> ND1  "blk5.0" | BTT0 |
-                       +---------+   +--------------+  +---------------+------+
-
diff --git a/Documentation/nvdimm/security.rst b/Documentation/nvdimm/security.rst
new file mode 100644
index 000000000000..ad9dea099b34
--- /dev/null
+++ b/Documentation/nvdimm/security.rst
@@ -0,0 +1,143 @@
+===============
+NVDIMM Security
+===============
+
+1. Introduction
+---------------
+
+With the introduction of the Intel Device Specific Methods (DSM) v1.8
+specification [1], security DSMs were introduced. The spec added the following
+security DSMs: "get security state", "set passphrase", "disable passphrase",
+"unlock unit", "freeze lock", "secure erase", and "overwrite". A security_ops
+data structure has been added to struct dimm in order to support the security
+operations, and generic APIs are exposed to allow vendor-neutral operations.
+
+2. Sysfs Interface
+------------------
+
+The "security" sysfs attribute is provided in the nvdimm sysfs directory. For
+example:
+/sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/nmem0/security
+
+Reading ("show") that attribute displays the security state for
+that DIMM. The following states are available: disabled, unlocked, locked,
+frozen, and overwrite. If security is not supported, the sysfs attribute
+will not be visible.
+
+Writing ("store") to the attribute accepts one of the following commands in
+order to drive the security functionality:
+
+update - enable or update passphrase.
+disable - disable enabled security and remove key.
+freeze - freeze changing of security states.
+erase - delete existing user encryption key.
+overwrite - wipe the entire nvdimm.
+master_update - enable or update master passphrase.
+master_erase - delete existing user encryption key.
+
+3. Key Management
+-----------------
+
+The key is associated to the payload by the DIMM id. For example:
+
+# cat /sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/nmem0/nfit/id
+8089-a2-1740-00000133
+
+The DIMM id would be provided along with the key payload (passphrase) to
+the kernel.
+
+The security keys are managed on the basis of a single key per DIMM. The
+key "passphrase" is expected to be 32 bytes long. This is similar to the ATA
+security specification [2]. A key is initially acquired via the request_key()
+kernel API call during nvdimm unlock. It is up to the user to make sure that
+all the keys are in the kernel user keyring for unlock.
+
+An nvdimm encrypted-key of format enc32 has the description format of:
+nvdimm:<bus-provider-specific-unique-id>
+
+See file ``Documentation/security/keys/trusted-encrypted.rst`` for creating
+encrypted-keys of enc32 format. TPM usage with a master trusted key is
+preferred for sealing the encrypted-keys.
+
+4. Unlocking
+------------
+
+When the DIMMs are being enumerated by the kernel, the kernel will attempt to
+retrieve the key from the kernel user keyring. This is the only time
+a locked DIMM can be unlocked. Once unlocked, the DIMM will remain unlocked
+until reboot. Typically an entity (e.g. a shell script) will inject all the
+relevant encrypted-keys into the kernel user keyring during the initramfs phase.
+This provides the unlock function access to all the related keys that contain
+the passphrase for the respective nvdimms. It is also recommended that the
+keys are injected before libnvdimm is loaded by modprobe.
+
+5. Update
+---------
+
+When doing an update, it is expected that the existing key is removed from
+the kernel user keyring and reinjected as a different (old) key. It's
+irrelevant what the key description is for the old key since we are only
+interested in the keyid when doing the update operation. It is also expected
+that the new key is injected with the description format described earlier in
+this document. The update command written to the sysfs attribute has the
+format:
+update <old_keyid> <new_keyid>
+
+If there is no old keyid because security is only now being enabled, then a 0
+should be passed in.
+
+6. Freeze
+---------
+
+The freeze operation does not require any keys. The security config can be
+frozen by a user with root privilege.
+
+7. Disable
+----------
+
+The security disable command format is:
+disable <keyid>
+
+A key with the current passphrase payload that is tied to the nvdimm should be
+in the kernel user keyring.
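To make the command flow concrete, here is a minimal sketch that drives this
sysfs interface from C; the helper name is purely illustrative and the path
is the one from the example above::

	#include <errno.h>
	#include <stdio.h>

	/* write a security command, e.g. "update 0 <keyid>", to a DIMM */
	static int nmem_security_store(const char *security_path, const char *cmd)
	{
		FILE *f = fopen(security_path, "w");
		int rc = 0;

		if (!f)
			return -errno;
		if (fprintf(f, "%s\n", cmd) < 0)
			rc = -EIO;
		if (fclose(f) == EOF && rc == 0)
			rc = -errno;
		return rc;
	}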
+
+8. Secure Erase
+---------------
+
+The command format for doing a secure erase is:
+erase <keyid>
+
+A key with the current passphrase payload that is tied to the nvdimm should be
+in the kernel user keyring.
+
+9. Overwrite
+------------
+
+The command format for doing an overwrite is:
+overwrite <keyid>
+
+Overwrite can be done without a key if security is not enabled. A key serial
+of 0 can be passed in to indicate no key.
+
+The sysfs attribute "security" can be polled to wait on overwrite completion.
+Overwrite can last tens of minutes or more depending on nvdimm size.
+
+An encrypted-key with the current user passphrase that is tied to the nvdimm
+should be injected and its keyid should be passed in via sysfs.
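A sketch of the completion wait; polling sysfs attributes conventionally
means read, poll for POLLPRI, then re-read, but treat the details here as
illustrative rather than a guaranteed interface::

	#include <fcntl.h>
	#include <poll.h>
	#include <unistd.h>

	/* block until the "security" attribute changes, then re-read it */
	static int wait_overwrite(const char *security_path, char *buf, size_t len)
	{
		struct pollfd pfd;
		int fd = open(security_path, O_RDONLY);

		if (fd < 0)
			return -1;
		(void)read(fd, buf, len);	/* arm the sysfs notification */
		pfd.fd = fd;
		pfd.events = POLLPRI | POLLERR;
		poll(&pfd, 1, -1);		/* kernel signals a state change */
		lseek(fd, 0, SEEK_SET);
		(void)read(fd, buf, len);	/* e.g. now "disabled", not "overwrite" */
		close(fd);
		return 0;
	}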
+
+10. Master Update
+-----------------
+
+The command format for doing a master update is:
+update <old_keyid> <new_keyid>
+
+The operating mechanism for master update is identical to update except that
+the master passphrase key is passed to the kernel. The master passphrase key
+is just another encrypted-key.
+
+This command is only available when security is disabled.
+
+11. Master Erase
+----------------
+
+The command format for doing a master erase is:
+master_erase <keyid>
+
+This command has the same operating mechanism as erase except that the master
+passphrase key is passed to the kernel. The master passphrase key is just
+another encrypted-key.
+
+This command is only available when the master security is enabled, indicated
+by the extended security status.
+
+[1]: http://pmem.io/documents/NVDIMM_DSM_Interface-V1.8.pdf
+
+[2]: http://www.t13.org/documents/UploadedDocuments/docs2006/e05179r4-ACS-SecurityClarifications.pdf
diff --git a/Documentation/nvdimm/security.txt b/Documentation/nvdimm/security.txt
deleted file mode 100644
index 4c36c05ca98e..000000000000
--- a/Documentation/nvdimm/security.txt
+++ /dev/null
@@ -1,141 +0,0 @@
-NVDIMM SECURITY
-===============
-
-1. Introduction
----------------
-
-With the introduction of the Intel Device Specific Methods (DSM) v1.8
-specification [1], security DSMs were introduced. The spec added the following
-security DSMs: "get security state", "set passphrase", "disable passphrase",
-"unlock unit", "freeze lock", "secure erase", and "overwrite". A security_ops
-data structure has been added to struct dimm in order to support the security
-operations, and generic APIs are exposed to allow vendor-neutral operations.
-
-2. Sysfs Interface
-------------------
-The "security" sysfs attribute is provided in the nvdimm sysfs directory. For
-example:
-/sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/nmem0/security
-
-Reading ("show") that attribute displays the security state for
-that DIMM. The following states are available: disabled, unlocked, locked,
-frozen, and overwrite. If security is not supported, the sysfs attribute
-will not be visible.
-
-Writing ("store") to the attribute accepts one of the following commands in
-order to drive the security functionality:
-update - enable or update passphrase.
-disable - disable enabled security and remove key.
-freeze - freeze changing of security states.
-erase - delete existing user encryption key.
-overwrite - wipe the entire nvdimm.
-master_update - enable or update master passphrase.
-master_erase - delete existing user encryption key.
-
-3. Key Management
------------------
-
-The key is associated to the payload by the DIMM id. For example:
-# cat /sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/nmem0/nfit/id
-8089-a2-1740-00000133
-The DIMM id would be provided along with the key payload (passphrase) to
-the kernel.
-
-The security keys are managed on the basis of a single key per DIMM. The
-key "passphrase" is expected to be 32 bytes long. This is similar to the ATA
-security specification [2]. A key is initially acquired via the request_key()
-kernel API call during nvdimm unlock. It is up to the user to make sure that
-all the keys are in the kernel user keyring for unlock.
-
-An nvdimm encrypted-key of format enc32 has the description format of:
-nvdimm:<bus-provider-specific-unique-id>
-
-See file ``Documentation/security/keys/trusted-encrypted.rst`` for creating
-encrypted-keys of enc32 format. TPM usage with a master trusted key is
-preferred for sealing the encrypted-keys.
-
-4. Unlocking
-------------
-When the DIMMs are being enumerated by the kernel, the kernel will attempt to
-retrieve the key from the kernel user keyring. This is the only time
-a locked DIMM can be unlocked. Once unlocked, the DIMM will remain unlocked
-until reboot. Typically an entity (e.g. a shell script) will inject all the
-relevant encrypted-keys into the kernel user keyring during the initramfs phase.
-This provides the unlock function access to all the related keys that contain
-the passphrase for the respective nvdimms. It is also recommended that the
-keys are injected before libnvdimm is loaded by modprobe.
-
-5. Update
----------
-When doing an update, it is expected that the existing key is removed from
-the kernel user keyring and reinjected as a different (old) key. It's
-irrelevant what the key description is for the old key since we are only
-interested in the keyid when doing the update operation. It is also expected
-that the new key is injected with the description format described earlier in
-this document. The update command written to the sysfs attribute has the
-format:
-update <old_keyid> <new_keyid>
-
-If there is no old keyid because security is only now being enabled, then a 0
-should be passed in.
-
-6. Freeze
----------
-The freeze operation does not require any keys. The security config can be
-frozen by a user with root privilege.
-
-7. Disable
-----------
-The security disable command format is:
-disable <keyid>
-
-A key with the current passphrase payload that is tied to the nvdimm should be
-in the kernel user keyring.
-
-8. Secure Erase
----------------
-The command format for doing a secure erase is:
-erase <keyid>
-
-A key with the current passphrase payload that is tied to the nvdimm should be
-in the kernel user keyring.
-
-9. Overwrite
-------------
-The command format for doing an overwrite is:
-overwrite <keyid>
-
-Overwrite can be done without a key if security is not enabled. A key serial
-of 0 can be passed in to indicate no key.
-
-The sysfs attribute "security" can be polled to wait on overwrite completion.
-Overwrite can last tens of minutes or more depending on nvdimm size.
-
-An encrypted-key with the current user passphrase that is tied to the nvdimm
-should be injected and its keyid should be passed in via sysfs.
-
-10. Master Update
------------------
-The command format for doing a master update is:
-update <old_keyid> <new_keyid>
-
-The operating mechanism for master update is identical to update except that
-the master passphrase key is passed to the kernel. The master passphrase key
-is just another encrypted-key.
-
-This command is only available when security is disabled.
-
-11. Master Erase
-----------------
-The command format for doing a master erase is:
-master_erase <keyid>
-
-This command has the same operating mechanism as erase except that the master
-passphrase key is passed to the kernel. The master passphrase key is just
-another encrypted-key.
-
-This command is only available when the master security is enabled, indicated
-by the extended security status.
-
-[1]: http://pmem.io/documents/NVDIMM_DSM_Interface-V1.8.pdf
-[2]: http://www.t13.org/documents/UploadedDocuments/docs2006/e05179r4-ACS-SecurityClarifications.pdf
--