From 011f129fee4bd064a3db30ca1a0139548a619482 Mon Sep 17 00:00:00 2001 From: Bagas Sanjaya Date: Wed, 29 Nov 2023 19:39:47 +0700 Subject: Documentation: xfs: consolidate XFS docs into its own subdirectory XFS docs are currently in upper-level Documentation/filesystems. Although these are currently 4 docs, they are already outstanding as a group and can be moved to its own subdirectory. Consolidate them into Documentation/filesystems/xfs/. Signed-off-by: Bagas Sanjaya Reviewed-by: Bill O'Donnell Reviewed-by: "Darrick J. Wong" Signed-off-by: Chandan Babu R --- Documentation/filesystems/index.rst | 5 +- .../filesystems/xfs-delayed-logging-design.rst | 1087 ---- .../filesystems/xfs-maintainer-entry-profile.rst | 194 - .../filesystems/xfs-online-fsck-design.rst | 5315 -------------------- .../filesystems/xfs-self-describing-metadata.rst | 353 -- Documentation/filesystems/xfs/index.rst | 14 + .../filesystems/xfs/xfs-delayed-logging-design.rst | 1087 ++++ .../xfs/xfs-maintainer-entry-profile.rst | 194 + .../filesystems/xfs/xfs-online-fsck-design.rst | 5315 ++++++++++++++++++++ .../xfs/xfs-self-describing-metadata.rst | 353 ++ .../maintainer/maintainer-entry-profile.rst | 2 +- MAINTAINERS | 4 +- 12 files changed, 6967 insertions(+), 6956 deletions(-) delete mode 100644 Documentation/filesystems/xfs-delayed-logging-design.rst delete mode 100644 Documentation/filesystems/xfs-maintainer-entry-profile.rst delete mode 100644 Documentation/filesystems/xfs-online-fsck-design.rst delete mode 100644 Documentation/filesystems/xfs-self-describing-metadata.rst create mode 100644 Documentation/filesystems/xfs/index.rst create mode 100644 Documentation/filesystems/xfs/xfs-delayed-logging-design.rst create mode 100644 Documentation/filesystems/xfs/xfs-maintainer-entry-profile.rst create mode 100644 Documentation/filesystems/xfs/xfs-online-fsck-design.rst create mode 100644 Documentation/filesystems/xfs/xfs-self-describing-metadata.rst diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index 09cade7eaefc..e18bc5ae3b35 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -121,8 +121,5 @@ Documentation for filesystem implementations. udf virtiofs vfat - xfs-delayed-logging-design - xfs-maintainer-entry-profile - xfs-self-describing-metadata - xfs-online-fsck-design + xfs/index zonefs diff --git a/Documentation/filesystems/xfs-delayed-logging-design.rst b/Documentation/filesystems/xfs-delayed-logging-design.rst deleted file mode 100644 index 6402ab8e370c..000000000000 --- a/Documentation/filesystems/xfs-delayed-logging-design.rst +++ /dev/null @@ -1,1087 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 - -================== -XFS Logging Design -================== - -Preamble -======== - -This document describes the design and algorithms that the XFS journalling -subsystem is based on. This document describes the design and algorithms that -the XFS journalling subsystem is based on so that readers may familiarize -themselves with the general concepts of how transaction processing in XFS works. - -We begin with an overview of transactions in XFS, followed by describing how -transaction reservations are structured and accounted, and then move into how we -guarantee forwards progress for long running transactions with finite initial -reservations bounds. At this point we need to explain how relogging works. With -the basic concepts covered, the design of the delayed logging mechanism is -documented. 
- - -Introduction -============ - -XFS uses Write Ahead Logging for ensuring changes to the filesystem metadata -are atomic and recoverable. For reasons of space and time efficiency, the -logging mechanisms are varied and complex, combining intents, logical and -physical logging mechanisms to provide the necessary recovery guarantees the -filesystem requires. - -Some objects, such as inodes and dquots, are logged in logical format where the -details logged are made up of the changes to in-core structures rather than -on-disk structures. Other objects - typically buffers - have their physical -changes logged. Long running atomic modifications have individual changes -chained together by intents, ensuring that journal recovery can restart and -finish an operation that was only partially done when the system stopped -functioning. - -The reason for these differences is to keep the amount of log space and CPU time -required to process objects being modified as small as possible and hence the -logging overhead as low as possible. Some items are very frequently modified, -and some parts of objects are more frequently modified than others, so keeping -the overhead of metadata logging low is of prime importance. - -The method used to log an item or chain modifications together isn't -particularly important in the scope of this document. It suffices to know that -the method used for logging a particular object or chaining modifications -together are different and are dependent on the object and/or modification being -performed. The logging subsystem only cares that certain specific rules are -followed to guarantee forwards progress and prevent deadlocks. - - -Transactions in XFS -=================== - -XFS has two types of high level transactions, defined by the type of log space -reservation they take. These are known as "one shot" and "permanent" -transactions. Permanent transaction reservations can take reservations that span -commit boundaries, whilst "one shot" transactions are for a single atomic -modification. - -The type and size of reservation must be matched to the modification taking -place. This means that permanent transactions can be used for one-shot -modifications, but one-shot reservations cannot be used for permanent -transactions. - -In the code, a one-shot transaction pattern looks somewhat like this:: - - tp = xfs_trans_alloc() - - - - xfs_trans_commit(tp); - -As items are modified in the transaction, the dirty regions in those items are -tracked via the transaction handle. Once the transaction is committed, all -resources joined to it are released, along with the remaining unused reservation -space that was taken at the transaction allocation time. - -In contrast, a permanent transaction is made up of multiple linked individual -transactions, and the pattern looks like this:: - - tp = xfs_trans_alloc() - xfs_ilock(ip, XFS_ILOCK_EXCL) - - loop { - xfs_trans_ijoin(tp, 0); - - xfs_trans_log_inode(tp, ip); - xfs_trans_roll(&tp); - } - - xfs_trans_commit(tp); - xfs_iunlock(ip, XFS_ILOCK_EXCL); - -While this might look similar to a one-shot transaction, there is an important -difference: xfs_trans_roll() performs a specific operation that links two -transactions together:: - - ntp = xfs_trans_dup(tp); - xfs_trans_commit(tp); - xfs_trans_reserve(ntp); - -This results in a series of "rolling transactions" where the inode is locked -across the entire chain of transactions. 
Hence while this series of rolling -transactions is running, nothing else can read from or write to the inode and -this provides a mechanism for complex changes to appear atomic from an external -observer's point of view. - -It is important to note that a series of rolling transactions in a permanent -transaction does not form an atomic change in the journal. While each -individual modification is atomic, the chain is *not atomic*. If we crash half -way through, then recovery will only replay up to the last transactional -modification the loop made that was committed to the journal. - -This affects long running permanent transactions in that it is not possible to -predict how much of a long running operation will actually be recovered because -there is no guarantee of how much of the operation reached stale storage. Hence -if a long running operation requires multiple transactions to fully complete, -the high level operation must use intents and deferred operations to guarantee -recovery can complete the operation once the first transactions is persisted in -the on-disk journal. - - -Transactions are Asynchronous -============================= - -In XFS, all high level transactions are asynchronous by default. This means that -xfs_trans_commit() does not guarantee that the modification has been committed -to stable storage when it returns. Hence when a system crashes, not all the -completed transactions will be replayed during recovery. - -However, the logging subsystem does provide global ordering guarantees, such -that if a specific change is seen after recovery, all metadata modifications -that were committed prior to that change will also be seen. - -For single shot operations that need to reach stable storage immediately, or -ensuring that a long running permanent transaction is fully committed once it is -complete, we can explicitly tag a transaction as synchronous. This will trigger -a "log force" to flush the outstanding committed transactions to stable storage -in the journal and wait for that to complete. - -Synchronous transactions are rarely used, however, because they limit logging -throughput to the IO latency limitations of the underlying storage. Instead, we -tend to use log forces to ensure modifications are on stable storage only when -a user operation requires a synchronisation point to occur (e.g. fsync). - - -Transaction Reservations -======================== - -It has been mentioned a number of times now that the logging subsystem needs to -provide a forwards progress guarantee so that no modification ever stalls -because it can't be written to the journal due to a lack of space in the -journal. This is achieved by the transaction reservations that are made when -a transaction is first allocated. For permanent transactions, these reservations -are maintained as part of the transaction rolling mechanism. - -A transaction reservation provides a guarantee that there is physical log space -available to write the modification into the journal before we start making -modifications to objects and items. As such, the reservation needs to be large -enough to take into account the amount of metadata that the change might need to -log in the worst case. This means that if we are modifying a btree in the -transaction, we have to reserve enough space to record a full leaf-to-root split -of the btree. As such, the reservations are quite complex because we have to -take into account all the hidden changes that might occur. 
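-
-To make the shape of such a calculation concrete, a worst case unit
-reservation can be thought of as a sum over every structure the operation
-might touch, with each btree accounted at its full leaf-to-root split cost.
-The sketch below is purely illustrative - the names are invented for this
-document and the real calculations in the kernel are far more detailed::
-
-    /*
-     * Illustrative only: sum the worst case cost of every structure this
-     * transaction might modify, plus the fixed logging overhead.
-     */
-    unit_res = 0;
-    for_each_structure_possibly_modified(s)
-        unit_res += worst_case_split_cost(s);   /* full leaf-to-root split */
-    unit_res += record_and_region_header_overhead();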
- -For example, a user data extent allocation involves allocating an extent from -free space, which modifies the free space trees. That's two btrees. Inserting -the extent into the inode's extent map might require a split of the extent map -btree, which requires another allocation that can modify the free space trees -again. Then we might have to update reverse mappings, which modifies yet -another btree which might require more space. And so on. Hence the amount of -metadata that a "simple" operation can modify can be quite large. - -This "worst case" calculation provides us with the static "unit reservation" -for the transaction that is calculated at mount time. We must guarantee that the -log has this much space available before the transaction is allowed to proceed -so that when we come to write the dirty metadata into the log we don't run out -of log space half way through the write. - -For one-shot transactions, a single unit space reservation is all that is -required for the transaction to proceed. For permanent transactions, however, we -also have a "log count" that affects the size of the reservation that is to be -made. - -While a permanent transaction can get by with a single unit of space -reservation, it is somewhat inefficient to do this as it requires the -transaction rolling mechanism to re-reserve space on every transaction roll. We -know from the implementation of the permanent transactions how many transaction -rolls are likely for the common modifications that need to be made. - -For example, an inode allocation is typically two transactions - one to -physically allocate a free inode chunk on disk, and another to allocate an inode -from an inode chunk that has free inodes in it. Hence for an inode allocation -transaction, we might set the reservation log count to a value of 2 to indicate -that the common/fast path transaction will commit two linked transactions in a -chain. Each time a permanent transaction rolls, it consumes an entire unit -reservation. - -Hence when the permanent transaction is first allocated, the log space -reservation is increased from a single unit reservation to multiple unit -reservations. That multiple is defined by the reservation log count, and this -means we can roll the transaction multiple times before we have to re-reserve -log space when we roll the transaction. This ensures that the common -modifications we make only need to reserve log space once. - -If the log count for a permanent transaction reaches zero, then it needs to -re-reserve physical space in the log. This is somewhat complex, and requires -an understanding of how the log accounts for space that has been reserved. - - -Log Space Accounting -==================== - -The position in the log is typically referred to as a Log Sequence Number (LSN). -The log is circular, so the positions in the log are defined by the combination -of a cycle number - the number of times the log has been overwritten - and the -offset into the log. A LSN carries the cycle in the upper 32 bits and the -offset in the lower 32 bits. The offset is in units of "basic blocks" (512 -bytes). Hence we can do realtively simple LSN based math to keep track of -available space in the log. - -Log space accounting is done via a pair of constructs called "grant heads". The -position of the grant heads is an absolute value, so the amount of space -available in the log is defined by the distance between the position of the -grant head and the current log tail. 
That is, how much space can be -reserved/consumed before the grant heads would fully wrap the log and overtake -the tail position. - -The first grant head is the "reserve" head. This tracks the byte count of the -reservations currently held by active transactions. It is a purely in-memory -accounting of the space reservation and, as such, actually tracks byte offsets -into the log rather than basic blocks. Hence it technically isn't using LSNs to -represent the log position, but it is still treated like a split {cycle,offset} -tuple for the purposes of tracking reservation space. - -The reserve grant head is used to accurately account for exact transaction -reservations amounts and the exact byte count that modifications actually make -and need to write into the log. The reserve head is used to prevent new -transactions from taking new reservations when the head reaches the current -tail. It will block new reservations in a FIFO queue and as the log tail moves -forward it will wake them in order once sufficient space is available. This FIFO -mechanism ensures no transaction is starved of resources when log space -shortages occur. - -The other grant head is the "write" head. Unlike the reserve head, this grant -head contains an LSN and it tracks the physical space usage in the log. While -this might sound like it is accounting the same state as the reserve grant head -- and it mostly does track exactly the same location as the reserve grant head - -there are critical differences in behaviour between them that provides the -forwards progress guarantees that rolling permanent transactions require. - -These differences when a permanent transaction is rolled and the internal "log -count" reaches zero and the initial set of unit reservations have been -exhausted. At this point, we still require a log space reservation to continue -the next transaction in the sequeunce, but we have none remaining. We cannot -sleep during the transaction commit process waiting for new log space to become -available, as we may end up on the end of the FIFO queue and the items we have -locked while we sleep could end up pinning the tail of the log before there is -enough free space in the log to fulfill all of the pending reservations and -then wake up transaction commit in progress. - -To take a new reservation without sleeping requires us to be able to take a -reservation even if there is no reservation space currently available. That is, -we need to be able to *overcommit* the log reservation space. As has already -been detailed, we cannot overcommit physical log space. However, the reserve -grant head does not track physical space - it only accounts for the amount of -reservations we currently have outstanding. Hence if the reserve head passes -over the tail of the log all it means is that new reservations will be throttled -immediately and remain throttled until the log tail is moved forward far enough -to remove the overcommit and start taking new reservations. In other words, we -can overcommit the reserve head without violating the physical log head and tail -rules. - -As a result, permanent transactions only "regrant" reservation space during -xfs_trans_commit() calls, while the physical log space reservation - tracked by -the write head - is then reserved separately by a call to xfs_log_reserve() -after the commit completes. 
Once the commit completes, we can sleep waiting for -physical log space to be reserved from the write grant head, but only if one -critical rule has been observed:: - - Code using permanent reservations must always log the items they hold - locked across each transaction they roll in the chain. - -"Re-logging" the locked items on every transaction roll ensures that the items -attached to the transaction chain being rolled are always relocated to the -physical head of the log and so do not pin the tail of the log. If a locked item -pins the tail of the log when we sleep on the write reservation, then we will -deadlock the log as we cannot take the locks needed to write back that item and -move the tail of the log forwards to free up write grant space. Re-logging the -locked items avoids this deadlock and guarantees that the log reservation we are -making cannot self-deadlock. - -If all rolling transactions obey this rule, then they can all make forwards -progress independently because nothing will block the progress of the log -tail moving forwards and hence ensuring that write grant space is always -(eventually) made available to permanent transactions no matter how many times -they roll. - - -Re-logging Explained -==================== - -XFS allows multiple separate modifications to a single object to be carried in -the log at any given time. This allows the log to avoid needing to flush each -change to disk before recording a new change to the object. XFS does this via a -method called "re-logging". Conceptually, this is quite simple - all it requires -is that any new change to the object is recorded with a *new copy* of all the -existing changes in the new transaction that is written to the log. - -That is, if we have a sequence of changes A through to F, and the object was -written to disk after change D, we would see in the log the following series -of transactions, their contents and the log sequence number (LSN) of the -transaction:: - - Transaction Contents LSN - A A X - B A+B X+n - C A+B+C X+n+m - D A+B+C+D X+n+m+o - - E E Y (> X+n+m+o) - F E+F Y+p - -In other words, each time an object is relogged, the new transaction contains -the aggregation of all the previous changes currently held only in the log. - -This relogging technique allows objects to be moved forward in the log so that -an object being relogged does not prevent the tail of the log from ever moving -forward. This can be seen in the table above by the changing (increasing) LSN -of each subsequent transaction, and it's the technique that allows us to -implement long-running, multiple-commit permanent transactions. - -A typical example of a rolling transaction is the removal of extents from an -inode which can only be done at a rate of two extents per transaction because -of reservation size limitations. Hence a rolling extent removal transaction -keeps relogging the inode and btree buffers as they get modified in each -removal operation. This keeps them moving forward in the log as the operation -progresses, ensuring that current operation never gets blocked by itself if the -log wraps around. - -Hence it can be seen that the relogging operation is fundamental to the correct -working of the XFS journalling subsystem. From the above description, most -people should be able to see why the XFS metadata operations writes so much to -the log - repeated operations to the same objects write the same changes to -the log over and over again. 
Worse is the fact that objects tend to get -dirtier as they get relogged, so each subsequent transaction is writing more -metadata into the log. - -It should now also be obvious how relogging and asynchronous transactions go -hand in hand. That is, transactions don't get written to the physical journal -until either a log buffer is filled (a log buffer can hold multiple -transactions) or a synchronous operation forces the log buffers holding the -transactions to disk. This means that XFS is doing aggregation of transactions -in memory - batching them, if you like - to minimise the impact of the log IO on -transaction throughput. - -The limitation on asynchronous transaction throughput is the number and size of -log buffers made available by the log manager. By default there are 8 log -buffers available and the size of each is 32kB - the size can be increased up -to 256kB by use of a mount option. - -Effectively, this gives us the maximum bound of outstanding metadata changes -that can be made to the filesystem at any point in time - if all the log -buffers are full and under IO, then no more transactions can be committed until -the current batch completes. It is now common for a single current CPU core to -be to able to issue enough transactions to keep the log buffers full and under -IO permanently. Hence the XFS journalling subsystem can be considered to be IO -bound. - -Delayed Logging: Concepts -========================= - -The key thing to note about the asynchronous logging combined with the -relogging technique XFS uses is that we can be relogging changed objects -multiple times before they are committed to disk in the log buffers. If we -return to the previous relogging example, it is entirely possible that -transactions A through D are committed to disk in the same log buffer. - -That is, a single log buffer may contain multiple copies of the same object, -but only one of those copies needs to be there - the last one "D", as it -contains all the changes from the previous changes. In other words, we have one -necessary copy in the log buffer, and three stale copies that are simply -wasting space. When we are doing repeated operations on the same set of -objects, these "stale objects" can be over 90% of the space used in the log -buffers. It is clear that reducing the number of stale objects written to the -log would greatly reduce the amount of metadata we write to the log, and this -is the fundamental goal of delayed logging. - -From a conceptual point of view, XFS is already doing relogging in memory (where -memory == log buffer), only it is doing it extremely inefficiently. It is using -logical to physical formatting to do the relogging because there is no -infrastructure to keep track of logical changes in memory prior to physically -formatting the changes in a transaction to the log buffer. Hence we cannot avoid -accumulating stale objects in the log buffers. - -Delayed logging is the name we've given to keeping and tracking transactional -changes to objects in memory outside the log buffer infrastructure. Because of -the relogging concept fundamental to the XFS journalling subsystem, this is -actually relatively easy to do - all the changes to logged items are already -tracked in the current infrastructure. The big problem is how to accumulate -them and get them to the log in a consistent, recoverable manner. -Describing the problems and how they have been solved is the focus of this -document. 
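-
-Before moving on to the detailed design, the following sketch shows what
-"keeping the changes in memory outside the log buffers" amounts to in
-principle. The structure and function names here are invented for
-illustration and do not correspond to the kernel implementation::
-
-    /* Hypothetical record of the latest formatted changes for one item. */
-    struct formatted_changes {
-        void    *buf;       /* copy of the item's dirty regions */
-        size_t  len;        /* log space the copy will consume */
-    };
-
-    /* At each transaction commit, while the item is still locked: */
-    format_item_changes(item, &new_copy);
-    replace_tracked_copy(item, &new_copy);  /* older copies become stale */
-
-Only the most recent copy ever needs to reach the log, which is exactly the
-source of the savings described above.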
- -One of the key changes that delayed logging makes to the operation of the -journalling subsystem is that it disassociates the amount of outstanding -metadata changes from the size and number of log buffers available. In other -words, instead of there only being a maximum of 2MB of transaction changes not -written to the log at any point in time, there may be a much greater amount -being accumulated in memory. Hence the potential for loss of metadata on a -crash is much greater than for the existing logging mechanism. - -It should be noted that this does not change the guarantee that log recovery -will result in a consistent filesystem. What it does mean is that as far as the -recovered filesystem is concerned, there may be many thousands of transactions -that simply did not occur as a result of the crash. This makes it even more -important that applications that care about their data use fsync() where they -need to ensure application level data integrity is maintained. - -It should be noted that delayed logging is not an innovative new concept that -warrants rigorous proofs to determine whether it is correct or not. The method -of accumulating changes in memory for some period before writing them to the -log is used effectively in many filesystems including ext3 and ext4. Hence -no time is spent in this document trying to convince the reader that the -concept is sound. Instead it is simply considered a "solved problem" and as -such implementing it in XFS is purely an exercise in software engineering. - -The fundamental requirements for delayed logging in XFS are simple: - - 1. Reduce the amount of metadata written to the log by at least - an order of magnitude. - 2. Supply sufficient statistics to validate Requirement #1. - 3. Supply sufficient new tracing infrastructure to be able to debug - problems with the new code. - 4. No on-disk format change (metadata or log format). - 5. Enable and disable with a mount option. - 6. No performance regressions for synchronous transaction workloads. - -Delayed Logging: Design -======================= - -Storing Changes ---------------- - -The problem with accumulating changes at a logical level (i.e. just using the -existing log item dirty region tracking) is that when it comes to writing the -changes to the log buffers, we need to ensure that the object we are formatting -is not changing while we do this. This requires locking the object to prevent -concurrent modification. Hence flushing the logical changes to the log would -require us to lock every object, format them, and then unlock them again. - -This introduces lots of scope for deadlocks with transactions that are already -running. For example, a transaction has object A locked and modified, but needs -the delayed logging tracking lock to commit the transaction. However, the -flushing thread has the delayed logging tracking lock already held, and is -trying to get the lock on object A to flush it to the log buffer. This appears -to be an unsolvable deadlock condition, and it was solving this problem that -was the barrier to implementing delayed logging for so long. - -The solution is relatively simple - it just took a long time to recognise it. -Put simply, the current logging code formats the changes to each item into an -vector array that points to the changed regions in the item. The log write code -simply copies the memory these vectors point to into the log buffer during -transaction commit while the item is locked in the transaction. 
Instead of -using the log buffer as the destination of the formatting code, we can use an -allocated memory buffer big enough to fit the formatted vector. - -If we then copy the vector into the memory buffer and rewrite the vector to -point to the memory buffer rather than the object itself, we now have a copy of -the changes in a format that is compatible with the log buffer writing code. -that does not require us to lock the item to access. This formatting and -rewriting can all be done while the object is locked during transaction commit, -resulting in a vector that is transactionally consistent and can be accessed -without needing to lock the owning item. - -Hence we avoid the need to lock items when we need to flush outstanding -asynchronous transactions to the log. The differences between the existing -formatting method and the delayed logging formatting can be seen in the -diagram below. - -Current format log vector:: - - Object +---------------------------------------------+ - Vector 1 +----+ - Vector 2 +----+ - Vector 3 +----------+ - -After formatting:: - - Log Buffer +-V1-+-V2-+----V3----+ - -Delayed logging vector:: - - Object +---------------------------------------------+ - Vector 1 +----+ - Vector 2 +----+ - Vector 3 +----------+ - -After formatting:: - - Memory Buffer +-V1-+-V2-+----V3----+ - Vector 1 +----+ - Vector 2 +----+ - Vector 3 +----------+ - -The memory buffer and associated vector need to be passed as a single object, -but still need to be associated with the parent object so if the object is -relogged we can replace the current memory buffer with a new memory buffer that -contains the latest changes. - -The reason for keeping the vector around after we've formatted the memory -buffer is to support splitting vectors across log buffer boundaries correctly. -If we don't keep the vector around, we do not know where the region boundaries -are in the item, so we'd need a new encapsulation method for regions in the log -buffer writing (i.e. double encapsulation). This would be an on-disk format -change and as such is not desirable. It also means we'd have to write the log -region headers in the formatting stage, which is problematic as there is per -region state that needs to be placed into the headers during the log write. - -Hence we need to keep the vector, but by attaching the memory buffer to it and -rewriting the vector addresses to point at the memory buffer we end up with a -self-describing object that can be passed to the log buffer write code to be -handled in exactly the same manner as the existing log vectors are handled. -Hence we avoid needing a new on-disk format to handle items that have been -relogged in memory. - - -Tracking Changes ----------------- - -Now that we can record transactional changes in memory in a form that allows -them to be used without limitations, we need to be able to track and accumulate -them so that they can be written to the log at some later point in time. The -log item is the natural place to store this vector and buffer, and also makes sense -to be the object that is used to track committed objects as it will always -exist once the object has been included in a transaction. - -The log item is already used to track the log items that have been written to -the log but not yet written to disk. Such log items are considered "active" -and as such are stored in the Active Item List (AIL) which is a LSN-ordered -double linked list. 
Items are inserted into this list during log buffer IO -completion, after which they are unpinned and can be written to disk. An object -that is in the AIL can be relogged, which causes the object to be pinned again -and then moved forward in the AIL when the log buffer IO completes for that -transaction. - -Essentially, this shows that an item that is in the AIL can still be modified -and relogged, so any tracking must be separate to the AIL infrastructure. As -such, we cannot reuse the AIL list pointers for tracking committed items, nor -can we store state in any field that is protected by the AIL lock. Hence the -committed item tracking needs its own locks, lists and state fields in the log -item. - -Similar to the AIL, tracking of committed items is done through a new list -called the Committed Item List (CIL). The list tracks log items that have been -committed and have formatted memory buffers attached to them. It tracks objects -in transaction commit order, so when an object is relogged it is removed from -its place in the list and re-inserted at the tail. This is entirely arbitrary -and done to make it easy for debugging - the last items in the list are the -ones that are most recently modified. Ordering of the CIL is not necessary for -transactional integrity (as discussed in the next section) so the ordering is -done for convenience/sanity of the developers. - - -Delayed Logging: Checkpoints ----------------------------- - -When we have a log synchronisation event, commonly known as a "log force", -all the items in the CIL must be written into the log via the log buffers. -We need to write these items in the order that they exist in the CIL, and they -need to be written as an atomic transaction. The need for all the objects to be -written as an atomic transaction comes from the requirements of relogging and -log replay - all the changes in all the objects in a given transaction must -either be completely replayed during log recovery, or not replayed at all. If -a transaction is not replayed because it is not complete in the log, then -no later transactions should be replayed, either. - -To fulfill this requirement, we need to write the entire CIL in a single log -transaction. Fortunately, the XFS log code has no fixed limit on the size of a -transaction, nor does the log replay code. The only fundamental limit is that -the transaction cannot be larger than just under half the size of the log. The -reason for this limit is that to find the head and tail of the log, there must -be at least one complete transaction in the log at any given time. If a -transaction is larger than half the log, then there is the possibility that a -crash during the write of a such a transaction could partially overwrite the -only complete previous transaction in the log. This will result in a recovery -failure and an inconsistent filesystem and hence we must enforce the maximum -size of a checkpoint to be slightly less than a half the log. - -Apart from this size requirement, a checkpoint transaction looks no different -to any other transaction - it contains a transaction header, a series of -formatted log items and a commit record at the tail. From a recovery -perspective, the checkpoint transaction is also no different - just a lot -bigger with a lot more items in it. The worst case effect of this is that we -might need to tune the recovery transaction object hash size. 
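-
-The size rule above amounts to a simple bound. The names and numbers below
-are illustrative only::
-
-    /*
-     * To guarantee that at least one complete transaction always remains
-     * intact in the log (so the head and tail can be found), a checkpoint
-     * is capped at just under half the physical log size:
-     */
-    max_checkpoint_size = (log_size / 2) - slack;
-
-    /* e.g. for a 128MB log, a checkpoint must stay just under 64MB */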
- -Because the checkpoint is just another transaction and all the changes to log -items are stored as log vectors, we can use the existing log buffer writing -code to write the changes into the log. To do this efficiently, we need to -minimise the time we hold the CIL locked while writing the checkpoint -transaction. The current log write code enables us to do this easily with the -way it separates the writing of the transaction contents (the log vectors) from -the transaction commit record, but tracking this requires us to have a -per-checkpoint context that travels through the log write process through to -checkpoint completion. - -Hence a checkpoint has a context that tracks the state of the current -checkpoint from initiation to checkpoint completion. A new context is initiated -at the same time a checkpoint transaction is started. That is, when we remove -all the current items from the CIL during a checkpoint operation, we move all -those changes into the current checkpoint context. We then initialise a new -context and attach that to the CIL for aggregation of new transactions. - -This allows us to unlock the CIL immediately after transfer of all the -committed items and effectively allows new transactions to be issued while we -are formatting the checkpoint into the log. It also allows concurrent -checkpoints to be written into the log buffers in the case of log force heavy -workloads, just like the existing transaction commit code does. This, however, -requires that we strictly order the commit records in the log so that -checkpoint sequence order is maintained during log replay. - -To ensure that we can be writing an item into a checkpoint transaction at -the same time another transaction modifies the item and inserts the log item -into the new CIL, then checkpoint transaction commit code cannot use log items -to store the list of log vectors that need to be written into the transaction. -Hence log vectors need to be able to be chained together to allow them to be -detached from the log items. That is, when the CIL is flushed the memory -buffer and log vector attached to each log item needs to be attached to the -checkpoint context so that the log item can be released. In diagrammatic form, -the CIL would look like this before the flush:: - - CIL Head - | - V - Log Item <-> log vector 1 -> memory buffer - | -> vector array - V - Log Item <-> log vector 2 -> memory buffer - | -> vector array - V - ...... - | - V - Log Item <-> log vector N-1 -> memory buffer - | -> vector array - V - Log Item <-> log vector N -> memory buffer - -> vector array - -And after the flush the CIL head is empty, and the checkpoint context log -vector list would look like:: - - Checkpoint Context - | - V - log vector 1 -> memory buffer - | -> vector array - | -> Log Item - V - log vector 2 -> memory buffer - | -> vector array - | -> Log Item - V - ...... - | - V - log vector N-1 -> memory buffer - | -> vector array - | -> Log Item - V - log vector N -> memory buffer - -> vector array - -> Log Item - -Once this transfer is done, the CIL can be unlocked and new transactions can -start, while the checkpoint flush code works over the log vector chain to -commit the checkpoint. - -Once the checkpoint is written into the log buffers, the checkpoint context is -attached to the log buffer that the commit record was written to along with a -completion callback. Log IO completion will call that callback, which can then -run transaction committed processing for the log items (i.e. 
insert into AIL -and unpin) in the log vector chain and then free the log vector chain and -checkpoint context. - -Discussion Point: I am uncertain as to whether the log item is the most -efficient way to track vectors, even though it seems like the natural way to do -it. The fact that we walk the log items (in the CIL) just to chain the log -vectors and break the link between the log item and the log vector means that -we take a cache line hit for the log item list modification, then another for -the log vector chaining. If we track by the log vectors, then we only need to -break the link between the log item and the log vector, which means we should -dirty only the log item cachelines. Normally I wouldn't be concerned about one -vs two dirty cachelines except for the fact I've seen upwards of 80,000 log -vectors in one checkpoint transaction. I'd guess this is a "measure and -compare" situation that can be done after a working and reviewed implementation -is in the dev tree.... - -Delayed Logging: Checkpoint Sequencing --------------------------------------- - -One of the key aspects of the XFS transaction subsystem is that it tags -committed transactions with the log sequence number of the transaction commit. -This allows transactions to be issued asynchronously even though there may be -future operations that cannot be completed until that transaction is fully -committed to the log. In the rare case that a dependent operation occurs (e.g. -re-using a freed metadata extent for a data extent), a special, optimised log -force can be issued to force the dependent transaction to disk immediately. - -To do this, transactions need to record the LSN of the commit record of the -transaction. This LSN comes directly from the log buffer the transaction is -written into. While this works just fine for the existing transaction -mechanism, it does not work for delayed logging because transactions are not -written directly into the log buffers. Hence some other method of sequencing -transactions is required. - -As discussed in the checkpoint section, delayed logging uses per-checkpoint -contexts, and as such it is simple to assign a sequence number to each -checkpoint. Because the switching of checkpoint contexts must be done -atomically, it is simple to ensure that each new context has a monotonically -increasing sequence number assigned to it without the need for an external -atomic counter - we can just take the current context sequence number and add -one to it for the new context. - -Then, instead of assigning a log buffer LSN to the transaction commit LSN -during the commit, we can assign the current checkpoint sequence. This allows -operations that track transactions that have not yet completed know what -checkpoint sequence needs to be committed before they can continue. As a -result, the code that forces the log to a specific LSN now needs to ensure that -the log forces to a specific checkpoint. - -To ensure that we can do this, we need to track all the checkpoint contexts -that are currently committing to the log. When we flush a checkpoint, the -context gets added to a "committing" list which can be searched. When a -checkpoint commit completes, it is removed from the committing list. Because -the checkpoint context records the LSN of the commit record for the checkpoint, -we can also wait on the log buffer that contains the commit record, thereby -using the existing log force mechanisms to execute synchronous forces. 
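-
-In outline, forcing the log to a given checkpoint sequence then looks
-something like the following. This is illustrative pseudo-code only; the
-names do not match the kernel implementation::
-
-    force_log_to_sequence(seq)
-    {
-        /* target sequence still being aggregated? start its checkpoint */
-        if (current_ctx->sequence == seq)
-            flush_cil();
-
-        /* wait for the target and all prior checkpoints to commit */
-        for_each_context_on_committing_list(ctx) {
-            if (ctx->sequence <= seq)
-                wait_for_commit_record(ctx);
-        }
-    }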
- -It should be noted that the synchronous forces may need to be extended with -mitigation algorithms similar to the current log buffer code to allow -aggregation of multiple synchronous transactions if there are already -synchronous transactions being flushed. Investigation of the performance of the -current design is needed before making any decisions here. - -The main concern with log forces is to ensure that all the previous checkpoints -are also committed to disk before the one we need to wait for. Therefore we -need to check that all the prior contexts in the committing list are also -complete before waiting on the one we need to complete. We do this -synchronisation in the log force code so that we don't need to wait anywhere -else for such serialisation - it only matters when we do a log force. - -The only remaining complexity is that a log force now also has to handle the -case where the forcing sequence number is the same as the current context. That -is, we need to flush the CIL and potentially wait for it to complete. This is a -simple addition to the existing log forcing code to check the sequence numbers -and push if required. Indeed, placing the current sequence checkpoint flush in -the log force code enables the current mechanism for issuing synchronous -transactions to remain untouched (i.e. commit an asynchronous transaction, then -force the log at the LSN of that transaction) and so the higher level code -behaves the same regardless of whether delayed logging is being used or not. - -Delayed Logging: Checkpoint Log Space Accounting ------------------------------------------------- - -The big issue for a checkpoint transaction is the log space reservation for the -transaction. We don't know how big a checkpoint transaction is going to be -ahead of time, nor how many log buffers it will take to write out, nor the -number of split log vector regions are going to be used. We can track the -amount of log space required as we add items to the commit item list, but we -still need to reserve the space in the log for the checkpoint. - -A typical transaction reserves enough space in the log for the worst case space -usage of the transaction. The reservation accounts for log record headers, -transaction and region headers, headers for split regions, buffer tail padding, -etc. as well as the actual space for all the changed metadata in the -transaction. While some of this is fixed overhead, much of it is dependent on -the size of the transaction and the number of regions being logged (the number -of log vectors in the transaction). - -An example of the differences would be logging directory changes versus logging -inode changes. If you modify lots of inode cores (e.g. ``chmod -R g+w *``), then -there are lots of transactions that only contain an inode core and an inode log -format structure. That is, two vectors totaling roughly 150 bytes. If we modify -10,000 inodes, we have about 1.5MB of metadata to write in 20,000 vectors. Each -vector is 12 bytes, so the total to be logged is approximately 1.75MB. In -comparison, if we are logging full directory buffers, they are typically 4KB -each, so we in 1.5MB of directory buffers we'd have roughly 400 buffers and a -buffer format structure for each buffer - roughly 800 vectors or 1.51MB total -space. From this, it should be obvious that a static log space reservation is -not particularly flexible and is difficult to select the "optimal value" for -all workloads. 
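-
-Laying the comparison above out explicitly (these are the same rough
-estimates used in the text, not exact on-disk numbers)::
-
-    inode core changes (e.g. chmod -R g+w *):
-        10,000 inodes x ~150 bytes              ~1.5MB of logged metadata
-        20,000 vectors x 12 bytes               ~240KB of vector overhead
-                                                ~1.75MB total
-
-    directory buffer changes:
-        1.5MB / 4KB per buffer                  ~400 buffers
-        ~400 buffers + ~400 format structures   ~800 vectors
-                                                ~1.51MB total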
- -Further, if we are going to use a static reservation, which bit of the entire -reservation does it cover? We account for space used by the transaction -reservation by tracking the space currently used by the object in the CIL and -then calculating the increase or decrease in space used as the object is -relogged. This allows for a checkpoint reservation to only have to account for -log buffer metadata used such as log header records. - -However, even using a static reservation for just the log metadata is -problematic. Typically log record headers use at least 16KB of log space per -1MB of log space consumed (512 bytes per 32k) and the reservation needs to be -large enough to handle arbitrary sized checkpoint transactions. This -reservation needs to be made before the checkpoint is started, and we need to -be able to reserve the space without sleeping. For a 8MB checkpoint, we need a -reservation of around 150KB, which is a non-trivial amount of space. - -A static reservation needs to manipulate the log grant counters - we can take a -permanent reservation on the space, but we still need to make sure we refresh -the write reservation (the actual space available to the transaction) after -every checkpoint transaction completion. Unfortunately, if this space is not -available when required, then the regrant code will sleep waiting for it. - -The problem with this is that it can lead to deadlocks as we may need to commit -checkpoints to be able to free up log space (refer back to the description of -rolling transactions for an example of this). Hence we *must* always have -space available in the log if we are to use static reservations, and that is -very difficult and complex to arrange. It is possible to do, but there is a -simpler way. - -The simpler way of doing this is tracking the entire log space used by the -items in the CIL and using this to dynamically calculate the amount of log -space required by the log metadata. If this log metadata space changes as a -result of a transaction commit inserting a new memory buffer into the CIL, then -the difference in space required is removed from the transaction that causes -the change. Transactions at this level will *always* have enough space -available in their reservation for this as they have already reserved the -maximal amount of log metadata space they require, and such a delta reservation -will always be less than or equal to the maximal amount in the reservation. - -Hence we can grow the checkpoint transaction reservation dynamically as items -are added to the CIL and avoid the need for reserving and regranting log space -up front. This avoids deadlocks and removes a blocking point from the -checkpoint flush code. - -As mentioned early, transactions can't grow to more than half the size of the -log. Hence as part of the reservation growing, we need to also check the size -of the reservation against the maximum allowed transaction size. If we reach -the maximum threshold, we need to push the CIL to the log. This is effectively -a "background flush" and is done on demand. This is identical to -a CIL push triggered by a log force, only that there is no waiting for the -checkpoint commit to complete. This background push is checked and executed by -transaction commit code. - -If the transaction subsystem goes idle while we still have items in the CIL, -they will be flushed by the periodic log force issued by the xfssyncd. 
This log -force will push the CIL to disk, and if the transaction subsystem stays idle, -allow the idle log to be covered (effectively marked clean) in exactly the same -manner that is done for the existing logging method. A discussion point is -whether this log force needs to be done more frequently than the current rate -which is once every 30s. - - -Delayed Logging: Log Item Pinning ---------------------------------- - -Currently log items are pinned during transaction commit while the items are -still locked. This happens just after the items are formatted, though it could -be done any time before the items are unlocked. The result of this mechanism is -that items get pinned once for every transaction that is committed to the log -buffers. Hence items that are relogged in the log buffers will have a pin count -for every outstanding transaction they were dirtied in. When each of these -transactions is completed, they will unpin the item once. As a result, the item -only becomes unpinned when all the transactions complete and there are no -pending transactions. Thus the pinning and unpinning of a log item is symmetric -as there is a 1:1 relationship with transaction commit and log item completion. - -For delayed logging, however, we have an asymmetric transaction commit to -completion relationship. Every time an object is relogged in the CIL it goes -through the commit process without a corresponding completion being registered. -That is, we now have a many-to-one relationship between transaction commit and -log item completion. The result of this is that pinning and unpinning of the -log items becomes unbalanced if we retain the "pin on transaction commit, unpin -on transaction completion" model. - -To keep pin/unpin symmetry, the algorithm needs to change to a "pin on -insertion into the CIL, unpin on checkpoint completion". In other words, the -pinning and unpinning becomes symmetric around a checkpoint context. We have to -pin the object the first time it is inserted into the CIL - if it is already in -the CIL during a transaction commit, then we do not pin it again. Because there -can be multiple outstanding checkpoint contexts, we can still see elevated pin -counts, but as each checkpoint completes the pin count will retain the correct -value according to its context. - -Just to make matters slightly more complex, this checkpoint level context -for the pin count means that the pinning of an item must take place under the -CIL commit/flush lock. If we pin the object outside this lock, we cannot -guarantee which context the pin count is associated with. This is because of -the fact pinning the item is dependent on whether the item is present in the -current CIL or not. If we don't pin the CIL first before we check and pin the -object, we have a race with CIL being flushed between the check and the pin -(or not pinning, as the case may be). Hence we must hold the CIL flush/commit -lock to guarantee that we pin the items correctly. - -Delayed Logging: Concurrent Scalability ---------------------------------------- - -A fundamental requirement for the CIL is that accesses through transaction -commits must scale to many concurrent commits. The current transaction commit -code does not break down even when there are transactions coming from 2048 -processors at once. The current transaction code does not go any faster than if -there was only one CPU using it, but it does not slow down either. 
- -As a result, the delayed logging transaction commit code needs to be designed -for concurrency from the ground up. It is obvious that there are serialisation -points in the design - the three important ones are: - - 1. Locking out new transaction commits while flushing the CIL - 2. Adding items to the CIL and updating item space accounting - 3. Checkpoint commit ordering - -Looking at the transaction commit and CIL flushing interactions, it is clear -that we have a many-to-one interaction here. That is, the only restriction on -the number of concurrent transactions that can be trying to commit at once is -the amount of space available in the log for their reservations. The practical -limit here is in the order of several hundred concurrent transactions for a -128MB log, which means that it is generally one per CPU in a machine. - -The amount of time a transaction commit needs to hold out a flush is a -relatively long period of time - the pinning of log items needs to be done -while we are holding out a CIL flush, so at the moment that means it is held -across the formatting of the objects into memory buffers (i.e. while memcpy()s -are in progress). Ultimately a two pass algorithm where the formatting is done -separately to the pinning of objects could be used to reduce the hold time of -the transaction commit side. - -Because of the number of potential transaction commit side holders, the lock -really needs to be a sleeping lock - if the CIL flush takes the lock, we do not -want every other CPU in the machine spinning on the CIL lock. Given that -flushing the CIL could involve walking a list of tens of thousands of log -items, it will get held for a significant time and so spin contention is a -significant concern. Preventing lots of CPUs spinning doing nothing is the -main reason for choosing a sleeping lock even though nothing in either the -transaction commit or CIL flush side sleeps with the lock held. - -It should also be noted that CIL flushing is also a relatively rare operation -compared to transaction commit for asynchronous transaction workloads - only -time will tell if using a read-write semaphore for exclusion will limit -transaction commit concurrency due to cache line bouncing of the lock on the -read side. - -The second serialisation point is on the transaction commit side where items -are inserted into the CIL. Because transactions can enter this code -concurrently, the CIL needs to be protected separately from the above -commit/flush exclusion. It also needs to be an exclusive lock but it is only -held for a very short time and so a spin lock is appropriate here. It is -possible that this lock will become a contention point, but given the short -hold time once per transaction I think that contention is unlikely. - -The final serialisation point is the checkpoint commit record ordering code -that is run as part of the checkpoint commit and log force sequencing. The code -path that triggers a CIL flush (i.e. whatever triggers the log force) will enter -an ordering loop after writing all the log vectors into the log buffers but -before writing the commit record. This loop walks the list of committing -checkpoints and needs to block waiting for checkpoints to complete their commit -record write. As a result it needs a lock and a wait variable. Log force -sequencing also requires the same lock, list walk, and blocking mechanism to -ensure completion of checkpoints. 
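-
-In outline, the commit record ordering walk described above looks something
-like this (illustrative pseudo-code only, not the kernel implementation)::
-
-    /* log vectors for this checkpoint are written; commit record is not */
-    for_each_context_on_committing_list(ctx) {
-        if (ctx->sequence < current_ctx->sequence)
-            wait_for_commit_record_write(ctx);
-    }
-    write_commit_record(current_ctx);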
- -These two sequencing operations can use the mechanism even though the -events they are waiting for are different. The checkpoint commit record -sequencing needs to wait until checkpoint contexts contain a commit LSN -(obtained through completion of a commit record write) while log force -sequencing needs to wait until previous checkpoint contexts are removed from -the committing list (i.e. they've completed). A simple wait variable and -broadcast wakeups (thundering herds) has been used to implement these two -serialisation queues. They use the same lock as the CIL, too. If we see too -much contention on the CIL lock, or too many context switches as a result of -the broadcast wakeups these operations can be put under a new spinlock and -given separate wait lists to reduce lock contention and the number of processes -woken by the wrong event. - - -Lifecycle Changes ------------------ - -The existing log item life cycle is as follows:: - - 1. Transaction allocate - 2. Transaction reserve - 3. Lock item - 4. Join item to transaction - If not already attached, - Allocate log item - Attach log item to owner item - Attach log item to transaction - 5. Modify item - Record modifications in log item - 6. Transaction commit - Pin item in memory - Format item into log buffer - Write commit LSN into transaction - Unlock item - Attach transaction to log buffer - - - - - 7. Transaction completion - Mark log item committed - Insert log item into AIL - Write commit LSN into log item - Unpin log item - 8. AIL traversal - Lock item - Mark log item clean - Flush item to disk - - - - 9. Log item removed from AIL - Moves log tail - Item unlocked - -Essentially, steps 1-6 operate independently from step 7, which is also -independent of steps 8-9. An item can be locked in steps 1-6 or steps 8-9 -at the same time step 7 is occurring, but only steps 1-6 or 8-9 can occur -at the same time. If the log item is in the AIL or between steps 6 and 7 -and steps 1-6 are re-entered, then the item is relogged. Only when steps 8-9 -are entered and completed is the object considered clean. - -With delayed logging, there are new steps inserted into the life cycle:: - - 1. Transaction allocate - 2. Transaction reserve - 3. Lock item - 4. Join item to transaction - If not already attached, - Allocate log item - Attach log item to owner item - Attach log item to transaction - 5. Modify item - Record modifications in log item - 6. Transaction commit - Pin item in memory if not pinned in CIL - Format item into log vector + buffer - Attach log vector and buffer to log item - Insert log item into CIL - Write CIL context sequence into transaction - Unlock item - - - - 7. CIL push - lock CIL flush - Chain log vectors and buffers together - Remove items from CIL - unlock CIL flush - write log vectors into log - sequence commit records - attach checkpoint context to log buffer - - - - - 8. Checkpoint completion - Mark log item committed - Insert item into AIL - Write commit LSN into log item - Unpin log item - 9. AIL traversal - Lock item - Mark log item clean - Flush item to disk - - 10. Log item removed from AIL - Moves log tail - Item unlocked - -From this, it can be seen that the only life cycle differences between the two -logging methods are in the middle of the life cycle - they still have the same -beginning and end and execution constraints. The only differences are in the -committing of the log items to the log itself and the completion processing. 
-Hence delayed logging should not introduce any constraints on log item -behaviour, allocation or freeing that don't already exist. - -As a result of this zero-impact "insertion" of delayed logging infrastructure -and the design of the internal structures to avoid on disk format changes, we -can basically switch between delayed logging and the existing mechanism with a -mount option. Fundamentally, there is no reason why the log manager would not -be able to swap methods automatically and transparently depending on load -characteristics, but this should not be necessary if delayed logging works as -designed. diff --git a/Documentation/filesystems/xfs-maintainer-entry-profile.rst b/Documentation/filesystems/xfs-maintainer-entry-profile.rst deleted file mode 100644 index 32b6ac4ca9d6..000000000000 --- a/Documentation/filesystems/xfs-maintainer-entry-profile.rst +++ /dev/null @@ -1,194 +0,0 @@ -XFS Maintainer Entry Profile -============================ - -Overview --------- -XFS is a well known high-performance filesystem in the Linux kernel. -The aim of this project is to provide and maintain a robust and -performant filesystem. - -Patches are generally merged to the for-next branch of the appropriate -git repository. -After a testing period, the for-next branch is merged to the master -branch. - -Kernel code are merged to the xfs-linux tree[0]. -Userspace code are merged to the xfsprogs tree[1]. -Test cases are merged to the xfstests tree[2]. -Ondisk format documentation are merged to the xfs-documentation tree[3]. - -All patchsets involving XFS *must* be cc'd in their entirety to the mailing -list linux-xfs@vger.kernel.org. - -Roles ------ -There are eight key roles in the XFS project. -A person can take on multiple roles, and a role can be filled by -multiple people. -Anyone taking on a role is advised to check in with themselves and -others on a regular basis about burnout. - -- **Outside Contributor**: Anyone who sends a patch but is not involved - in the XFS project on a regular basis. - These folks are usually people who work on other filesystems or - elsewhere in the kernel community. - -- **Developer**: Someone who is familiar with the XFS codebase enough to - write new code, documentation, and tests. - - Developers can often be found in the IRC channel mentioned by the ``C:`` - entry in the kernel MAINTAINERS file. - -- **Senior Developer**: A developer who is very familiar with at least - some part of the XFS codebase and/or other subsystems in the kernel. - These people collectively decide the long term goals of the project - and nudge the community in that direction. - They should help prioritize development and review work for each release - cycle. - - Senior developers tend to be more active participants in the IRC channel. - -- **Reviewer**: Someone (most likely also a developer) who reads code - submissions to decide: - - 0. Is the idea behind the contribution sound? - 1. Does the idea fit the goals of the project? - 2. Is the contribution designed correctly? - 3. Is the contribution polished? - 4. Can the contribution be tested effectively? - - Reviewers should identify themselves with an ``R:`` entry in the kernel - and fstests MAINTAINERS files. - -- **Testing Lead**: This person is responsible for setting the test - coverage goals of the project, negotiating with developers to decide - on new tests for new features, and making sure that developers and - release managers execute on the testing. 
- - The testing lead should identify themselves with an ``M:`` entry in - the XFS section of the fstests MAINTAINERS file. - -- **Bug Triager**: Someone who examines incoming bug reports in just - enough detail to identify the person to whom the report should be - forwarded. - - The bug triagers should identify themselves with a ``B:`` entry in - the kernel MAINTAINERS file. - -- **Release Manager**: This person merges reviewed patchsets into an - integration branch, tests the result locally, pushes the branch to a - public git repository, and sends pull requests further upstream. - The release manager is not expected to work on new feature patchsets. - If a developer and a reviewer fail to reach a resolution on some point, - the release manager must have the ability to intervene to try to drive a - resolution. - - The release manager should identify themselves with an ``M:`` entry in - the kernel MAINTAINERS file. - -- **Community Manager**: This person calls and moderates meetings of as many - XFS participants as they can get when mailing list discussions prove - insufficient for collective decisionmaking. - They may also serve as liaison between managers of the organizations - sponsoring work on any part of XFS. - -- **LTS Maintainer**: Someone who backports and tests bug fixes from - uptream to the LTS kernels. - There tend to be six separate LTS trees at any given time. - - The maintainer for a given LTS release should identify themselves with an - ``M:`` entry in the MAINTAINERS file for that LTS tree. - Unmaintained LTS kernels should be marked with status ``S: Orphan`` in that - same file. - -Submission Checklist Addendum ------------------------------ -Please follow these additional rules when submitting to XFS: - -- Patches affecting only the filesystem itself should be based against - the latest -rc or the for-next branch. - These patches will be merged back to the for-next branch. - -- Authors of patches touching other subsystems need to coordinate with - the maintainers of XFS and the relevant subsystems to decide how to - proceed with a merge. - -- Any patchset changing XFS should be cc'd in its entirety to linux-xfs. - Do not send partial patchsets; that makes analysis of the broader - context of the changes unnecessarily difficult. - -- Anyone making kernel changes that have corresponding changes to the - userspace utilities should send the userspace changes as separate - patchsets immediately after the kernel patchsets. - -- Authors of bug fix patches are expected to use fstests[2] to perform - an A/B test of the patch to determine that there are no regressions. - When possible, a new regression test case should be written for - fstests. - -- Authors of new feature patchsets must ensure that fstests will have - appropriate functional and input corner-case test cases for the new - feature. - -- When implementing a new feature, it is strongly suggested that the - developers write a design document to answer the following questions: - - * **What** problem is this trying to solve? - - * **Who** will benefit from this solution, and **where** will they - access it? - - * **How** will this new feature work? This should touch on major data - structures and algorithms supporting the solution at a higher level - than code comments. - - * **What** userspace interfaces are necessary to build off of the new - features? - - * **How** will this work be tested to ensure that it solves the - problems laid out in the design document without causing new - problems? 
- - The design document should be committed in the kernel documentation - directory. - It may be omitted if the feature is already well known to the - community. - -- Patchsets for the new tests should be submitted as separate patchsets - immediately after the kernel and userspace code patchsets. - -- Changes to the on-disk format of XFS must be described in the ondisk - format document[3] and submitted as a patchset after the fstests - patchsets. - -- Patchsets implementing bug fixes and further code cleanups should put - the bug fixes at the beginning of the series to ease backporting. - -Key Release Cycle Dates ------------------------ -Bug fixes may be sent at any time, though the release manager may decide to -defer a patch when the next merge window is close. - -Code submissions targeting the next merge window should be sent between --rc1 and -rc6. -This gives the community time to review the changes, to suggest other changes, -and for the author to retest those changes. - -Code submissions also requiring changes to fs/iomap and targeting the -next merge window should be sent between -rc1 and -rc4. -This allows the broader kernel community adequate time to test the -infrastructure changes. - -Review Cadence --------------- -In general, please wait at least one week before pinging for feedback. -To find reviewers, either consult the MAINTAINERS file, or ask -developers that have Reviewed-by tags for XFS changes to take a look and -offer their opinion. - -References ----------- -| [0] https://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git/ -| [1] https://git.kernel.org/pub/scm/fs/xfs/xfsprogs-dev.git/ -| [2] https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/ -| [3] https://git.kernel.org/pub/scm/fs/xfs/xfs-documentation.git/ diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs-online-fsck-design.rst deleted file mode 100644 index a0678101a7d0..000000000000 --- a/Documentation/filesystems/xfs-online-fsck-design.rst +++ /dev/null @@ -1,5315 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 -.. _xfs_online_fsck_design: - -.. - Mapping of heading styles within this document: - Heading 1 uses "====" above and below - Heading 2 uses "====" - Heading 3 uses "----" - Heading 4 uses "````" - Heading 5 uses "^^^^" - Heading 6 uses "~~~~" - Heading 7 uses "...." - - Sections are manually numbered because apparently that's what everyone - does in the kernel. - -====================== -XFS Online Fsck Design -====================== - -This document captures the design of the online filesystem check feature for -XFS. -The purpose of this document is threefold: - -- To help kernel distributors understand exactly what the XFS online fsck - feature is, and issues about which they should be aware. - -- To help people reading the code to familiarize themselves with the relevant - concepts and design points before they start digging into the code. - -- To help developers maintaining the system by capturing the reasons - supporting higher level decision making. - -As the online fsck code is merged, the links in this document to topic branches -will be replaced with links to code. - -This document is licensed under the terms of the GNU Public License, v2. -The primary author is Darrick J. Wong. - -This design document is split into seven parts. -Part 1 defines what fsck tools are and the motivations for writing a new one. -Parts 2 and 3 present a high level overview of how online fsck process works -and how it is tested to ensure correct functionality. 
-Part 4 discusses the user interface and the intended usage modes of the new -program. -Parts 5 and 6 show off the high level components and how they fit together, and -then present case studies of how each repair function actually works. -Part 7 sums up what has been discussed so far and speculates about what else -might be built atop online fsck. - -.. contents:: Table of Contents - :local: - -1. What is a Filesystem Check? -============================== - -A Unix filesystem has four main responsibilities: - -- Provide a hierarchy of names through which application programs can associate - arbitrary blobs of data for any length of time, - -- Virtualize physical storage media across those names, and - -- Retrieve the named data blobs at any time. - -- Examine resource usage. - -Metadata directly supporting these functions (e.g. files, directories, space -mappings) are sometimes called primary metadata. -Secondary metadata (e.g. reverse mapping and directory parent pointers) support -operations internal to the filesystem, such as internal consistency checking -and reorganization. -Summary metadata, as the name implies, condense information contained in -primary metadata for performance reasons. - -The filesystem check (fsck) tool examines all the metadata in a filesystem -to look for errors. -In addition to looking for obvious metadata corruptions, fsck also -cross-references different types of metadata records with each other to look -for inconsistencies. -People do not like losing data, so most fsck tools also contains some ability -to correct any problems found. -As a word of caution -- the primary goal of most Linux fsck tools is to restore -the filesystem metadata to a consistent state, not to maximize the data -recovered. -That precedent will not be challenged here. - -Filesystems of the 20th century generally lacked any redundancy in the ondisk -format, which means that fsck can only respond to errors by erasing files until -errors are no longer detected. -More recent filesystem designs contain enough redundancy in their metadata that -it is now possible to regenerate data structures when non-catastrophic errors -occur; this capability aids both strategies. - -+--------------------------------------------------------------------------+ -| **Note**: | -+--------------------------------------------------------------------------+ -| System administrators avoid data loss by increasing the number of | -| separate storage systems through the creation of backups; and they avoid | -| downtime by increasing the redundancy of each storage system through the | -| creation of RAID arrays. | -| fsck tools address only the first problem. | -+--------------------------------------------------------------------------+ - -TLDR; Show Me the Code! ------------------------ - -Code is posted to the kernel.org git trees as follows: -`kernel changes `_, -`userspace changes `_, and -`QA test changes `_. -Each kernel patchset adding an online repair function will use the same branch -name across the kernel, xfsprogs, and fstests git repos. - -Existing Tools --------------- - -The online fsck tool described here will be the third tool in the history of -XFS (on Linux) to check and repair filesystems. -Two programs precede it: - -The first program, ``xfs_check``, was created as part of the XFS debugger -(``xfs_db``) and can only be used with unmounted filesystems. -It walks all metadata in the filesystem looking for inconsistencies in the -metadata, though it lacks any ability to repair what it finds. 
-Due to its high memory requirements and inability to repair things, this -program is now deprecated and will not be discussed further. - -The second program, ``xfs_repair``, was created to be faster and more robust -than the first program. -Like its predecessor, it can only be used with unmounted filesystems. -It uses extent-based in-memory data structures to reduce memory consumption, -and tries to schedule readahead IO appropriately to reduce I/O waiting time -while it scans the metadata of the entire filesystem. -The most important feature of this tool is its ability to respond to -inconsistencies in file metadata and directory tree by erasing things as needed -to eliminate problems. -Space usage metadata are rebuilt from the observed file metadata. - -Problem Statement ------------------ - -The current XFS tools leave several problems unsolved: - -1. **User programs** suddenly **lose access** to the filesystem when unexpected - shutdowns occur as a result of silent corruptions in the metadata. - These occur **unpredictably** and often without warning. - -2. **Users** experience a **total loss of service** during the recovery period - after an **unexpected shutdown** occurs. - -3. **Users** experience a **total loss of service** if the filesystem is taken - offline to **look for problems** proactively. - -4. **Data owners** cannot **check the integrity** of their stored data without - reading all of it. - This may expose them to substantial billing costs when a linear media scan - performed by the storage system administrator might suffice. - -5. **System administrators** cannot **schedule** a maintenance window to deal - with corruptions if they **lack the means** to assess filesystem health - while the filesystem is online. - -6. **Fleet monitoring tools** cannot **automate periodic checks** of filesystem - health when doing so requires **manual intervention** and downtime. - -7. **Users** can be tricked into **doing things they do not desire** when - malicious actors **exploit quirks of Unicode** to place misleading names - in directories. - -Given this definition of the problems to be solved and the actors who would -benefit, the proposed solution is a third fsck tool that acts on a running -filesystem. - -This new third program has three components: an in-kernel facility to check -metadata, an in-kernel facility to repair metadata, and a userspace driver -program to drive fsck activity on a live filesystem. -``xfs_scrub`` is the name of the driver program. -The rest of this document presents the goals and use cases of the new fsck -tool, describes its major design points in connection to those goals, and -discusses the similarities and differences with existing tools. - -+--------------------------------------------------------------------------+ -| **Note**: | -+--------------------------------------------------------------------------+ -| Throughout this document, the existing offline fsck tool can also be | -| referred to by its current name "``xfs_repair``". | -| The userspace driver program for the new online fsck tool can be | -| referred to as "``xfs_scrub``". | -| The kernel portion of online fsck that validates metadata is called | -| "online scrub", and portion of the kernel that fixes metadata is called | -| "online repair". | -+--------------------------------------------------------------------------+ - -The naming hierarchy is broken up into objects known as directories and files -and the physical space is split into pieces known as allocation groups. 
-Sharding enables better performance on highly parallel systems and helps to -contain the damage when corruptions occur. -The division of the filesystem into principal objects (allocation groups and -inodes) means that there are ample opportunities to perform targeted checks and -repairs on a subset of the filesystem. - -While this is going on, other parts continue processing IO requests. -Even if a piece of filesystem metadata can only be regenerated by scanning the -entire system, the scan can still be done in the background while other file -operations continue. - -In summary, online fsck takes advantage of resource sharding and redundant -metadata to enable targeted checking and repair operations while the system -is running. -This capability will be coupled to automatic system management so that -autonomous self-healing of XFS maximizes service availability. - -2. Theory of Operation -====================== - -Because it is necessary for online fsck to lock and scan live metadata objects, -online fsck consists of three separate code components. -The first is the userspace driver program ``xfs_scrub``, which is responsible -for identifying individual metadata items, scheduling work items for them, -reacting to the outcomes appropriately, and reporting results to the system -administrator. -The second and third are in the kernel, which implements functions to check -and repair each type of online fsck work item. - -+------------------------------------------------------------------+ -| **Note**: | -+------------------------------------------------------------------+ -| For brevity, this document shortens the phrase "online fsck work | -| item" to "scrub item". | -+------------------------------------------------------------------+ - -Scrub item types are delineated in a manner consistent with the Unix design -philosophy, which is to say that each item should handle one aspect of a -metadata structure, and handle it well. - -Scope ------ - -In principle, online fsck should be able to check and to repair everything that -the offline fsck program can handle. -However, online fsck cannot be running 100% of the time, which means that -latent errors may creep in after a scrub completes. -If these errors cause the next mount to fail, offline fsck is the only -solution. -This limitation means that maintenance of the offline fsck tool will continue. -A second limitation of online fsck is that it must follow the same resource -sharing and lock acquisition rules as the regular filesystem. -This means that scrub cannot take *any* shortcuts to save time, because doing -so could lead to concurrency problems. -In other words, online fsck is not a complete replacement for offline fsck, and -a complete run of online fsck may take longer than online fsck. -However, both of these limitations are acceptable tradeoffs to satisfy the -different motivations of online fsck, which are to **minimize system downtime** -and to **increase predictability of operation**. - -.. _scrubphases: - -Phases of Work --------------- - -The userspace driver program ``xfs_scrub`` splits the work of checking and -repairing an entire filesystem into seven phases. -Each phase concentrates on checking specific types of scrub items and depends -on the success of all previous phases. -The seven phases are as follows: - -1. Collect geometry information about the mounted filesystem and computer, - discover the online fsck capabilities of the kernel, and open the - underlying storage devices. - -2. 
Check allocation group metadata, all realtime volume metadata, and all quota - files. - Each metadata structure is scheduled as a separate scrub item. - If corruption is found in the inode header or inode btree and ``xfs_scrub`` - is permitted to perform repairs, then those scrub items are repaired to - prepare for phase 3. - Repairs are implemented by using the information in the scrub item to - resubmit the kernel scrub call with the repair flag enabled; this is - discussed in the next section. - Optimizations and all other repairs are deferred to phase 4. - -3. Check all metadata of every file in the filesystem. - Each metadata structure is also scheduled as a separate scrub item. - If repairs are needed and ``xfs_scrub`` is permitted to perform repairs, - and there were no problems detected during phase 2, then those scrub items - are repaired immediately. - Optimizations, deferred repairs, and unsuccessful repairs are deferred to - phase 4. - -4. All remaining repairs and scheduled optimizations are performed during this - phase, if the caller permits them. - Before starting repairs, the summary counters are checked and any necessary - repairs are performed so that subsequent repairs will not fail the resource - reservation step due to wildly incorrect summary counters. - Unsuccessful repairs are requeued as long as forward progress on repairs is - made somewhere in the filesystem. - Free space in the filesystem is trimmed at the end of phase 4 if the - filesystem is clean. - -5. By the start of this phase, all primary and secondary filesystem metadata - must be correct. - Summary counters such as the free space counts and quota resource counts - are checked and corrected. - Directory entry names and extended attribute names are checked for - suspicious entries such as control characters or confusing Unicode sequences - appearing in names. - -6. If the caller asks for a media scan, read all allocated and written data - file extents in the filesystem. - The ability to use hardware-assisted data file integrity checking is new - to online fsck; neither of the previous tools have this capability. - If media errors occur, they will be mapped to the owning files and reported. - -7. Re-check the summary counters and presents the caller with a summary of - space usage and file counts. - -This allocation of responsibilities will be :ref:`revisited ` -later in this document. - -Steps for Each Scrub Item -------------------------- - -The kernel scrub code uses a three-step strategy for checking and repairing -the one aspect of a metadata object represented by a scrub item: - -1. The scrub item of interest is checked for corruptions; opportunities for - optimization; and for values that are directly controlled by the system - administrator but look suspicious. - If the item is not corrupt or does not need optimization, resource are - released and the positive scan results are returned to userspace. - If the item is corrupt or could be optimized but the caller does not permit - this, resources are released and the negative scan results are returned to - userspace. - Otherwise, the kernel moves on to the second step. - -2. The repair function is called to rebuild the data structure. - Repair functions generally choose rebuild a structure from other metadata - rather than try to salvage the existing structure. - If the repair fails, the scan results from the first step are returned to - userspace. - Otherwise, the kernel moves on to the third step. - -3. 
In the third step, the kernel runs the same checks over the new metadata - item to assess the efficacy of the repairs. - The results of the reassessment are returned to userspace. - -Classification of Metadata --------------------------- - -Each type of metadata object (and therefore each type of scrub item) is -classified as follows: - -Primary Metadata -```````````````` - -Metadata structures in this category should be most familiar to filesystem -users either because they are directly created by the user or they index -objects created by the user -Most filesystem objects fall into this class: - -- Free space and reference count information - -- Inode records and indexes - -- Storage mapping information for file data - -- Directories - -- Extended attributes - -- Symbolic links - -- Quota limits - -Scrub obeys the same rules as regular filesystem accesses for resource and lock -acquisition. - -Primary metadata objects are the simplest for scrub to process. -The principal filesystem object (either an allocation group or an inode) that -owns the item being scrubbed is locked to guard against concurrent updates. -The check function examines every record associated with the type for obvious -errors and cross-references healthy records against other metadata to look for -inconsistencies. -Repairs for this class of scrub item are simple, since the repair function -starts by holding all the resources acquired in the previous step. -The repair function scans available metadata as needed to record all the -observations needed to complete the structure. -Next, it stages the observations in a new ondisk structure and commits it -atomically to complete the repair. -Finally, the storage from the old data structure are carefully reaped. - -Because ``xfs_scrub`` locks a primary object for the duration of the repair, -this is effectively an offline repair operation performed on a subset of the -filesystem. -This minimizes the complexity of the repair code because it is not necessary to -handle concurrent updates from other threads, nor is it necessary to access -any other part of the filesystem. -As a result, indexed structures can be rebuilt very quickly, and programs -trying to access the damaged structure will be blocked until repairs complete. -The only infrastructure needed by the repair code are the staging area for -observations and a means to write new structures to disk. -Despite these limitations, the advantage that online repair holds is clear: -targeted work on individual shards of the filesystem avoids total loss of -service. - -This mechanism is described in section 2.1 ("Off-Line Algorithm") of -V. Srinivasan and M. J. Carey, `"Performance of On-Line Index Construction -Algorithms" `_, -*Extending Database Technology*, pp. 293-309, 1992. - -Most primary metadata repair functions stage their intermediate results in an -in-memory array prior to formatting the new ondisk structure, which is very -similar to the list-based algorithm discussed in section 2.3 ("List-Based -Algorithms") of Srinivasan. -However, any data structure builder that maintains a resource lock for the -duration of the repair is *always* an offline algorithm. - -.. _secondary_metadata: - -Secondary Metadata -`````````````````` - -Metadata structures in this category reflect records found in primary metadata, -but are only needed for online fsck or for reorganization of the filesystem. 
-
-Secondary metadata include:
-
-- Reverse mapping information
-
-- Directory parent pointers
-
-This class of metadata is difficult for scrub to process because scrub attaches
-to the secondary object but needs to check primary metadata, which runs counter
-to the usual order of resource acquisition.
-Frequently, this means that full filesystem scans are necessary to rebuild the
-metadata.
-Check functions can be limited in scope to reduce runtime.
-Repairs, however, require a full scan of primary metadata, which can take a
-long time to complete.
-Under these conditions, ``xfs_scrub`` cannot lock resources for the entire
-duration of the repair.
-
-Instead, repair functions set up an in-memory staging structure to store
-observations.
-Depending on the requirements of the specific repair function, the staging
-index will either have the same format as the ondisk structure or a design
-specific to that repair function.
-The next step is to release all locks and start the filesystem scan.
-When the repair scanner needs to record an observation, the staging data are
-locked long enough to apply the update.
-While the filesystem scan is in progress, the repair function hooks the
-filesystem so that it can apply pending filesystem updates to the staging
-information.
-Once the scan is done, the owning object is re-locked, the live data is used to
-write a new ondisk structure, and the repairs are committed atomically.
-The hooks are disabled and the staging area is freed.
-Finally, the storage from the old data structure is carefully reaped.
-
-Introducing concurrency helps online repair avoid various locking problems, but
-comes at a high cost to code complexity.
-Live filesystem code has to be hooked so that the repair function can observe
-updates in progress.
-The staging area has to become a fully functional parallel structure so that
-updates can be merged from the hooks.
-Finally, the hook, the filesystem scan, and the inode locking model must be
-sufficiently well integrated that a hook event can decide if a given update
-should be applied to the staging structure.
-
-In theory, the scrub implementation could apply these same techniques for
-primary metadata, but doing so would make it massively more complex and less
-performant.
-Programs attempting to access the damaged structures are not blocked from
-operation, which may cause application failure or an unplanned filesystem
-shutdown.
-
-Inspiration for the secondary metadata repair strategy was drawn from section
-2.4 of Srinivasan above, and sections 2 ("NSF: Index Build Without Side-File")
-and 3.1.1 ("Duplicate Key Insert Problem") in C. Mohan, `"Algorithms for
-Creating Indexes for Very Large Tables Without Quiescing Updates"
-`_, 1992.
-
-The sidecar index mentioned above bears some resemblance to the side file
-method mentioned in Srinivasan and Mohan.
-Their method consists of an index builder that extracts relevant record data to
-build the new structure as quickly as possible; and an auxiliary structure that
-captures all updates that would be committed to the index by other threads were
-the new index already online.
-After the index building scan finishes, the updates recorded in the side file
-are applied to the new index.
-To avoid conflicts between the index builder and other writer threads, the
-builder maintains a publicly visible cursor that tracks the progress of the
-scan through the record space.
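-A hedged sketch of how such a cursor, a staging index, and a live update hook
-could fit together in an online repair function is shown below.
-All of the names are invented for this document and the locking is greatly
-simplified; this is not a description of the actual repair code.
-
-.. code-block:: c
-
-    /* Hypothetical staging state for a secondary metadata rebuild. */
-    struct repair_scan {
-        spinlock_t              lock;           /* protects the fields below */
-        uint64_t                scan_cursor;    /* highest record id scanned */
-        struct staging          *staging;       /* shadow of the new index */
-    };
-
-    /* Called from the index builder as it walks the primary metadata. */
-    static void scan_one_record(struct repair_scan *rs, uint64_t id,
-                                const struct record *rec)
-    {
-        spin_lock(&rs->lock);
-        staging_insert(rs->staging, rec);
-        rs->scan_cursor = id;                   /* publish scan progress */
-        spin_unlock(&rs->lock);
-    }
-
-    /* Called from a live update hook while the scan is running. */
-    static void hook_one_update(struct repair_scan *rs, uint64_t id,
-                                const struct record *rec)
-    {
-        spin_lock(&rs->lock);
-        if (id <= rs->scan_cursor)              /* scan already passed it */
-            staging_apply_update(rs->staging, rec);
-        spin_unlock(&rs->lock);
-    }
-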
-To avoid duplication of work between the side file and the index builder, side -file updates are elided when the record ID for the update is greater than the -cursor position within the record ID space. - -To minimize changes to the rest of the codebase, XFS online repair keeps the -replacement index hidden until it's completely ready to go. -In other words, there is no attempt to expose the keyspace of the new index -while repair is running. -The complexity of such an approach would be very high and perhaps more -appropriate to building *new* indices. - -**Future Work Question**: Can the full scan and live update code used to -facilitate a repair also be used to implement a comprehensive check? - -*Answer*: In theory, yes. Check would be much stronger if each scrub function -employed these live scans to build a shadow copy of the metadata and then -compared the shadow records to the ondisk records. -However, doing that is a fair amount more work than what the checking functions -do now. -The live scans and hooks were developed much later. -That in turn increases the runtime of those scrub functions. - -Summary Information -``````````````````` - -Metadata structures in this last category summarize the contents of primary -metadata records. -These are often used to speed up resource usage queries, and are many times -smaller than the primary metadata which they represent. - -Examples of summary information include: - -- Summary counts of free space and inodes - -- File link counts from directories - -- Quota resource usage counts - -Check and repair require full filesystem scans, but resource and lock -acquisition follow the same paths as regular filesystem accesses. - -The superblock summary counters have special requirements due to the underlying -implementation of the incore counters, and will be treated separately. -Check and repair of the other types of summary counters (quota resource counts -and file link counts) employ the same filesystem scanning and hooking -techniques as outlined above, but because the underlying data are sets of -integer counters, the staging data need not be a fully functional mirror of the -ondisk structure. - -Inspiration for quota and file link count repair strategies were drawn from -sections 2.12 ("Online Index Operations") through 2.14 ("Incremental View -Maintenance") of G. Graefe, `"Concurrent Queries and Updates in Summary Views -and Their Indexes" -`_, 2011. - -Since quotas are non-negative integer counts of resource usage, online -quotacheck can use the incremental view deltas described in section 2.14 to -track pending changes to the block and inode usage counts in each transaction, -and commit those changes to a dquot side file when the transaction commits. -Delta tracking is necessary for dquots because the index builder scans inodes, -whereas the data structure being rebuilt is an index of dquots. -Link count checking combines the view deltas and commit step into one because -it sets attributes of the objects being scanned instead of writing them to a -separate data structure. -Each online fsck function will be discussed as case studies later in this -document. - -Risk Management ---------------- - -During the development of online fsck, several risk factors were identified -that may make the feature unsuitable for certain distributors and users. -Steps can be taken to mitigate or eliminate those risks, though at a cost to -functionality. 
- -- **Decreased performance**: Adding metadata indices to the filesystem - increases the time cost of persisting changes to disk, and the reverse space - mapping and directory parent pointers are no exception. - System administrators who require the maximum performance can disable the - reverse mapping features at format time, though this choice dramatically - reduces the ability of online fsck to find inconsistencies and repair them. - -- **Incorrect repairs**: As with all software, there might be defects in the - software that result in incorrect repairs being written to the filesystem. - Systematic fuzz testing (detailed in the next section) is employed by the - authors to find bugs early, but it might not catch everything. - The kernel build system provides Kconfig options (``CONFIG_XFS_ONLINE_SCRUB`` - and ``CONFIG_XFS_ONLINE_REPAIR``) to enable distributors to choose not to - accept this risk. - The xfsprogs build system has a configure option (``--enable-scrub=no``) that - disables building of the ``xfs_scrub`` binary, though this is not a risk - mitigation if the kernel functionality remains enabled. - -- **Inability to repair**: Sometimes, a filesystem is too badly damaged to be - repairable. - If the keyspaces of several metadata indices overlap in some manner but a - coherent narrative cannot be formed from records collected, then the repair - fails. - To reduce the chance that a repair will fail with a dirty transaction and - render the filesystem unusable, the online repair functions have been - designed to stage and validate all new records before committing the new - structure. - -- **Misbehavior**: Online fsck requires many privileges -- raw IO to block - devices, opening files by handle, ignoring Unix discretionary access control, - and the ability to perform administrative changes. - Running this automatically in the background scares people, so the systemd - background service is configured to run with only the privileges required. - Obviously, this cannot address certain problems like the kernel crashing or - deadlocking, but it should be sufficient to prevent the scrub process from - escaping and reconfiguring the system. - The cron job does not have this protection. - -- **Fuzz Kiddiez**: There are many people now who seem to think that running - automated fuzz testing of ondisk artifacts to find mischievous behavior and - spraying exploit code onto the public mailing list for instant zero-day - disclosure is somehow of some social benefit. - In the view of this author, the benefit is realized only when the fuzz - operators help to **fix** the flaws, but this opinion apparently is not - widely shared among security "researchers". - The XFS maintainers' continuing ability to manage these events presents an - ongoing risk to the stability of the development process. - Automated testing should front-load some of the risk while the feature is - considered EXPERIMENTAL. - -Many of these risks are inherent to software programming. -Despite this, it is hoped that this new functionality will prove useful in -reducing unexpected downtime. - -3. Testing Plan -=============== - -As stated before, fsck tools have three main goals: - -1. Detect inconsistencies in the metadata; - -2. Eliminate those inconsistencies; and - -3. Minimize further loss of data. - -Demonstrations of correct operation are necessary to build users' confidence -that the software behaves within expectations. 
-Unfortunately, it was not really feasible to perform regular exhaustive testing -of every aspect of a fsck tool until the introduction of low-cost virtual -machines with high-IOPS storage. -With ample hardware availability in mind, the testing strategy for the online -fsck project involves differential analysis against the existing fsck tools and -systematic testing of every attribute of every type of metadata object. -Testing can be split into four major categories, as discussed below. - -Integrated Testing with fstests -------------------------------- - -The primary goal of any free software QA effort is to make testing as -inexpensive and widespread as possible to maximize the scaling advantages of -community. -In other words, testing should maximize the breadth of filesystem configuration -scenarios and hardware setups. -This improves code quality by enabling the authors of online fsck to find and -fix bugs early, and helps developers of new features to find integration -issues earlier in their development effort. - -The Linux filesystem community shares a common QA testing suite, -`fstests `_, for -functional and regression testing. -Even before development work began on online fsck, fstests (when run on XFS) -would run both the ``xfs_check`` and ``xfs_repair -n`` commands on the test and -scratch filesystems between each test. -This provides a level of assurance that the kernel and the fsck tools stay in -alignment about what constitutes consistent metadata. -During development of the online checking code, fstests was modified to run -``xfs_scrub -n`` between each test to ensure that the new checking code -produces the same results as the two existing fsck tools. - -To start development of online repair, fstests was modified to run -``xfs_repair`` to rebuild the filesystem's metadata indices between tests. -This ensures that offline repair does not crash, leave a corrupt filesystem -after it exists, or trigger complaints from the online check. -This also established a baseline for what can and cannot be repaired offline. -To complete the first phase of development of online repair, fstests was -modified to be able to run ``xfs_scrub`` in a "force rebuild" mode. -This enables a comparison of the effectiveness of online repair as compared to -the existing offline repair tools. - -General Fuzz Testing of Metadata Blocks ---------------------------------------- - -XFS benefits greatly from having a very robust debugging tool, ``xfs_db``. - -Before development of online fsck even began, a set of fstests were created -to test the rather common fault that entire metadata blocks get corrupted. -This required the creation of fstests library code that can create a filesystem -containing every possible type of metadata object. -Next, individual test cases were created to create a test filesystem, identify -a single block of a specific type of metadata object, trash it with the -existing ``blocktrash`` command in ``xfs_db``, and test the reaction of a -particular metadata validation strategy. - -This earlier test suite enabled XFS developers to test the ability of the -in-kernel validation functions and the ability of the offline fsck tool to -detect and eliminate the inconsistent metadata. -This part of the test suite was extended to cover online fsck in exactly the -same manner. - -In other words, for a given fstests filesystem configuration: - -* For each metadata object existing on the filesystem: - - * Write garbage to it - - * Test the reactions of: - - 1. 
The kernel verifiers to stop obviously bad metadata - 2. Offline repair (``xfs_repair``) to detect and fix - 3. Online repair (``xfs_scrub``) to detect and fix - -Targeted Fuzz Testing of Metadata Records ------------------------------------------ - -The testing plan for online fsck includes extending the existing fs testing -infrastructure to provide a much more powerful facility: targeted fuzz testing -of every metadata field of every metadata object in the filesystem. -``xfs_db`` can modify every field of every metadata structure in every -block in the filesystem to simulate the effects of memory corruption and -software bugs. -Given that fstests already contains the ability to create a filesystem -containing every metadata format known to the filesystem, ``xfs_db`` can be -used to perform exhaustive fuzz testing! - -For a given fstests filesystem configuration: - -* For each metadata object existing on the filesystem... - - * For each record inside that metadata object... - - * For each field inside that record... - - * For each conceivable type of transformation that can be applied to a bit field... - - 1. Clear all bits - 2. Set all bits - 3. Toggle the most significant bit - 4. Toggle the middle bit - 5. Toggle the least significant bit - 6. Add a small quantity - 7. Subtract a small quantity - 8. Randomize the contents - - * ...test the reactions of: - - 1. The kernel verifiers to stop obviously bad metadata - 2. Offline checking (``xfs_repair -n``) - 3. Offline repair (``xfs_repair``) - 4. Online checking (``xfs_scrub -n``) - 5. Online repair (``xfs_scrub``) - 6. Both repair tools (``xfs_scrub`` and then ``xfs_repair`` if online repair doesn't succeed) - -This is quite the combinatoric explosion! - -Fortunately, having this much test coverage makes it easy for XFS developers to -check the responses of XFS' fsck tools. -Since the introduction of the fuzz testing framework, these tests have been -used to discover incorrect repair code and missing functionality for entire -classes of metadata objects in ``xfs_repair``. -The enhanced testing was used to finalize the deprecation of ``xfs_check`` by -confirming that ``xfs_repair`` could detect at least as many corruptions as -the older tool. - -These tests have been very valuable for ``xfs_scrub`` in the same ways -- they -allow the online fsck developers to compare online fsck against offline fsck, -and they enable XFS developers to find deficiencies in the code base. - -Proposed patchsets include -`general fuzzer improvements -`_, -`fuzzing baselines -`_, -and `improvements in fuzz testing comprehensiveness -`_. - -Stress Testing --------------- - -A unique requirement to online fsck is the ability to operate on a filesystem -concurrently with regular workloads. -Although it is of course impossible to run ``xfs_scrub`` with *zero* observable -impact on the running system, the online repair code should never introduce -inconsistencies into the filesystem metadata, and regular workloads should -never notice resource starvation. -To verify that these conditions are being met, fstests has been enhanced in -the following ways: - -* For each scrub item type, create a test to exercise checking that item type - while running ``fsstress``. -* For each scrub item type, create a test to exercise repairing that item type - while running ``fsstress``. -* Race ``fsstress`` and ``xfs_scrub -n`` to ensure that checking the whole - filesystem doesn't cause problems. 
-* Race ``fsstress`` and ``xfs_scrub`` in force-rebuild mode to ensure that - force-repairing the whole filesystem doesn't cause problems. -* Race ``xfs_scrub`` in check and force-repair mode against ``fsstress`` while - freezing and thawing the filesystem. -* Race ``xfs_scrub`` in check and force-repair mode against ``fsstress`` while - remounting the filesystem read-only and read-write. -* The same, but running ``fsx`` instead of ``fsstress``. (Not done yet?) - -Success is defined by the ability to run all of these tests without observing -any unexpected filesystem shutdowns due to corrupted metadata, kernel hang -check warnings, or any other sort of mischief. - -Proposed patchsets include `general stress testing -`_ -and the `evolution of existing per-function stress testing -`_. - -4. User Interface -================= - -The primary user of online fsck is the system administrator, just like offline -repair. -Online fsck presents two modes of operation to administrators: -A foreground CLI process for online fsck on demand, and a background service -that performs autonomous checking and repair. - -Checking on Demand ------------------- - -For administrators who want the absolute freshest information about the -metadata in a filesystem, ``xfs_scrub`` can be run as a foreground process on -a command line. -The program checks every piece of metadata in the filesystem while the -administrator waits for the results to be reported, just like the existing -``xfs_repair`` tool. -Both tools share a ``-n`` option to perform a read-only scan, and a ``-v`` -option to increase the verbosity of the information reported. - -A new feature of ``xfs_scrub`` is the ``-x`` option, which employs the error -correction capabilities of the hardware to check data file contents. -The media scan is not enabled by default because it may dramatically increase -program runtime and consume a lot of bandwidth on older storage hardware. - -The output of a foreground invocation is captured in the system log. - -The ``xfs_scrub_all`` program walks the list of mounted filesystems and -initiates ``xfs_scrub`` for each of them in parallel. -It serializes scans for any filesystems that resolve to the same top level -kernel block device to prevent resource overconsumption. - -Background Service ------------------- - -To reduce the workload of system administrators, the ``xfs_scrub`` package -provides a suite of `systemd `_ timers and services that -run online fsck automatically on weekends by default. -The background service configures scrub to run with as little privilege as -possible, the lowest CPU and IO priority, and in a CPU-constrained single -threaded mode. -This can be tuned by the systemd administrator at any time to suit the latency -and throughput requirements of customer workloads. - -The output of the background service is also captured in the system log. -If desired, reports of failures (either due to inconsistencies or mere runtime -errors) can be emailed automatically by setting the ``EMAIL_ADDR`` environment -variable in the following service files: - -* ``xfs_scrub_fail@.service`` -* ``xfs_scrub_media_fail@.service`` -* ``xfs_scrub_all_fail.service`` - -The decision to enable the background scan is left to the system administrator. 
-This can be done by enabling either of the following services: - -* ``xfs_scrub_all.timer`` on systemd systems -* ``xfs_scrub_all.cron`` on non-systemd systems - -This automatic weekly scan is configured out of the box to perform an -additional media scan of all file data once per month. -This is less foolproof than, say, storing file data block checksums, but much -more performant if application software provides its own integrity checking, -redundancy can be provided elsewhere above the filesystem, or the storage -device's integrity guarantees are deemed sufficient. - -The systemd unit file definitions have been subjected to a security audit -(as of systemd 249) to ensure that the xfs_scrub processes have as little -access to the rest of the system as possible. -This was performed via ``systemd-analyze security``, after which privileges -were restricted to the minimum required, sandboxing was set up to the maximal -extent possible with sandboxing and system call filtering; and access to the -filesystem tree was restricted to the minimum needed to start the program and -access the filesystem being scanned. -The service definition files restrict CPU usage to 80% of one CPU core, and -apply as nice of a priority to IO and CPU scheduling as possible. -This measure was taken to minimize delays in the rest of the filesystem. -No such hardening has been performed for the cron job. - -Proposed patchset: -`Enabling the xfs_scrub background service -`_. - -Health Reporting ----------------- - -XFS caches a summary of each filesystem's health status in memory. -The information is updated whenever ``xfs_scrub`` is run, or whenever -inconsistencies are detected in the filesystem metadata during regular -operations. -System administrators should use the ``health`` command of ``xfs_spaceman`` to -download this information into a human-readable format. -If problems have been observed, the administrator can schedule a reduced -service window to run the online repair tool to correct the problem. -Failing that, the administrator can decide to schedule a maintenance window to -run the traditional offline repair tool to correct the problem. - -**Future Work Question**: Should the health reporting integrate with the new -inotify fs error notification system? -Would it be helpful for sysadmins to have a daemon to listen for corruption -notifications and initiate a repair? - -*Answer*: These questions remain unanswered, but should be a part of the -conversation with early adopters and potential downstream users of XFS. - -Proposed patchsets include -`wiring up health reports to correction returns -`_ -and -`preservation of sickness info during memory reclaim -`_. - -5. Kernel Algorithms and Data Structures -======================================== - -This section discusses the key algorithms and data structures of the kernel -code that provide the ability to check and repair metadata while the system -is running. -The first chapters in this section reveal the pieces that provide the -foundation for checking metadata. -The remainder of this section presents the mechanisms through which XFS -regenerates itself. - -Self Describing Metadata ------------------------- - -Starting with XFS version 5 in 2012, XFS updated the format of nearly every -ondisk block header to record a magic number, a checksum, a universally -"unique" identifier (UUID), an owner code, the ondisk address of the block, -and a log sequence number. 
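-A hypothetical header carrying these fields might look like the sketch below;
-the real ondisk headers differ from one block type to the next, so the
-structure and field names here are illustrative and are not an actual XFS
-definition.
-
-.. code-block:: c
-
-    /* Illustrative v5-style self describing block header. */
-    struct example_block_header {
-        __be32      magic;      /* identifies the structure type */
-        __be32      crc;        /* checksum of the whole block */
-        __be64      blkno;      /* ondisk address the block was written to */
-        __be64      lsn;        /* sequence number of the last logged write */
-        uuid_t      uuid;       /* filesystem this block belongs to */
-        __be64      owner;      /* inode or AG that owns this block */
-    };
-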
-When loading a block buffer from disk, the magic number, UUID, owner, and -ondisk address confirm that the retrieved block matches the specific owner of -the current filesystem, and that the information contained in the block is -supposed to be found at the ondisk address. -The first three components enable checking tools to disregard alleged metadata -that doesn't belong to the filesystem, and the fourth component enables the -filesystem to detect lost writes. - -Whenever a file system operation modifies a block, the change is submitted -to the log as part of a transaction. -The log then processes these transactions marking them done once they are -safely persisted to storage. -The logging code maintains the checksum and the log sequence number of the last -transactional update. -Checksums are useful for detecting torn writes and other discrepancies that can -be introduced between the computer and its storage devices. -Sequence number tracking enables log recovery to avoid applying out of date -log updates to the filesystem. - -These two features improve overall runtime resiliency by providing a means for -the filesystem to detect obvious corruption when reading metadata blocks from -disk, but these buffer verifiers cannot provide any consistency checking -between metadata structures. - -For more information, please see the documentation for -Documentation/filesystems/xfs-self-describing-metadata.rst - -Reverse Mapping ---------------- - -The original design of XFS (circa 1993) is an improvement upon 1980s Unix -filesystem design. -In those days, storage density was expensive, CPU time was scarce, and -excessive seek time could kill performance. -For performance reasons, filesystem authors were reluctant to add redundancy to -the filesystem, even at the cost of data integrity. -Filesystems designers in the early 21st century choose different strategies to -increase internal redundancy -- either storing nearly identical copies of -metadata, or more space-efficient encoding techniques. - -For XFS, a different redundancy strategy was chosen to modernize the design: -a secondary space usage index that maps allocated disk extents back to their -owners. -By adding a new index, the filesystem retains most of its ability to scale -well to heavily threaded workloads involving large datasets, since the primary -file metadata (the directory tree, the file block map, and the allocation -groups) remain unchanged. -Like any system that improves redundancy, the reverse-mapping feature increases -overhead costs for space mapping activities. -However, it has two critical advantages: first, the reverse index is key to -enabling online fsck and other requested functionality such as free space -defragmentation, better media failure reporting, and filesystem shrinking. -Second, the different ondisk storage format of the reverse mapping btree -defeats device-level deduplication because the filesystem requires real -redundancy. - -+--------------------------------------------------------------------------+ -| **Sidebar**: | -+--------------------------------------------------------------------------+ -| A criticism of adding the secondary index is that it does nothing to | -| improve the robustness of user data storage itself. | -| This is a valid point, but adding a new index for file data block | -| checksums increases write amplification by turning data overwrites into | -| copy-writes, which age the filesystem prematurely. 
| -| In keeping with thirty years of precedent, users who want file data | -| integrity can supply as powerful a solution as they require. | -| As for metadata, the complexity of adding a new secondary index of space | -| usage is much less than adding volume management and storage device | -| mirroring to XFS itself. | -| Perfection of RAID and volume management are best left to existing | -| layers in the kernel. | -+--------------------------------------------------------------------------+ - -The information captured in a reverse space mapping record is as follows: - -.. code-block:: c - - struct xfs_rmap_irec { - xfs_agblock_t rm_startblock; /* extent start block */ - xfs_extlen_t rm_blockcount; /* extent length */ - uint64_t rm_owner; /* extent owner */ - uint64_t rm_offset; /* offset within the owner */ - unsigned int rm_flags; /* state flags */ - }; - -The first two fields capture the location and size of the physical space, -in units of filesystem blocks. -The owner field tells scrub which metadata structure or file inode have been -assigned this space. -For space allocated to files, the offset field tells scrub where the space was -mapped within the file fork. -Finally, the flags field provides extra information about the space usage -- -is this an attribute fork extent? A file mapping btree extent? Or an -unwritten data extent? - -Online filesystem checking judges the consistency of each primary metadata -record by comparing its information against all other space indices. -The reverse mapping index plays a key role in the consistency checking process -because it contains a centralized alternate copy of all space allocation -information. -Program runtime and ease of resource acquisition are the only real limits to -what online checking can consult. -For example, a file data extent mapping can be checked against: - -* The absence of an entry in the free space information. -* The absence of an entry in the inode index. -* The absence of an entry in the reference count data if the file is not - marked as having shared extents. -* The correspondence of an entry in the reverse mapping information. - -There are several observations to make about reverse mapping indices: - -1. Reverse mappings can provide a positive affirmation of correctness if any of - the above primary metadata are in doubt. - The checking code for most primary metadata follows a path similar to the - one outlined above. - -2. Proving the consistency of secondary metadata with the primary metadata is - difficult because that requires a full scan of all primary space metadata, - which is very time intensive. - For example, checking a reverse mapping record for a file extent mapping - btree block requires locking the file and searching the entire btree to - confirm the block. - Instead, scrub relies on rigorous cross-referencing during the primary space - mapping structure checks. - -3. Consistency scans must use non-blocking lock acquisition primitives if the - required locking order is not the same order used by regular filesystem - operations. - For example, if the filesystem normally takes a file ILOCK before taking - the AGF buffer lock but scrub wants to take a file ILOCK while holding - an AGF buffer lock, scrub cannot block on that second acquisition. - This means that forward progress during this part of a scan of the reverse - mapping data cannot be guaranteed if system load is heavy. - -In summary, reverse mappings play a key role in reconstruction of primary -metadata. 
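-As a sketch of what that reconstruction can look like, consider rebuilding an
-allocation group's free space index by walking its reverse mapping records,
-as below.
-The callback signature and the bitmap helpers are invented for this document,
-and the staging and commit steps are deliberately omitted from the sketch.
-
-.. code-block:: c
-
-    /*
-     * Hypothetical outline of rebuilding an allocation group's free space
-     * index from its reverse mapping records: any block not claimed by any
-     * rmap record must be free.
-     */
-    static int mark_used_space(struct xfs_rmap_irec *rec, void *priv)
-    {
-        struct block_bitmap *used = priv;
-
-        /* Record every block claimed by any owner as allocated. */
-        bitmap_set_range(used, rec->rm_startblock, rec->rm_blockcount);
-        return 0;
-    }
-
-    /*
-     * After every rmap record has been visited by mark_used_space(), the
-     * gaps in the bitmap are the candidate extents for the new free space
-     * index.
-     */
-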
-The details of how these records are staged, written to disk, and committed
-into the filesystem are covered in subsequent sections.
-
-Checking and Cross-Referencing
-------------------------------
-
-The first step of checking a metadata structure is to examine every record
-contained within the structure and its relationship with the rest of the
-system.
-XFS contains multiple layers of checking to try to prevent inconsistent
-metadata from wreaking havoc on the system.
-Each of these layers contributes information that helps the kernel to make
-the following decisions about the health of a metadata structure:
-
-- Is a part of this structure obviously corrupt (``XFS_SCRUB_OFLAG_CORRUPT``) ?
-- Is this structure inconsistent with the rest of the system
-  (``XFS_SCRUB_OFLAG_XCORRUPT``) ?
-- Is there so much damage around the filesystem that cross-referencing is not
-  possible (``XFS_SCRUB_OFLAG_XFAIL``) ?
-- Can the structure be optimized to improve performance or reduce the size of
-  metadata (``XFS_SCRUB_OFLAG_PREEN``) ?
-- Does the structure contain data that is not inconsistent but deserves review
-  by the system administrator (``XFS_SCRUB_OFLAG_WARNING``) ?
-
-The following sections describe how the metadata scrubbing process works.
-
-Metadata Buffer Verification
-````````````````````````````
-
-The lowest layer of metadata protection in XFS consists of the metadata
-verifiers built into the buffer cache.
-These functions perform inexpensive internal consistency checking of the block
-itself, and answer these questions:
-
-- Does the block belong to this filesystem?
-
-- Does the block belong to the structure that asked for the read?
-  This assumes that metadata blocks only have one owner, which is always true
-  in XFS.
-
-- Is the type of data stored in the block within a reasonable range of what
-  scrub is expecting?
-
-- Does the physical location of the block match the location it was read from?
-
-- Does the block checksum match the data?
-
-The scope of the protections here is very limited -- verifiers can only
-establish that the filesystem code is reasonably free of gross corruption bugs
-and that the storage system is reasonably competent at retrieval.
-Corruption problems observed at runtime cause the generation of health reports,
-failed system calls, and in the extreme case, filesystem shutdowns if the
-corrupt metadata forces the cancellation of a dirty transaction.
-
-Every online fsck scrubbing function is expected to read every ondisk metadata
-block of a structure in the course of checking the structure.
-Corruption problems observed during a check are immediately reported to
-userspace as corruption; during a cross-reference, they are reported as a
-failure to cross-reference once the full examination is complete.
-Reads satisfied by a buffer already in cache (and hence already verified)
-bypass these checks.
-
-Internal Consistency Checks
-```````````````````````````
-
-After the buffer cache, the next level of metadata protection is the internal
-record verification code built into the filesystem.
-These checks are split between the buffer verifiers, the in-filesystem users of
-the buffer cache, and the scrub code itself, depending on the amount of higher
-level context required.
-The scope of checking is still internal to the block.
-These higher level checking functions answer these questions:
-
-- Does the type of data stored in the block match what scrub is expecting?
-
-- Does the block belong to the owning structure that asked for the read?
- -- If the block contains records, do the records fit within the block? - -- If the block tracks internal free space information, is it consistent with - the record areas? - -- Are the records contained inside the block free of obvious corruptions? - -Record checks in this category are more rigorous and more time-intensive. -For example, block pointers and inumbers are checked to ensure that they point -within the dynamically allocated parts of an allocation group and within -the filesystem. -Names are checked for invalid characters, and flags are checked for invalid -combinations. -Other record attributes are checked for sensible values. -Btree records spanning an interval of the btree keyspace are checked for -correct order and lack of mergeability (except for file fork mappings). -For performance reasons, regular code may skip some of these checks unless -debugging is enabled or a write is about to occur. -Scrub functions, of course, must check all possible problems. - -Validation of Userspace-Controlled Record Attributes -```````````````````````````````````````````````````` - -Various pieces of filesystem metadata are directly controlled by userspace. -Because of this nature, validation work cannot be more precise than checking -that a value is within the possible range. -These fields include: - -- Superblock fields controlled by mount options -- Filesystem labels -- File timestamps -- File permissions -- File size -- File flags -- Names present in directory entries, extended attribute keys, and filesystem - labels -- Extended attribute key namespaces -- Extended attribute values -- File data block contents -- Quota limits -- Quota timer expiration (if resource usage exceeds the soft limit) - -Cross-Referencing Space Metadata -```````````````````````````````` - -After internal block checks, the next higher level of checking is -cross-referencing records between metadata structures. -For regular runtime code, the cost of these checks is considered to be -prohibitively expensive, but as scrub is dedicated to rooting out -inconsistencies, it must pursue all avenues of inquiry. -The exact set of cross-referencing is highly dependent on the context of the -data structure being checked. - -The XFS btree code has keyspace scanning functions that online fsck uses to -cross reference one structure with another. -Specifically, scrub can scan the key space of an index to determine if that -keyspace is fully, sparsely, or not at all mapped to records. -For the reverse mapping btree, it is possible to mask parts of the key for the -purposes of performing a keyspace scan so that scrub can decide if the rmap -btree contains records mapping a certain extent of physical space without the -sparsenses of the rest of the rmap keyspace getting in the way. - -Btree blocks undergo the following checks before cross-referencing: - -- Does the type of data stored in the block match what scrub is expecting? - -- Does the block belong to the owning structure that asked for the read? - -- Do the records fit within the block? - -- Are the records contained inside the block free of obvious corruptions? - -- Are the name hashes in the correct order? - -- Do node pointers within the btree point to valid block addresses for the type - of btree? - -- Do child pointers point towards the leaves? - -- Do sibling pointers point across the same level? - -- For each node block record, does the record key accurate reflect the contents - of the child block? - -Space allocation records are cross-referenced as follows: - -1. 
Any space mentioned by any metadata structure are cross-referenced as - follows: - - - Does the reverse mapping index list only the appropriate owner as the - owner of each block? - - - Are none of the blocks claimed as free space? - - - If these aren't file data blocks, are none of the blocks claimed as space - shared by different owners? - -2. Btree blocks are cross-referenced as follows: - - - Everything in class 1 above. - - - If there's a parent node block, do the keys listed for this block match the - keyspace of this block? - - - Do the sibling pointers point to valid blocks? Of the same level? - - - Do the child pointers point to valid blocks? Of the next level down? - -3. Free space btree records are cross-referenced as follows: - - - Everything in class 1 and 2 above. - - - Does the reverse mapping index list no owners of this space? - - - Is this space not claimed by the inode index for inodes? - - - Is it not mentioned by the reference count index? - - - Is there a matching record in the other free space btree? - -4. Inode btree records are cross-referenced as follows: - - - Everything in class 1 and 2 above. - - - Is there a matching record in free inode btree? - - - Do cleared bits in the holemask correspond with inode clusters? - - - Do set bits in the freemask correspond with inode records with zero link - count? - -5. Inode records are cross-referenced as follows: - - - Everything in class 1. - - - Do all the fields that summarize information about the file forks actually - match those forks? - - - Does each inode with zero link count correspond to a record in the free - inode btree? - -6. File fork space mapping records are cross-referenced as follows: - - - Everything in class 1 and 2 above. - - - Is this space not mentioned by the inode btrees? - - - If this is a CoW fork mapping, does it correspond to a CoW entry in the - reference count btree? - -7. Reference count records are cross-referenced as follows: - - - Everything in class 1 and 2 above. - - - Within the space subkeyspace of the rmap btree (that is to say, all - records mapped to a particular space extent and ignoring the owner info), - are there the same number of reverse mapping records for each block as the - reference count record claims? - -Proposed patchsets are the series to find gaps in -`refcount btree -`_, -`inode btree -`_, and -`rmap btree -`_ records; -to find -`mergeable records -`_; -and to -`improve cross referencing with rmap -`_ -before starting a repair. - -Checking Extended Attributes -```````````````````````````` - -Extended attributes implement a key-value store that enable fragments of data -to be attached to any file. -Both the kernel and userspace can access the keys and values, subject to -namespace and privilege restrictions. -Most typically these fragments are metadata about the file -- origins, security -contexts, user-supplied labels, indexing information, etc. - -Names can be as long as 255 bytes and can exist in several different -namespaces. -Values can be as large as 64KB. -A file's extended attributes are stored in blocks mapped by the attr fork. -The mappings point to leaf blocks, remote value blocks, or dabtree blocks. -Block 0 in the attribute fork is always the top of the structure, but otherwise -each of the three types of blocks can be found at any offset in the attr fork. -Leaf blocks contain attribute key records that point to the name and the value. -Names are always stored elsewhere in the same leaf block. 
-Values that are less than 3/4 the size of a filesystem block are also stored -elsewhere in the same leaf block. -Remote value blocks contain values that are too large to fit inside a leaf. -If the leaf information exceeds a single filesystem block, a dabtree (also -rooted at block 0) is created to map hashes of the attribute names to leaf -blocks in the attr fork. - -Checking an extended attribute structure is not so straightforward due to the -lack of separation between attr blocks and index blocks. -Scrub must read each block mapped by the attr fork and ignore the non-leaf -blocks: - -1. Walk the dabtree in the attr fork (if present) to ensure that there are no - irregularities in the blocks or dabtree mappings that do not point to - attr leaf blocks. - -2. Walk the blocks of the attr fork looking for leaf blocks. - For each entry inside a leaf: - - a. Validate that the name does not contain invalid characters. - - b. Read the attr value. - This performs a named lookup of the attr name to ensure the correctness - of the dabtree. - If the value is stored in a remote block, this also validates the - integrity of the remote value block. - -Checking and Cross-Referencing Directories -`````````````````````````````````````````` - -The filesystem directory tree is a directed acylic graph structure, with files -constituting the nodes, and directory entries (dirents) constituting the edges. -Directories are a special type of file containing a set of mappings from a -255-byte sequence (name) to an inumber. -These are called directory entries, or dirents for short. -Each directory file must have exactly one directory pointing to the file. -A root directory points to itself. -Directory entries point to files of any type. -Each non-directory file may have multiple directories point to it. - -In XFS, directories are implemented as a file containing up to three 32GB -partitions. -The first partition contains directory entry data blocks. -Each data block contains variable-sized records associating a user-provided -name with an inumber and, optionally, a file type. -If the directory entry data grows beyond one block, the second partition (which -exists as post-EOF extents) is populated with a block containing free space -information and an index that maps hashes of the dirent names to directory data -blocks in the first partition. -This makes directory name lookups very fast. -If this second partition grows beyond one block, the third partition is -populated with a linear array of free space information for faster -expansions. -If the free space has been separated and the second partition grows again -beyond one block, then a dabtree is used to map hashes of dirent names to -directory data blocks. - -Checking a directory is pretty straightforward: - -1. Walk the dabtree in the second partition (if present) to ensure that there - are no irregularities in the blocks or dabtree mappings that do not point to - dirent blocks. - -2. Walk the blocks of the first partition looking for directory entries. - Each dirent is checked as follows: - - a. Does the name contain no invalid characters? - - b. Does the inumber correspond to an actual, allocated inode? - - c. Does the child inode have a nonzero link count? - - d. If a file type is included in the dirent, does it match the type of the - inode? - - e. If the child is a subdirectory, does the child's dotdot pointer point - back to the parent? - - f. 
If the directory has a second partition, perform a named lookup of the - dirent name to ensure the correctness of the dabtree. - -3. Walk the free space list in the third partition (if present) to ensure that - the free spaces it describes are really unused. - -Checking operations involving :ref:`parents ` and -:ref:`file link counts ` are discussed in more detail in later -sections. - -Checking Directory/Attribute Btrees -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -As stated in previous sections, the directory/attribute btree (dabtree) index -maps user-provided names to improve lookup times by avoiding linear scans. -Internally, it maps a 32-bit hash of the name to a block offset within the -appropriate file fork. - -The internal structure of a dabtree closely resembles the btrees that record -fixed-size metadata records -- each dabtree block contains a magic number, a -checksum, sibling pointers, a UUID, a tree level, and a log sequence number. -The format of leaf and node records are the same -- each entry points to the -next level down in the hierarchy, with dabtree node records pointing to dabtree -leaf blocks, and dabtree leaf records pointing to non-dabtree blocks elsewhere -in the fork. - -Checking and cross-referencing the dabtree is very similar to what is done for -space btrees: - -- Does the type of data stored in the block match what scrub is expecting? - -- Does the block belong to the owning structure that asked for the read? - -- Do the records fit within the block? - -- Are the records contained inside the block free of obvious corruptions? - -- Are the name hashes in the correct order? - -- Do node pointers within the dabtree point to valid fork offsets for dabtree - blocks? - -- Do leaf pointers within the dabtree point to valid fork offsets for directory - or attr leaf blocks? - -- Do child pointers point towards the leaves? - -- Do sibling pointers point across the same level? - -- For each dabtree node record, does the record key accurate reflect the - contents of the child dabtree block? - -- For each dabtree leaf record, does the record key accurate reflect the - contents of the directory or attr block? - -Cross-Referencing Summary Counters -`````````````````````````````````` - -XFS maintains three classes of summary counters: available resources, quota -resource usage, and file link counts. - -In theory, the amount of available resources (data blocks, inodes, realtime -extents) can be found by walking the entire filesystem. -This would make for very slow reporting, so a transactional filesystem can -maintain summaries of this information in the superblock. -Cross-referencing these values against the filesystem metadata should be a -simple matter of walking the free space and inode metadata in each AG and the -realtime bitmap, but there are complications that will be discussed in -:ref:`more detail ` later. - -:ref:`Quota usage ` and :ref:`file link count ` -checking are sufficiently complicated to warrant separate sections. - -Post-Repair Reverification -`````````````````````````` - -After performing a repair, the checking code is run a second time to validate -the new structure, and the results of the health assessment are recorded -internally and returned to the calling process. -This step is critical for enabling system administrator to monitor the status -of the filesystem and the progress of any repairs. -For developers, it is a useful means to judge the efficacy of error detection -and correction in the online and offline checking tools. 
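As a concrete illustration of the check/repair/re-check cycle, here is a
minimal userspace sketch that drives a scrub of the superblock of AG 0 through
the scrub ioctl.
The header path, the choice of scrub type, and the error handling are
illustrative assumptions; real callers such as ``xfs_scrub`` perform
considerably more bookkeeping.

.. code-block:: c

    /* Minimal sketch of check -> repair -> re-check via the scrub ioctl. */
    #include <sys/ioctl.h>
    #include <xfs/xfs.h>    /* assumed xfsprogs header for the scrub UAPI */

    int scrub_and_repair_ag0_sb(int fd)
    {
            struct xfs_scrub_metadata sm = {
                    .sm_type = XFS_SCRUB_TYPE_SB,
                    .sm_agno = 0,
            };

            /* First pass: check only. */
            if (ioctl(fd, XFS_IOC_SCRUB_METADATA, &sm) < 0)
                    return -1;

            if (sm.sm_flags & (XFS_SCRUB_OFLAG_CORRUPT |
                               XFS_SCRUB_OFLAG_XCORRUPT)) {
                    /*
                     * Second pass: request a repair.  The kernel re-runs the
                     * checking code afterwards and reports the result in the
                     * output flags.
                     */
                    sm.sm_flags = XFS_SCRUB_IFLAG_REPAIR;
                    if (ioctl(fd, XFS_IOC_SCRUB_METADATA, &sm) < 0)
                            return -1;
            }

            /* Nonzero means the structure is still corrupt after repair. */
            return !!(sm.sm_flags & XFS_SCRUB_OFLAG_CORRUPT);
    }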
- -Eventual Consistency vs. Online Fsck ------------------------------------- - -Complex operations can make modifications to multiple per-AG data structures -with a chain of transactions. -These chains, once committed to the log, are restarted during log recovery if -the system crashes while processing the chain. -Because the AG header buffers are unlocked between transactions within a chain, -online checking must coordinate with chained operations that are in progress to -avoid incorrectly detecting inconsistencies due to pending chains. -Furthermore, online repair must not run when operations are pending because -the metadata are temporarily inconsistent with each other, and rebuilding is -not possible. - -Only online fsck has this requirement of total consistency of AG metadata, and -should be relatively rare as compared to filesystem change operations. -Online fsck coordinates with transaction chains as follows: - -* For each AG, maintain a count of intent items targeting that AG. - The count should be bumped whenever a new item is added to the chain. - The count should be dropped when the filesystem has locked the AG header - buffers and finished the work. - -* When online fsck wants to examine an AG, it should lock the AG header - buffers to quiesce all transaction chains that want to modify that AG. - If the count is zero, proceed with the checking operation. - If it is nonzero, cycle the buffer locks to allow the chain to make forward - progress. - -This may lead to online fsck taking a long time to complete, but regular -filesystem updates take precedence over background checking activity. -Details about the discovery of this situation are presented in the -:ref:`next section `, and details about the solution -are presented :ref:`after that`. - -.. _chain_coordination: - -Discovery of the Problem -```````````````````````` - -Midway through the development of online scrubbing, the fsstress tests -uncovered a misinteraction between online fsck and compound transaction chains -created by other writer threads that resulted in false reports of metadata -inconsistency. -The root cause of these reports is the eventual consistency model introduced by -the expansion of deferred work items and compound transaction chains when -reverse mapping and reflink were introduced. - -Originally, transaction chains were added to XFS to avoid deadlocks when -unmapping space from files. -Deadlock avoidance rules require that AGs only be locked in increasing order, -which makes it impossible (say) to use a single transaction to free a space -extent in AG 7 and then try to free a now superfluous block mapping btree block -in AG 3. -To avoid these kinds of deadlocks, XFS creates Extent Freeing Intent (EFI) log -items to commit to freeing some space in one transaction while deferring the -actual metadata updates to a fresh transaction. -The transaction sequence looks like this: - -1. The first transaction contains a physical update to the file's block mapping - structures to remove the mapping from the btree blocks. - It then attaches to the in-memory transaction an action item to schedule - deferred freeing of space. - Concretely, each transaction maintains a list of ``struct - xfs_defer_pending`` objects, each of which maintains a list of ``struct - xfs_extent_free_item`` objects. - Returning to the example above, the action item tracks the freeing of both - the unmapped space from AG 7 and the block mapping btree (BMBT) block from - AG 3. 
- Deferred frees recorded in this manner are committed in the log by creating - an EFI log item from the ``struct xfs_extent_free_item`` object and - attaching the log item to the transaction. - When the log is persisted to disk, the EFI item is written into the ondisk - transaction record. - EFIs can list up to 16 extents to free, all sorted in AG order. - -2. The second transaction contains a physical update to the free space btrees - of AG 3 to release the former BMBT block and a second physical update to the - free space btrees of AG 7 to release the unmapped file space. - Observe that the physical updates are resequenced in the correct order - when possible. - Attached to the transaction is a an extent free done (EFD) log item. - The EFD contains a pointer to the EFI logged in transaction #1 so that log - recovery can tell if the EFI needs to be replayed. - -If the system goes down after transaction #1 is written back to the filesystem -but before #2 is committed, a scan of the filesystem metadata would show -inconsistent filesystem metadata because there would not appear to be any owner -of the unmapped space. -Happily, log recovery corrects this inconsistency for us -- when recovery finds -an intent log item but does not find a corresponding intent done item, it will -reconstruct the incore state of the intent item and finish it. -In the example above, the log must replay both frees described in the recovered -EFI to complete the recovery phase. - -There are subtleties to XFS' transaction chaining strategy to consider: - -* Log items must be added to a transaction in the correct order to prevent - conflicts with principal objects that are not held by the transaction. - In other words, all per-AG metadata updates for an unmapped block must be - completed before the last update to free the extent, and extents should not - be reallocated until that last update commits to the log. - -* AG header buffers are released between each transaction in a chain. - This means that other threads can observe an AG in an intermediate state, - but as long as the first subtlety is handled, this should not affect the - correctness of filesystem operations. - -* Unmounting the filesystem flushes all pending work to disk, which means that - offline fsck never sees the temporary inconsistencies caused by deferred - work item processing. - -In this manner, XFS employs a form of eventual consistency to avoid deadlocks -and increase parallelism. 
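The chaining behaviour described above can be summarized schematically.
The sketch below elides all argument lists because they are not relevant to the
chaining; ``xfs_trans_alloc``, ``xfs_trans_ijoin``, ``xfs_bunmapi``,
``xfs_defer_finish``, and ``xfs_trans_commit`` are real entry points, but the
comments describe only the EFI/EFD pattern from the example, not exact
implementation details.

.. code-block:: c

    /* Schematic only -- argument lists elided. */
    tp = xfs_trans_alloc(...);
    xfs_trans_ijoin(tp, ip, ...);

    /*
     * Transaction 1: physically update the BMBT and queue deferred
     * extent-free work items to the transaction.
     */
    xfs_bunmapi(tp, ip, ...);

    /*
     * Log EFIs for the queued frees, roll to a follow-up transaction,
     * perform the frees there in AG order, and log EFDs that point back
     * at the earlier EFIs.
     */
    xfs_defer_finish(&tp);

    xfs_trans_commit(tp);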
- -During the design phase of the reverse mapping and reflink features, it was -decided that it was impractical to cram all the reverse mapping updates for a -single filesystem change into a single transaction because a single file -mapping operation can explode into many small updates: - -* The block mapping update itself -* A reverse mapping update for the block mapping update -* Fixing the freelist -* A reverse mapping update for the freelist fix - -* A shape change to the block mapping btree -* A reverse mapping update for the btree update -* Fixing the freelist (again) -* A reverse mapping update for the freelist fix - -* An update to the reference counting information -* A reverse mapping update for the refcount update -* Fixing the freelist (a third time) -* A reverse mapping update for the freelist fix - -* Freeing any space that was unmapped and not owned by any other file -* Fixing the freelist (a fourth time) -* A reverse mapping update for the freelist fix - -* Freeing the space used by the block mapping btree -* Fixing the freelist (a fifth time) -* A reverse mapping update for the freelist fix - -Free list fixups are not usually needed more than once per AG per transaction -chain, but it is theoretically possible if space is very tight. -For copy-on-write updates this is even worse, because this must be done once to -remove the space from a staging area and again to map it into the file! - -To deal with this explosion in a calm manner, XFS expands its use of deferred -work items to cover most reverse mapping updates and all refcount updates. -This reduces the worst case size of transaction reservations by breaking the -work into a long chain of small updates, which increases the degree of eventual -consistency in the system. -Again, this generally isn't a problem because XFS orders its deferred work -items carefully to avoid resource reuse conflicts between unsuspecting threads. - -However, online fsck changes the rules -- remember that although physical -updates to per-AG structures are coordinated by locking the buffers for AG -headers, buffer locks are dropped between transactions. -Once scrub acquires resources and takes locks for a data structure, it must do -all the validation work without releasing the lock. -If the main lock for a space btree is an AG header buffer lock, scrub may have -interrupted another thread that is midway through finishing a chain. -For example, if a thread performing a copy-on-write has completed a reverse -mapping update but not the corresponding refcount update, the two AG btrees -will appear inconsistent to scrub and an observation of corruption will be -recorded. This observation will not be correct. -If a repair is attempted in this state, the results will be catastrophic! - -Several other solutions to this problem were evaluated upon discovery of this -flaw and rejected: - -1. Add a higher level lock to allocation groups and require writer threads to - acquire the higher level lock in AG order before making any changes. - This would be very difficult to implement in practice because it is - difficult to determine which locks need to be obtained, and in what order, - without simulating the entire operation. - Performing a dry run of a file operation to discover necessary locks would - make the filesystem very slow. - -2. Make the deferred work coordinator code aware of consecutive intent items - targeting the same AG and have it hold the AG header buffers locked across - the transaction roll between updates. 
- This would introduce a lot of complexity into the coordinator since it is - only loosely coupled with the actual deferred work items. - It would also fail to solve the problem because deferred work items can - generate new deferred subtasks, but all subtasks must be complete before - work can start on a new sibling task. - -3. Teach online fsck to walk all transactions waiting for whichever lock(s) - protect the data structure being scrubbed to look for pending operations. - The checking and repair operations must factor these pending operations into - the evaluations being performed. - This solution is a nonstarter because it is *extremely* invasive to the main - filesystem. - -.. _intent_drains: - -Intent Drains -````````````` - -Online fsck uses an atomic intent item counter and lock cycling to coordinate -with transaction chains. -There are two key properties to the drain mechanism. -First, the counter is incremented when a deferred work item is *queued* to a -transaction, and it is decremented after the associated intent done log item is -*committed* to another transaction. -The second property is that deferred work can be added to a transaction without -holding an AG header lock, but per-AG work items cannot be marked done without -locking that AG header buffer to log the physical updates and the intent done -log item. -The first property enables scrub to yield to running transaction chains, which -is an explicit deprioritization of online fsck to benefit file operations. -The second property of the drain is key to the correct coordination of scrub, -since scrub will always be able to decide if a conflict is possible. - -For regular filesystem code, the drain works as follows: - -1. Call the appropriate subsystem function to add a deferred work item to a - transaction. - -2. The function calls ``xfs_defer_drain_bump`` to increase the counter. - -3. When the deferred item manager wants to finish the deferred work item, it - calls ``->finish_item`` to complete it. - -4. The ``->finish_item`` implementation logs some changes and calls - ``xfs_defer_drain_drop`` to decrease the sloppy counter and wake up any threads - waiting on the drain. - -5. The subtransaction commits, which unlocks the resource associated with the - intent item. - -For scrub, the drain works as follows: - -1. Lock the resource(s) associated with the metadata being scrubbed. - For example, a scan of the refcount btree would lock the AGI and AGF header - buffers. - -2. If the counter is zero (``xfs_defer_drain_busy`` returns false), there are no - chains in progress and the operation may proceed. - -3. Otherwise, release the resources grabbed in step 1. - -4. Wait for the intent counter to reach zero (``xfs_defer_drain_intents``), then go - back to step 1 unless a signal has been caught. - -To avoid polling in step 4, the drain provides a waitqueue for scrub threads to -be woken up whenever the intent count drops to zero. - -The proposed patchset is the -`scrub intent drain series -`_. - -.. _jump_labels: - -Static Keys (aka Jump Label Patching) -````````````````````````````````````` - -Online fsck for XFS separates the regular filesystem from the checking and -repair code as much as possible. -However, there are a few parts of online fsck (such as the intent drains, and -later, live update hooks) where it is useful for the online fsck code to know -what's going on in the rest of the filesystem. 
-Since it is not expected that online fsck will be constantly running in the -background, it is very important to minimize the runtime overhead imposed by -these hooks when online fsck is compiled into the kernel but not actively -running on behalf of userspace. -Taking locks in the hot path of a writer thread to access a data structure only -to find that no further action is necessary is expensive -- on the author's -computer, this have an overhead of 40-50ns per access. -Fortunately, the kernel supports dynamic code patching, which enables XFS to -replace a static branch to hook code with ``nop`` sleds when online fsck isn't -running. -This sled has an overhead of however long it takes the instruction decoder to -skip past the sled, which seems to be on the order of less than 1ns and -does not access memory outside of instruction fetching. - -When online fsck enables the static key, the sled is replaced with an -unconditional branch to call the hook code. -The switchover is quite expensive (~22000ns) but is paid entirely by the -program that invoked online fsck, and can be amortized if multiple threads -enter online fsck at the same time, or if multiple filesystems are being -checked at the same time. -Changing the branch direction requires taking the CPU hotplug lock, and since -CPU initialization requires memory allocation, online fsck must be careful not -to change a static key while holding any locks or resources that could be -accessed in the memory reclaim paths. -To minimize contention on the CPU hotplug lock, care should be taken not to -enable or disable static keys unnecessarily. - -Because static keys are intended to minimize hook overhead for regular -filesystem operations when xfs_scrub is not running, the intended usage -patterns are as follows: - -- The hooked part of XFS should declare a static-scoped static key that - defaults to false. - The ``DEFINE_STATIC_KEY_FALSE`` macro takes care of this. - The static key itself should be declared as a ``static`` variable. - -- When deciding to invoke code that's only used by scrub, the regular - filesystem should call the ``static_branch_unlikely`` predicate to avoid the - scrub-only hook code if the static key is not enabled. - -- The regular filesystem should export helper functions that call - ``static_branch_inc`` to enable and ``static_branch_dec`` to disable the - static key. - Wrapper functions make it easy to compile out the relevant code if the kernel - distributor turns off online fsck at build time. - -- Scrub functions wanting to turn on scrub-only XFS functionality should call - the ``xchk_fsgates_enable`` from the setup function to enable a specific - hook. - This must be done before obtaining any resources that are used by memory - reclaim. - Callers had better be sure they really need the functionality gated by the - static key; the ``TRY_HARDER`` flag is useful here. - -Online scrub has resource acquisition helpers (e.g. ``xchk_perag_lock``) to -handle locking AGI and AGF buffers for all scrubber functions. -If it detects a conflict between scrub and the running transactions, it will -try to wait for intents to complete. -If the caller of the helper has not enabled the static key, the helper will -return -EDEADLOCK, which should result in the scrub being restarted with the -``TRY_HARDER`` flag set. -The scrub setup function should detect that flag, enable the static key, and -try the scrub again. -Scrub teardown disables all static keys obtained by ``xchk_fsgates_enable``. 
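A condensed sketch of this usage pattern follows.
The hook name and its payload are invented for illustration; only the jump
label interfaces themselves (``DEFINE_STATIC_KEY_FALSE``,
``static_branch_unlikely``, ``static_branch_inc``, and ``static_branch_dec``)
are real kernel APIs.

.. code-block:: c

    #include <linux/jump_label.h>

    /* Defaults to false, so the branch is a nop sled while scrub is idle. */
    static DEFINE_STATIC_KEY_FALSE(xfs_example_hook_switch);

    /* Called from a hot path in the regular filesystem. */
    void xfs_example_hook(unsigned int agno)
    {
            if (static_branch_unlikely(&xfs_example_hook_switch))
                    xfs_example_notify_scrub(agno);  /* hypothetical notifier */
    }

    /* Wrappers exported to the scrub setup and teardown code. */
    void xfs_example_hook_enable(void)
    {
            static_branch_inc(&xfs_example_hook_switch);
    }

    void xfs_example_hook_disable(void)
    {
            static_branch_dec(&xfs_example_hook_switch);
    }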
- -For more information, please see the kernel documentation of -Documentation/staging/static-keys.rst. - -.. _xfile: - -Pageable Kernel Memory ----------------------- - -Some online checking functions work by scanning the filesystem to build a -shadow copy of an ondisk metadata structure in memory and comparing the two -copies. -For online repair to rebuild a metadata structure, it must compute the record -set that will be stored in the new structure before it can persist that new -structure to disk. -Ideally, repairs complete with a single atomic commit that introduces -a new data structure. -To meet these goals, the kernel needs to collect a large amount of information -in a place that doesn't require the correct operation of the filesystem. - -Kernel memory isn't suitable because: - -* Allocating a contiguous region of memory to create a C array is very - difficult, especially on 32-bit systems. - -* Linked lists of records introduce double pointer overhead which is very high - and eliminate the possibility of indexed lookups. - -* Kernel memory is pinned, which can drive the system into OOM conditions. - -* The system might not have sufficient memory to stage all the information. - -At any given time, online fsck does not need to keep the entire record set in -memory, which means that individual records can be paged out if necessary. -Continued development of online fsck demonstrated that the ability to perform -indexed data storage would also be very useful. -Fortunately, the Linux kernel already has a facility for byte-addressable and -pageable storage: tmpfs. -In-kernel graphics drivers (most notably i915) take advantage of tmpfs files -to store intermediate data that doesn't need to be in memory at all times, so -that usage precedent is already established. -Hence, the ``xfile`` was born! - -+--------------------------------------------------------------------------+ -| **Historical Sidebar**: | -+--------------------------------------------------------------------------+ -| The first edition of online repair inserted records into a new btree as | -| it found them, which failed because filesystem could shut down with a | -| built data structure, which would be live after recovery finished. | -| | -| The second edition solved the half-rebuilt structure problem by storing | -| everything in memory, but frequently ran the system out of memory. | -| | -| The third edition solved the OOM problem by using linked lists, but the | -| memory overhead of the list pointers was extreme. | -+--------------------------------------------------------------------------+ - -xfile Access Models -``````````````````` - -A survey of the intended uses of xfiles suggested these use cases: - -1. Arrays of fixed-sized records (space management btrees, directory and - extended attribute entries) - -2. Sparse arrays of fixed-sized records (quotas and link counts) - -3. Large binary objects (BLOBs) of variable sizes (directory and extended - attribute names and values) - -4. Staging btrees in memory (reverse mapping btrees) - -5. Arbitrary contents (realtime space management) - -To support the first four use cases, high level data structures wrap the xfile -to share functionality between online fsck functions. -The rest of this section discusses the interfaces that the xfile presents to -four of those five higher level data structures. -The fifth use case is discussed in the :ref:`realtime summary ` case -study. 
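Since an xfile is, at bottom, a kernel-private tmpfs file, its construction can
be sketched in a few lines.
The helper below illustrates the idea rather than the actual ``xfile_create``
implementation; the flag and mode choices are assumptions.

.. code-block:: c

    #include <linux/fs.h>
    #include <linux/mm.h>
    #include <linux/shmem_fs.h>

    /* Illustrative sketch: create a kernel-private, pageable backing file. */
    static struct file *example_xfile_create(const char *description,
                                             loff_t isize)
    {
            struct file *filp;

            filp = shmem_file_setup(description, isize, VM_NORESERVE);
            if (IS_ERR(filp))
                    return filp;

            /* Keep the security apparatus and userspace away from this file. */
            filp->f_mode |= FMODE_PREAD | FMODE_PWRITE | FMODE_NOCMTIME;
            file_inode(filp)->i_flags |= S_PRIVATE;
            return filp;
    }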
- -The most general storage interface supported by the xfile enables the reading -and writing of arbitrary quantities of data at arbitrary offsets in the xfile. -This capability is provided by ``xfile_pread`` and ``xfile_pwrite`` functions, -which behave similarly to their userspace counterparts. -XFS is very record-based, which suggests that the ability to load and store -complete records is important. -To support these cases, a pair of ``xfile_obj_load`` and ``xfile_obj_store`` -functions are provided to read and persist objects into an xfile. -They are internally the same as pread and pwrite, except that they treat any -error as an out of memory error. -For online repair, squashing error conditions in this manner is an acceptable -behavior because the only reaction is to abort the operation back to userspace. -All five xfile usecases can be serviced by these four functions. - -However, no discussion of file access idioms is complete without answering the -question, "But what about mmap?" -It is convenient to access storage directly with pointers, just like userspace -code does with regular memory. -Online fsck must not drive the system into OOM conditions, which means that -xfiles must be responsive to memory reclamation. -tmpfs can only push a pagecache folio to the swap cache if the folio is neither -pinned nor locked, which means the xfile must not pin too many folios. - -Short term direct access to xfile contents is done by locking the pagecache -folio and mapping it into kernel address space. -Programmatic access (e.g. pread and pwrite) uses this mechanism. -Folio locks are not supposed to be held for long periods of time, so long -term direct access to xfile contents is done by bumping the folio refcount, -mapping it into kernel address space, and dropping the folio lock. -These long term users *must* be responsive to memory reclaim by hooking into -the shrinker infrastructure to know when to release folios. - -The ``xfile_get_page`` and ``xfile_put_page`` functions are provided to -retrieve the (locked) folio that backs part of an xfile and to release it. -The only code to use these folio lease functions are the xfarray -:ref:`sorting` algorithms and the :ref:`in-memory -btrees`. - -xfile Access Coordination -````````````````````````` - -For security reasons, xfiles must be owned privately by the kernel. -They are marked ``S_PRIVATE`` to prevent interference from the security system, -must never be mapped into process file descriptor tables, and their pages must -never be mapped into userspace processes. - -To avoid locking recursion issues with the VFS, all accesses to the shmfs file -are performed by manipulating the page cache directly. -xfile writers call the ``->write_begin`` and ``->write_end`` functions of the -xfile's address space to grab writable pages, copy the caller's buffer into the -page, and release the pages. -xfile readers call ``shmem_read_mapping_page_gfp`` to grab pages directly -before copying the contents into the caller's buffer. -In other words, xfiles ignore the VFS read and write code paths to avoid -having to create a dummy ``struct kiocb`` and to avoid taking inode and -freeze locks. -tmpfs cannot be frozen, and xfiles must not be exposed to userspace. - -If an xfile is shared between threads to stage repairs, the caller must provide -its own locks to coordinate access. 
-For example, if a scrub function stores scan results in an xfile and needs -other threads to provide updates to the scanned data, the scrub function must -provide a lock for all threads to share. - -.. _xfarray: - -Arrays of Fixed-Sized Records -````````````````````````````` - -In XFS, each type of indexed space metadata (free space, inodes, reference -counts, file fork space, and reverse mappings) consists of a set of fixed-size -records indexed with a classic B+ tree. -Directories have a set of fixed-size dirent records that point to the names, -and extended attributes have a set of fixed-size attribute keys that point to -names and values. -Quota counters and file link counters index records with numbers. -During a repair, scrub needs to stage new records during the gathering step and -retrieve them during the btree building step. - -Although this requirement can be satisfied by calling the read and write -methods of the xfile directly, it is simpler for callers for there to be a -higher level abstraction to take care of computing array offsets, to provide -iterator functions, and to deal with sparse records and sorting. -The ``xfarray`` abstraction presents a linear array for fixed-size records atop -the byte-accessible xfile. - -.. _xfarray_access_patterns: - -Array Access Patterns -^^^^^^^^^^^^^^^^^^^^^ - -Array access patterns in online fsck tend to fall into three categories. -Iteration of records is assumed to be necessary for all cases and will be -covered in the next section. - -The first type of caller handles records that are indexed by position. -Gaps may exist between records, and a record may be updated multiple times -during the collection step. -In other words, these callers want a sparse linearly addressed table file. -The typical use case are quota records or file link count records. -Access to array elements is performed programmatically via ``xfarray_load`` and -``xfarray_store`` functions, which wrap the similarly-named xfile functions to -provide loading and storing of array elements at arbitrary array indices. -Gaps are defined to be null records, and null records are defined to be a -sequence of all zero bytes. -Null records are detected by calling ``xfarray_element_is_null``. -They are created either by calling ``xfarray_unset`` to null out an existing -record or by never storing anything to an array index. - -The second type of caller handles records that are not indexed by position -and do not require multiple updates to a record. -The typical use case here is rebuilding space btrees and key/value btrees. -These callers can add records to the array without caring about array indices -via the ``xfarray_append`` function, which stores a record at the end of the -array. -For callers that require records to be presentable in a specific order (e.g. -rebuilding btree data), the ``xfarray_sort`` function can arrange the sorted -records; this function will be covered later. - -The third type of caller is a bag, which is useful for counting records. -The typical use case here is constructing space extent reference counts from -reverse mapping information. -Records can be put in the bag in any order, they can be removed from the bag -at any time, and uniqueness of records is left to callers. -The ``xfarray_store_anywhere`` function is used to insert a record in any -null record slot in the bag; and the ``xfarray_unset`` function removes a -record from the bag. - -The proposed patchset is the -`big in-memory array -`_. 
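For example, the append-then-sort pattern used for btree rebuilds might look
like the sketch below.
The record type is the reverse mapping record shown earlier; the comparison
helper and the zero flags argument to ``xfarray_sort`` are simplifying
assumptions.

.. code-block:: c

    /* Stage one observed rmap record after the last element of the array. */
    static int example_stage_rmap(struct xfarray *array,
                                  const struct xfs_rmap_irec *rec)
    {
            return xfarray_append(array, rec);
    }

    /* Order records by physical start block before bulk loading. */
    static int example_cmp_rmap(const void *a, const void *b)
    {
            const struct xfs_rmap_irec *ra = a;
            const struct xfs_rmap_irec *rb = b;

            if (ra->rm_startblock > rb->rm_startblock)
                    return 1;
            if (ra->rm_startblock < rb->rm_startblock)
                    return -1;
            return 0;
    }

    /* Called once the gathering scan is complete. */
    static int example_sort_rmaps(struct xfarray *array)
    {
            return xfarray_sort(array, example_cmp_rmap, 0);
    }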
- -Iterating Array Elements -^^^^^^^^^^^^^^^^^^^^^^^^ - -Most users of the xfarray require the ability to iterate the records stored in -the array. -Callers can probe every possible array index with the following: - -.. code-block:: c - - xfarray_idx_t i; - foreach_xfarray_idx(array, i) { - xfarray_load(array, i, &rec); - - /* do something with rec */ - } - -All users of this idiom must be prepared to handle null records or must already -know that there aren't any. - -For xfarray users that want to iterate a sparse array, the ``xfarray_iter`` -function ignores indices in the xfarray that have never been written to by -calling ``xfile_seek_data`` (which internally uses ``SEEK_DATA``) to skip areas -of the array that are not populated with memory pages. -Once it finds a page, it will skip the zeroed areas of the page. - -.. code-block:: c - - xfarray_idx_t i = XFARRAY_CURSOR_INIT; - while ((ret = xfarray_iter(array, &i, &rec)) == 1) { - /* do something with rec */ - } - -.. _xfarray_sort: - -Sorting Array Elements -^^^^^^^^^^^^^^^^^^^^^^ - -During the fourth demonstration of online repair, a community reviewer remarked -that for performance reasons, online repair ought to load batches of records -into btree record blocks instead of inserting records into a new btree one at a -time. -The btree insertion code in XFS is responsible for maintaining correct ordering -of the records, so naturally the xfarray must also support sorting the record -set prior to bulk loading. - -Case Study: Sorting xfarrays -~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -The sorting algorithm used in the xfarray is actually a combination of adaptive -quicksort and a heapsort subalgorithm in the spirit of -`Sedgewick `_ and -`pdqsort `_, with customizations for the Linux -kernel. -To sort records in a reasonably short amount of time, ``xfarray`` takes -advantage of the binary subpartitioning offered by quicksort, but it also uses -heapsort to hedge against performance collapse if the chosen quicksort pivots -are poor. -Both algorithms are (in general) O(n * lg(n)), but there is a wide performance -gulf between the two implementations. - -The Linux kernel already contains a reasonably fast implementation of heapsort. -It only operates on regular C arrays, which limits the scope of its usefulness. -There are two key places where the xfarray uses it: - -* Sorting any record subset backed by a single xfile page. - -* Loading a small number of xfarray records from potentially disparate parts - of the xfarray into a memory buffer, and sorting the buffer. - -In other words, ``xfarray`` uses heapsort to constrain the nested recursion of -quicksort, thereby mitigating quicksort's worst runtime behavior. - -Choosing a quicksort pivot is a tricky business. -A good pivot splits the set to sort in half, leading to the divide and conquer -behavior that is crucial to O(n * lg(n)) performance. -A poor pivot barely splits the subset at all, leading to O(n\ :sup:`2`) -runtime. -The xfarray sort routine tries to avoid picking a bad pivot by sampling nine -records into a memory buffer and using the kernel heapsort to identify the -median of the nine. - -Most modern quicksort implementations employ Tukey's "ninther" to select a -pivot from a classic C array. -Typical ninther implementations pick three unique triads of records, sort each -of the triads, and then sort the middle value of each triad to determine the -ninther value. -As stated previously, however, xfile accesses are not entirely cheap. 
-It turned out to be much more performant to read the nine elements into a -memory buffer, run the kernel's in-memory heapsort on the buffer, and choose -the 4th element of that buffer as the pivot. -Tukey's ninthers are described in J. W. Tukey, `The ninther, a technique for -low-effort robust (resistant) location in large samples`, in *Contributions to -Survey Sampling and Applied Statistics*, edited by H. David, (Academic Press, -1978), pp. 251–257. - -The partitioning of quicksort is fairly textbook -- rearrange the record -subset around the pivot, then set up the current and next stack frames to -sort with the larger and the smaller halves of the pivot, respectively. -This keeps the stack space requirements to log2(record count). - -As a final performance optimization, the hi and lo scanning phase of quicksort -keeps examined xfile pages mapped in the kernel for as long as possible to -reduce map/unmap cycles. -Surprisingly, this reduces overall sort runtime by nearly half again after -accounting for the application of heapsort directly onto xfile pages. - -.. _xfblob: - -Blob Storage -```````````` - -Extended attributes and directories add an additional requirement for staging -records: arbitrary byte sequences of finite length. -Each directory entry record needs to store entry name, -and each extended attribute needs to store both the attribute name and value. -The names, keys, and values can consume a large amount of memory, so the -``xfblob`` abstraction was created to simplify management of these blobs -atop an xfile. - -Blob arrays provide ``xfblob_load`` and ``xfblob_store`` functions to retrieve -and persist objects. -The store function returns a magic cookie for every object that it persists. -Later, callers provide this cookie to the ``xblob_load`` to recall the object. -The ``xfblob_free`` function frees a specific blob, and the ``xfblob_truncate`` -function frees them all because compaction is not needed. - -The details of repairing directories and extended attributes will be discussed -in a subsequent section about atomic extent swapping. -However, it should be noted that these repair functions only use blob storage -to cache a small number of entries before adding them to a temporary ondisk -file, which is why compaction is not required. - -The proposed patchset is at the start of the -`extended attribute repair -`_ series. - -.. _xfbtree: - -In-Memory B+Trees -````````````````` - -The chapter about :ref:`secondary metadata` mentioned that -checking and repairing of secondary metadata commonly requires coordination -between a live metadata scan of the filesystem and writer threads that are -updating that metadata. -Keeping the scan data up to date requires requires the ability to propagate -metadata updates from the filesystem into the data being collected by the scan. -This *can* be done by appending concurrent updates into a separate log file and -applying them before writing the new metadata to disk, but this leads to -unbounded memory consumption if the rest of the system is very busy. -Another option is to skip the side-log and commit live updates from the -filesystem directly into the scan data, which trades more overhead for a lower -maximum memory requirement. -In both cases, the data structure holding the scan results must support indexed -access to perform well. - -Given that indexed lookups of scan data is required for both strategies, online -fsck employs the second strategy of committing live updates directly into -scan data. 
-Because xfarrays are not indexed and do not enforce record ordering, they -are not suitable for this task. -Conveniently, however, XFS has a library to create and maintain ordered reverse -mapping records: the existing rmap btree code! -If only there was a means to create one in memory. - -Recall that the :ref:`xfile ` abstraction represents memory pages as a -regular file, which means that the kernel can create byte or block addressable -virtual address spaces at will. -The XFS buffer cache specializes in abstracting IO to block-oriented address -spaces, which means that adaptation of the buffer cache to interface with -xfiles enables reuse of the entire btree library. -Btrees built atop an xfile are collectively known as ``xfbtrees``. -The next few sections describe how they actually work. - -The proposed patchset is the -`in-memory btree -`_ -series. - -Using xfiles as a Buffer Cache Target -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Two modifications are necessary to support xfiles as a buffer cache target. -The first is to make it possible for the ``struct xfs_buftarg`` structure to -host the ``struct xfs_buf`` rhashtable, because normally those are held by a -per-AG structure. -The second change is to modify the buffer ``ioapply`` function to "read" cached -pages from the xfile and "write" cached pages back to the xfile. -Multiple access to individual buffers is controlled by the ``xfs_buf`` lock, -since the xfile does not provide any locking on its own. -With this adaptation in place, users of the xfile-backed buffer cache use -exactly the same APIs as users of the disk-backed buffer cache. -The separation between xfile and buffer cache implies higher memory usage since -they do not share pages, but this property could some day enable transactional -updates to an in-memory btree. -Today, however, it simply eliminates the need for new code. - -Space Management with an xfbtree -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Space management for an xfile is very simple -- each btree block is one memory -page in size. -These blocks use the same header format as an on-disk btree, but the in-memory -block verifiers ignore the checksums, assuming that xfile memory is no more -corruption-prone than regular DRAM. -Reusing existing code here is more important than absolute memory efficiency. - -The very first block of an xfile backing an xfbtree contains a header block. -The header describes the owner, height, and the block number of the root -xfbtree block. - -To allocate a btree block, use ``xfile_seek_data`` to find a gap in the file. -If there are no gaps, create one by extending the length of the xfile. -Preallocate space for the block with ``xfile_prealloc``, and hand back the -location. -To free an xfbtree block, use ``xfile_discard`` (which internally uses -``FALLOC_FL_PUNCH_HOLE``) to remove the memory page from the xfile. - -Populating an xfbtree -^^^^^^^^^^^^^^^^^^^^^ - -An online fsck function that wants to create an xfbtree should proceed as -follows: - -1. Call ``xfile_create`` to create an xfile. - -2. Call ``xfs_alloc_memory_buftarg`` to create a buffer cache target structure - pointing to the xfile. - -3. Pass the buffer cache target, buffer ops, and other information to - ``xfbtree_create`` to write an initial tree header and root block to the - xfile. - Each btree type should define a wrapper that passes necessary arguments to - the creation function. - For example, rmap btrees define ``xfs_rmapbt_mem_create`` to take care of - all the necessary details for callers. 
- A ``struct xfbtree`` object will be returned. - -4. Pass the xfbtree object to the btree cursor creation function for the - btree type. - Following the example above, ``xfs_rmapbt_mem_cursor`` takes care of this - for callers. - -5. Pass the btree cursor to the regular btree functions to make queries against - and to update the in-memory btree. - For example, a btree cursor for an rmap xfbtree can be passed to the - ``xfs_rmap_*`` functions just like any other btree cursor. - See the :ref:`next section` for information on dealing with - xfbtree updates that are logged to a transaction. - -6. When finished, delete the btree cursor, destroy the xfbtree object, free the - buffer target, and the destroy the xfile to release all resources. - -.. _xfbtree_commit: - -Committing Logged xfbtree Buffers -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Although it is a clever hack to reuse the rmap btree code to handle the staging -structure, the ephemeral nature of the in-memory btree block storage presents -some challenges of its own. -The XFS transaction manager must not commit buffer log items for buffers backed -by an xfile because the log format does not understand updates for devices -other than the data device. -An ephemeral xfbtree probably will not exist by the time the AIL checkpoints -log transactions back into the filesystem, and certainly won't exist during -log recovery. -For these reasons, any code updating an xfbtree in transaction context must -remove the buffer log items from the transaction and write the updates into the -backing xfile before committing or cancelling the transaction. - -The ``xfbtree_trans_commit`` and ``xfbtree_trans_cancel`` functions implement -this functionality as follows: - -1. Find each buffer log item whose buffer targets the xfile. - -2. Record the dirty/ordered status of the log item. - -3. Detach the log item from the buffer. - -4. Queue the buffer to a special delwri list. - -5. Clear the transaction dirty flag if the only dirty log items were the ones - that were detached in step 3. - -6. Submit the delwri list to commit the changes to the xfile, if the updates - are being committed. - -After removing xfile logged buffers from the transaction in this manner, the -transaction can be committed or cancelled. - -Bulk Loading of Ondisk B+Trees ------------------------------- - -As mentioned previously, early iterations of online repair built new btree -structures by creating a new btree and adding observations individually. -Loading a btree one record at a time had a slight advantage of not requiring -the incore records to be sorted prior to commit, but was very slow and leaked -blocks if the system went down during a repair. -Loading records one at a time also meant that repair could not control the -loading factor of the blocks in the new btree. - -Fortunately, the venerable ``xfs_repair`` tool had a more efficient means for -rebuilding a btree index from a collection of records -- bulk btree loading. -This was implemented rather inefficiently code-wise, since ``xfs_repair`` -had separate copy-pasted implementations for each btree type. - -To prepare for online fsck, each of the four bulk loaders were studied, notes -were taken, and the four were refactored into a single generic btree bulk -loading mechanism. -Those notes in turn have been refreshed and are presented below. - -Geometry Computation -```````````````````` - -The zeroth step of bulk loading is to assemble the entire record set that will -be stored in the new btree, and sort the records. 
-Next, call ``xfs_btree_bload_compute_geometry`` to compute the shape of the -btree from the record set, the type of btree, and any load factor preferences. -This information is required for resource reservation. - -First, the geometry computation computes the minimum and maximum records that -will fit in a leaf block from the size of a btree block and the size of the -block header. -Roughly speaking, the maximum number of records is:: - - maxrecs = (block_size - header_size) / record_size - -The XFS design specifies that btree blocks should be merged when possible, -which means the minimum number of records is half of maxrecs:: - - minrecs = maxrecs / 2 - -The next variable to determine is the desired loading factor. -This must be at least minrecs and no more than maxrecs. -Choosing minrecs is undesirable because it wastes half the block. -Choosing maxrecs is also undesirable because adding a single record to each -newly rebuilt leaf block will cause a tree split, which causes a noticeable -drop in performance immediately afterwards. -The default loading factor was chosen to be 75% of maxrecs, which provides a -reasonably compact structure without any immediate split penalties:: - - default_load_factor = (maxrecs + minrecs) / 2 - -If space is tight, the loading factor will be set to maxrecs to try to avoid -running out of space:: - - leaf_load_factor = enough space ? default_load_factor : maxrecs - -Load factor is computed for btree node blocks using the combined size of the -btree key and pointer as the record size:: - - maxrecs = (block_size - header_size) / (key_size + ptr_size) - minrecs = maxrecs / 2 - node_load_factor = enough space ? default_load_factor : maxrecs - -Once that's done, the number of leaf blocks required to store the record set -can be computed as:: - - leaf_blocks = ceil(record_count / leaf_load_factor) - -The number of node blocks needed to point to the next level down in the tree -is computed as:: - - n_blocks = (n == 0 ? leaf_blocks : node_blocks[n]) - node_blocks[n + 1] = ceil(n_blocks / node_load_factor) - -The entire computation is performed recursively until the current level only -needs one block. -The resulting geometry is as follows: - -- For AG-rooted btrees, this level is the root level, so the height of the new - tree is ``level + 1`` and the space needed is the summation of the number of - blocks on each level. - -- For inode-rooted btrees where the records in the top level do not fit in the - inode fork area, the height is ``level + 2``, the space needed is the - summation of the number of blocks on each level, and the inode fork points to - the root block. - -- For inode-rooted btrees where the records in the top level can be stored in - the inode fork area, then the root block can be stored in the inode, the - height is ``level + 1``, and the space needed is one less than the summation - of the number of blocks on each level. - This only becomes relevant when non-bmap btrees gain the ability to root in - an inode, which is a future patchset and only included here for completeness. - -.. _newbt: - -Reserving New B+Tree Blocks -``````````````````````````` - -Once repair knows the number of blocks needed for the new btree, it allocates -those blocks using the free space information. -Each reserved extent is tracked separately by the btree builder state data. 
-To improve crash resilience, the reservation code also logs an Extent Freeing
-Intent (EFI) item in the same transaction as each space allocation and attaches
-its in-memory ``struct xfs_extent_free_item`` object to the space reservation.
-If the system goes down, log recovery will use the unfinished EFIs to free the
-unused space, leaving the filesystem unchanged.
-
-Each time the btree builder claims a block for the btree from a reserved
-extent, it updates the in-memory reservation to reflect the claimed space.
-Block reservation tries to allocate as much contiguous space as possible to
-reduce the number of EFIs in play.
-
-While repair is writing these new btree blocks, the EFIs created for the space
-reservations pin the tail of the ondisk log.
-It's possible that other parts of the system will remain busy and push the head
-of the log towards the pinned tail.
-To avoid livelocking the filesystem, the EFIs must not pin the tail of the log
-for too long.
-To alleviate this problem, the dynamic relogging capability of the deferred ops
-mechanism is reused here to commit a transaction at the log head containing an
-EFD for the old EFI and a new EFI at the head.
-This enables the log to release the old EFI to keep the log moving forwards.
-
-EFIs have a role to play during the commit and reaping phases; please see the
-next section and the section about :ref:`reaping` for more details.
-
-Proposed patchsets are the
-`bitmap rework
-`_
-and the
-`preparation for bulk loading btrees
-`_.
-
-
-Writing the New Tree
-````````````````````
-
-This part is pretty simple -- the btree builder (``xfs_btree_bulkload``) claims
-a block from the reserved list, writes the new btree block header, fills the
-rest of the block with records, and adds the new leaf block to a list of
-written blocks::
-
- ┌────┐
- │leaf│
- │RRR │
- └────┘
-
-Sibling pointers are set every time a new block is added to the level::
-
- ┌────┐ ┌────┐ ┌────┐ ┌────┐
- │leaf│→│leaf│→│leaf│→│leaf│
- │RRR │←│RRR │←│RRR │←│RRR │
- └────┘ └────┘ └────┘ └────┘
-
-When it finishes writing the record leaf blocks, it moves on to the node
-blocks.
-To fill a node block, it walks each block in the next level down in the tree
-to compute the relevant keys and write them into the parent node::
-
- ┌────┐ ┌────┐
- │node│──────→│node│
- │PP │←──────│PP │
- └────┘ └────┘
- ↙ ↘ ↙ ↘
- ┌────┐ ┌────┐ ┌────┐ ┌────┐
- │leaf│→│leaf│→│leaf│→│leaf│
- │RRR │←│RRR │←│RRR │←│RRR │
- └────┘ └────┘ └────┘ └────┘
-
-When it reaches the root level, it is ready to commit the new btree!::
-
- ┌─────────┐
- │ root │
- │ PP │
- └─────────┘
- ↙ ↘
- ┌────┐ ┌────┐
- │node│──────→│node│
- │PP │←──────│PP │
- └────┘ └────┘
- ↙ ↘ ↙ ↘
- ┌────┐ ┌────┐ ┌────┐ ┌────┐
- │leaf│→│leaf│→│leaf│→│leaf│
- │RRR │←│RRR │←│RRR │←│RRR │
- └────┘ └────┘ └────┘ └────┘
-
-The first step to commit the new btree is to persist the btree blocks to disk
-synchronously.
-This is a little complicated because a new btree block could have been freed
-in the recent past, so the builder must use ``xfs_buf_delwri_queue_here`` to
-remove the (stale) buffer from the AIL list before it can write the new blocks
-to disk.
-Blocks are queued for IO using a delwri list and written in one large batch
-with ``xfs_buf_delwri_submit``.
-
-Once the new blocks have been persisted to disk, control returns to the
-individual repair function that called the bulk loader.
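-
-To make the sequence concrete, the skeleton of a repair function that drives
-the bulk loader looks roughly like this (error handling omitted; the
-``xrep_newbt_*`` helpers and structure fields are illustrative names for the
-reservation bookkeeping described above, not exact kernel interfaces)::
-
-    /* Size the new btree from the sorted record set. */
-    xfs_btree_bload_compute_geometry(cur, &bload, nr_records);
-
-    /* Reserve blocks for the new btree; each reservation logs an EFI. */
-    xrep_newbt_alloc_blocks(&newbt, bload.nr_blocks);
-
-    /* Claim reserved blocks, fill them with records, queue them for writing. */
-    xfs_btree_bload(cur, &bload, records);
-
-    /* Persist all of the new blocks before exposing the new root. */
-    xfs_buf_delwri_submit(&newbt.buffers);
-
-    /* Commit the new root and log EFDs for the consumed reservations. */
-    xrep_newbt_commit(&newbt, tp);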
-
-The repair function must log the location of the new root in a transaction,
-clean up the space reservations that were made for the new btree, and reap the
-old metadata blocks:
-
-1. Commit the location of the new btree root.
-
-2. For each incore reservation:
-
- a. Log Extent Freeing Done (EFD) items for all the space that was consumed
- by the btree builder. The new EFDs must point to the EFIs attached to
- the reservation to prevent log recovery from freeing the new blocks.
-
- b. For unclaimed portions of incore reservations, create a regular deferred
- extent free work item to free the unused space later in the
- transaction chain.
-
- c. The EFDs and EFIs logged in steps 2a and 2b must not overrun the
- reservation of the committing transaction.
- If the btree loading code suspects this might be about to happen, it must
- call ``xrep_defer_finish`` to clear out the deferred work and obtain a
- fresh transaction.
-
-3. Clear out the deferred work a second time to finish the commit and clean
- the repair transaction.
-
-The transaction rolling in steps 2c and 3 represents a weakness in the repair
-algorithm, because a log flush and a crash before the end of the reap step can
-result in space leaking.
-Online repair functions minimize the chances of this occurring by using very
-large transactions, each of which can accommodate many thousands of block
-freeing instructions.
-Repair moves on to reaping the old blocks, which will be presented in a
-subsequent :ref:`section` after a few case studies of bulk loading.
-
-Case Study: Rebuilding the Inode Index
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-The high level process to rebuild the inode index btree is:
-
-1. Walk the reverse mapping records to generate ``struct xfs_inobt_rec``
- records from the inode chunk information and a bitmap of the old inode btree
- blocks.
-
-2. Append the records to an xfarray in inode order.
-
-3. Use the ``xfs_btree_bload_compute_geometry`` function to compute the number
- of blocks needed for the inode btree.
- If the free space inode btree is enabled, call it again to estimate the
- geometry of the finobt.
-
-4. Allocate the number of blocks computed in the previous step.
-
-5. Use ``xfs_btree_bload`` to write the xfarray records to btree blocks and
- generate the internal node blocks.
- If the free space inode btree is enabled, call it again to load the finobt.
-
-6. Commit the location of the new btree root block(s) to the AGI.
-
-7. Reap the old btree blocks using the bitmap created in step 1.
-
-Details are as follows.
-
-The inode btree maps inumbers to the ondisk location of the associated
-inode records, which means that the inode btrees can be rebuilt from the
-reverse mapping information.
-Reverse mapping records with an owner of ``XFS_RMAP_OWN_INOBT`` mark the
-location of the old inode btree blocks.
-Each reverse mapping record with an owner of ``XFS_RMAP_OWN_INODES`` marks the
-location of at least one inode cluster buffer.
-A cluster is the smallest number of ondisk inodes that can be allocated or
-freed in a single transaction; it is never smaller than 1 fs block or 4 inodes.
-
-For the space represented by each inode cluster, ensure that there are no
-records in the free space btrees nor any records in the reference count btree.
-If there are, the space metadata inconsistencies are reason enough to abort the
-operation.
-Otherwise, read each cluster buffer to check that its contents appear to be -ondisk inodes and to decide if the file is allocated -(``xfs_dinode.i_mode != 0``) or free (``xfs_dinode.i_mode == 0``). -Accumulate the results of successive inode cluster buffer reads until there is -enough information to fill a single inode chunk record, which is 64 consecutive -numbers in the inumber keyspace. -If the chunk is sparse, the chunk record may include holes. - -Once the repair function accumulates one chunk's worth of data, it calls -``xfarray_append`` to add the inode btree record to the xfarray. -This xfarray is walked twice during the btree creation step -- once to populate -the inode btree with all inode chunk records, and a second time to populate the -free inode btree with records for chunks that have free non-sparse inodes. -The number of records for the inode btree is the number of xfarray records, -but the record count for the free inode btree has to be computed as inode chunk -records are stored in the xfarray. - -The proposed patchset is the -`AG btree repair -`_ -series. - -Case Study: Rebuilding the Space Reference Counts -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Reverse mapping records are used to rebuild the reference count information. -Reference counts are required for correct operation of copy on write for shared -file data. -Imagine the reverse mapping entries as rectangles representing extents of -physical blocks, and that the rectangles can be laid down to allow them to -overlap each other. -From the diagram below, it is apparent that a reference count record must start -or end wherever the height of the stack changes. -In other words, the record emission stimulus is level-triggered:: - - █ ███ - ██ █████ ████ ███ ██████ - ██ ████ ███████████ ████ █████████ - ████████████████████████████████ ███████████ - ^ ^ ^^ ^^ ^ ^^ ^^^ ^^^^ ^ ^^ ^ ^ ^ - 2 1 23 21 3 43 234 2123 1 01 2 3 0 - -The ondisk reference count btree does not store the refcount == 0 cases because -the free space btree already records which blocks are free. -Extents being used to stage copy-on-write operations should be the only records -with refcount == 1. -Single-owner file blocks aren't recorded in either the free space or the -reference count btrees. - -The high level process to rebuild the reference count btree is: - -1. Walk the reverse mapping records to generate ``struct xfs_refcount_irec`` - records for any space having more than one reverse mapping and add them to - the xfarray. - Any records owned by ``XFS_RMAP_OWN_COW`` are also added to the xfarray - because these are extents allocated to stage a copy on write operation and - are tracked in the refcount btree. - - Use any records owned by ``XFS_RMAP_OWN_REFC`` to create a bitmap of old - refcount btree blocks. - -2. Sort the records in physical extent order, putting the CoW staging extents - at the end of the xfarray. - This matches the sorting order of records in the refcount btree. - -3. Use the ``xfs_btree_bload_compute_geometry`` function to compute the number - of blocks needed for the new tree. - -4. Allocate the number of blocks computed in the previous step. - -5. Use ``xfs_btree_bload`` to write the xfarray records to btree blocks and - generate the internal node blocks. - -6. Commit the location of new btree root block to the AGF. - -7. Reap the old btree blocks using the bitmap created in step 1. - -Details are as follows; the same algorithm is used by ``xfs_repair`` to -generate refcount information from reverse mapping records. 
- -- Until the reverse mapping btree runs out of records: - - - Retrieve the next record from the btree and put it in a bag. - - - Collect all records with the same starting block from the btree and put - them in the bag. - - - While the bag isn't empty: - - - Among the mappings in the bag, compute the lowest block number where the - reference count changes. - This position will be either the starting block number of the next - unprocessed reverse mapping or the next block after the shortest mapping - in the bag. - - - Remove all mappings from the bag that end at this position. - - - Collect all reverse mappings that start at this position from the btree - and put them in the bag. - - - If the size of the bag changed and is greater than one, create a new - refcount record associating the block number range that we just walked to - the size of the bag. - -The bag-like structure in this case is a type 2 xfarray as discussed in the -:ref:`xfarray access patterns` section. -Reverse mappings are added to the bag using ``xfarray_store_anywhere`` and -removed via ``xfarray_unset``. -Bag members are examined through ``xfarray_iter`` loops. - -The proposed patchset is the -`AG btree repair -`_ -series. - -Case Study: Rebuilding File Fork Mapping Indices -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -The high level process to rebuild a data/attr fork mapping btree is: - -1. Walk the reverse mapping records to generate ``struct xfs_bmbt_rec`` - records from the reverse mapping records for that inode and fork. - Append these records to an xfarray. - Compute the bitmap of the old bmap btree blocks from the ``BMBT_BLOCK`` - records. - -2. Use the ``xfs_btree_bload_compute_geometry`` function to compute the number - of blocks needed for the new tree. - -3. Sort the records in file offset order. - -4. If the extent records would fit in the inode fork immediate area, commit the - records to that immediate area and skip to step 8. - -5. Allocate the number of blocks computed in the previous step. - -6. Use ``xfs_btree_bload`` to write the xfarray records to btree blocks and - generate the internal node blocks. - -7. Commit the new btree root block to the inode fork immediate area. - -8. Reap the old btree blocks using the bitmap created in step 1. - -There are some complications here: -First, it's possible to move the fork offset to adjust the sizes of the -immediate areas if the data and attr forks are not both in BMBT format. -Second, if there are sufficiently few fork mappings, it may be possible to use -EXTENTS format instead of BMBT, which may require a conversion. -Third, the incore extent map must be reloaded carefully to avoid disturbing -any delayed allocation extents. - -The proposed patchset is the -`file mapping repair -`_ -series. - -.. _reaping: - -Reaping Old Metadata Blocks ---------------------------- - -Whenever online fsck builds a new data structure to replace one that is -suspect, there is a question of how to find and dispose of the blocks that -belonged to the old structure. -The laziest method of course is not to deal with them at all, but this slowly -leads to service degradations as space leaks out of the filesystem. -Hopefully, someone will schedule a rebuild of the free space information to -plug all those leaks. -Offline repair rebuilds all space metadata after recording the usage of -the files and directories that it decides not to clear, hence it can build new -structures in the discovered free space and avoid the question of reaping. 
- -As part of a repair, online fsck relies heavily on the reverse mapping records -to find space that is owned by the corresponding rmap owner yet truly free. -Cross referencing rmap records with other rmap records is necessary because -there may be other data structures that also think they own some of those -blocks (e.g. crosslinked trees). -Permitting the block allocator to hand them out again will not push the system -towards consistency. - -For space metadata, the process of finding extents to dispose of generally -follows this format: - -1. Create a bitmap of space used by data structures that must be preserved. - The space reservations used to create the new metadata can be used here if - the same rmap owner code is used to denote all of the objects being rebuilt. - -2. Survey the reverse mapping data to create a bitmap of space owned by the - same ``XFS_RMAP_OWN_*`` number for the metadata that is being preserved. - -3. Use the bitmap disunion operator to subtract (1) from (2). - The remaining set bits represent candidate extents that could be freed. - The process moves on to step 4 below. - -Repairs for file-based metadata such as extended attributes, directories, -symbolic links, quota files and realtime bitmaps are performed by building a -new structure attached to a temporary file and swapping the forks. -Afterward, the mappings in the old file fork are the candidate blocks for -disposal. - -The process for disposing of old extents is as follows: - -4. For each candidate extent, count the number of reverse mapping records for - the first block in that extent that do not have the same rmap owner for the - data structure being repaired. - - - If zero, the block has a single owner and can be freed. - - - If not, the block is part of a crosslinked structure and must not be - freed. - -5. Starting with the next block in the extent, figure out how many more blocks - have the same zero/nonzero other owner status as that first block. - -6. If the region is crosslinked, delete the reverse mapping entry for the - structure being repaired and move on to the next region. - -7. If the region is to be freed, mark any corresponding buffers in the buffer - cache as stale to prevent log writeback. - -8. Free the region and move on. - -However, there is one complication to this procedure. -Transactions are of finite size, so the reaping process must be careful to roll -the transactions to avoid overruns. -Overruns come from two sources: - -a. EFIs logged on behalf of space that is no longer occupied - -b. Log items for buffer invalidations - -This is also a window in which a crash during the reaping process can leak -blocks. -As stated earlier, online repair functions use very large transactions to -minimize the chances of this occurring. - -The proposed patchset is the -`preparation for bulk loading btrees -`_ -series. - -Case Study: Reaping After a Regular Btree Repair -```````````````````````````````````````````````` - -Old reference count and inode btrees are the easiest to reap because they have -rmap records with special owner codes: ``XFS_RMAP_OWN_REFC`` for the refcount -btree, and ``XFS_RMAP_OWN_INOBT`` for the inode and free inode btrees. -Creating a list of extents to reap the old btree blocks is quite simple, -conceptually: - -1. Lock the relevant AGI/AGF header buffers to prevent allocation and frees. - -2. For each reverse mapping record with an rmap owner corresponding to the - metadata structure being rebuilt, set the corresponding range in a bitmap. - -3. 
Walk the current data structures that have the same rmap owner.
- For each block visited, clear that range in the above bitmap.
-
-4. Each set bit in the bitmap represents a block that could be a block from the
- old data structures and hence is a candidate for reaping.
- In other words, ``(rmap_records_owned_by & ~blocks_reachable_by_walk)``
- are the blocks that might be freeable.
-
-If it is possible to maintain the AGF lock throughout the repair (which is the
-common case), then step 2 can be performed at the same time as the reverse
-mapping record walk that creates the records for the new btree.
-
-Case Study: Rebuilding the Free Space Indices
-`````````````````````````````````````````````
-
-The high level process to rebuild the free space indices is:
-
-1. Walk the reverse mapping records to generate ``struct xfs_alloc_rec_incore``
- records from the gaps in the reverse mapping btree.
-
-2. Append the records to an xfarray.
-
-3. Use the ``xfs_btree_bload_compute_geometry`` function to compute the number
- of blocks needed for each new tree.
-
-4. Allocate the number of blocks computed in the previous step from the free
- space information collected.
-
-5. Use ``xfs_btree_bload`` to write the xfarray records to btree blocks and
- generate the internal node blocks for the free space by length index.
- Call it again for the free space by block number index.
-
-6. Commit the locations of the new btree root blocks to the AGF.
-
-7. Reap the old btree blocks by looking for space that is not recorded by the
- reverse mapping btree, the new free space btrees, or the AGFL.
-
-Repairing the free space btrees has three key complications over a regular
-btree repair:
-
-First, free space is not explicitly tracked in the reverse mapping records.
-Hence, the new free space records must be inferred from gaps in the physical
-space component of the keyspace of the reverse mapping btree.
-
-Second, free space repairs cannot use the common btree reservation code because
-new blocks are reserved out of the free space btrees.
-This is impossible when repairing the free space btrees themselves.
-However, repair holds the AGF buffer lock for the duration of the free space
-index reconstruction, so it can use the collected free space information to
-supply the blocks for the new free space btrees.
-It is not necessary to back each reserved extent with an EFI because the new
-free space btrees are constructed in what the ondisk filesystem thinks is
-unowned space.
-However, if reserving blocks for the new btrees from the collected free space
-information changes the number of free space records, repair must re-estimate
-the new free space btree geometry with the new record count until the
-reservation is sufficient.
-As part of committing the new btrees, repair must ensure that reverse mappings
-are created for the reserved blocks and that unused reserved blocks are
-inserted into the free space btrees.
-Deferred rmap and freeing operations are used to ensure that this transition
-is atomic, similar to the other btree repair functions.
-
-Third, finding the blocks to reap after the repair is not overly
-straightforward.
-Blocks for the free space btrees and the reverse mapping btrees are supplied by
-the AGFL.
-Blocks put onto the AGFL have reverse mapping records with the owner
-``XFS_RMAP_OWN_AG``.
-This ownership is retained when blocks move from the AGFL into the free space
-btrees or the reverse mapping btrees.
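-
-All of these reap strategies come down to the same set subtraction.
-A minimal sketch of the computation, using a flat word-array bitmap rather
-than the kernel's bitmap structures, might look like this::
-
-    /* candidates = owned & ~preserved, one machine word at a time */
-    static void
-    compute_reap_candidates(
-            const unsigned long     *owned,
-            const unsigned long     *preserved,
-            unsigned long           *candidates,
-            unsigned int            nr_words)
-    {
-            unsigned int            i;
-
-            for (i = 0; i < nr_words; i++)
-                    candidates[i] = owned[i] & ~preserved[i];
-    }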
-
-When repair walks reverse mapping records to synthesize free space records, it
-creates a bitmap (``ag_owner_bitmap``) of all the space claimed by
-``XFS_RMAP_OWN_AG`` records.
-The repair context maintains a second bitmap corresponding to the rmap btree
-blocks and the AGFL blocks (``rmap_agfl_bitmap``).
-When the walk is complete, the bitmap disunion operation ``(ag_owner_bitmap &
-~rmap_agfl_bitmap)`` computes the extents that are used by the old free space
-btrees.
-These blocks can then be reaped using the methods outlined above.
-
-The proposed patchset is the
-`AG btree repair
-`_
-series.
-
-.. _rmap_reap:
-
-Case Study: Reaping After Repairing Reverse Mapping Btrees
-``````````````````````````````````````````````````````````
-
-Old reverse mapping btrees are less difficult to reap after a repair.
-As mentioned in the previous section, blocks on the AGFL, the two free space
-btree blocks, and the reverse mapping btree blocks all have reverse mapping
-records with ``XFS_RMAP_OWN_AG`` as the owner.
-The full process of gathering reverse mapping records and building a new btree
-is described in the case study of
-:ref:`live rebuilds of rmap data `, but a crucial point from that
-discussion is that the new rmap btree will not contain any records for the old
-rmap btree, nor will the old btree blocks be tracked in the free space btrees.
-The list of candidate reaping blocks is computed by setting the bits
-corresponding to the gaps in the new rmap btree records, and then clearing the
-bits corresponding to extents in the free space btrees and the current AGFL
-blocks.
-The result ``(new_rmapbt_gaps & ~(agfl | bnobt_records))`` is reaped using the
-methods outlined above.
-
-The rest of the process of rebuilding the reverse mapping btree is discussed
-in a separate :ref:`case study`.
-
-The proposed patchset is the
-`AG btree repair
-`_
-series.
-
-Case Study: Rebuilding the AGFL
-```````````````````````````````
-
-The allocation group free block list (AGFL) is repaired as follows:
-
-1. Create a bitmap for all the space that the reverse mapping data claims is
- owned by ``XFS_RMAP_OWN_AG``.
-
-2. Subtract the space used by the two free space btrees and the rmap btree.
-
-3. Subtract any space that the reverse mapping data claims is owned by any
- other owner, to avoid re-adding crosslinked blocks to the AGFL.
-
-4. Once the AGFL is full, reap any leftover blocks.
-
-5. The next operation to fix the freelist will right-size the list.
-
-See `fs/xfs/scrub/agheader_repair.c `_ for more details.
-
-Inode Record Repairs
---------------------
-
-Inode records must be handled carefully, because they have both ondisk records
-("dinodes") and an in-memory ("cached") representation.
-There is a very high potential for cache coherency issues if online fsck is not
-careful to access the ondisk metadata *only* when the ondisk metadata is so
-badly damaged that the filesystem cannot load the in-memory representation.
-When online fsck wants to open a damaged file for scrubbing, it must use
-specialized resource acquisition functions that return either the in-memory
-representation *or* a lock on whichever object is necessary to prevent any
-update to the ondisk location.
-
-The only repairs that should be made to the ondisk inode buffers are whatever
-is necessary to get the in-core structure loaded.
-This means fixing whatever is caught by the inode cluster buffer and inode fork
-verifiers, and retrying the ``iget`` operation.
-If the second ``iget`` fails, the repair has failed.
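-
-In other words, the ondisk half of an inode repair is limited to a "fix the
-verifiers, then retry" sequence, roughly like the following sketch (the
-``xrep_fix_dinode`` helper is an illustrative name, not an exact kernel
-function)::
-
-    error = xfs_iget(mp, tp, ino, XFS_IGET_UNTRUSTED, 0, &ip);
-    if (error == -EFSCORRUPTED || error == -EFSBADCRC) {
-            /* Fix only what the buffer and fork verifiers complain about. */
-            error = xrep_fix_dinode(sc, ino);
-            if (!error)
-                    error = xfs_iget(mp, tp, ino, XFS_IGET_UNTRUSTED, 0, &ip);
-    }
-    if (error)
-            return error;   /* the second iget failed; the repair has failed */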
- -Once the in-memory representation is loaded, repair can lock the inode and can -subject it to comprehensive checks, repairs, and optimizations. -Most inode attributes are easy to check and constrain, or are user-controlled -arbitrary bit patterns; these are both easy to fix. -Dealing with the data and attr fork extent counts and the file block counts is -more complicated, because computing the correct value requires traversing the -forks, or if that fails, leaving the fields invalid and waiting for the fork -fsck functions to run. - -The proposed patchset is the -`inode -`_ -repair series. - -Quota Record Repairs --------------------- - -Similar to inodes, quota records ("dquots") also have both ondisk records and -an in-memory representation, and hence are subject to the same cache coherency -issues. -Somewhat confusingly, both are known as dquots in the XFS codebase. - -The only repairs that should be made to the ondisk quota record buffers are -whatever is necessary to get the in-core structure loaded. -Once the in-memory representation is loaded, the only attributes needing -checking are obviously bad limits and timer values. - -Quota usage counters are checked, repaired, and discussed separately in the -section about :ref:`live quotacheck `. - -The proposed patchset is the -`quota -`_ -repair series. - -.. _fscounters: - -Freezing to Fix Summary Counters --------------------------------- - -Filesystem summary counters track availability of filesystem resources such -as free blocks, free inodes, and allocated inodes. -This information could be compiled by walking the free space and inode indexes, -but this is a slow process, so XFS maintains a copy in the ondisk superblock -that should reflect the ondisk metadata, at least when the filesystem has been -unmounted cleanly. -For performance reasons, XFS also maintains incore copies of those counters, -which are key to enabling resource reservations for active transactions. -Writer threads reserve the worst-case quantities of resources from the -incore counter and give back whatever they don't use at commit time. -It is therefore only necessary to serialize on the superblock when the -superblock is being committed to disk. - -The lazy superblock counter feature introduced in XFS v5 took this even further -by training log recovery to recompute the summary counters from the AG headers, -which eliminated the need for most transactions even to touch the superblock. -The only time XFS commits the summary counters is at filesystem unmount. -To reduce contention even further, the incore counter is implemented as a -percpu counter, which means that each CPU is allocated a batch of blocks from a -global incore counter and can satisfy small allocations from the local batch. - -The high-performance nature of the summary counters makes it difficult for -online fsck to check them, since there is no way to quiesce a percpu counter -while the system is running. -Although online fsck can read the filesystem metadata to compute the correct -values of the summary counters, there's no way to hold the value of a percpu -counter stable, so it's quite possible that the counter will be out of date by -the time the walk is complete. -Earlier versions of online scrub would return to userspace with an incomplete -scan flag, but this is not a satisfying outcome for a system administrator. -For repairs, the in-memory counters must be stabilized while walking the -filesystem metadata to get an accurate reading and install it in the percpu -counter. 
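-
-Stated in terms of the kernel's percpu counter API, even an exact summation is
-stale by the time it returns if writer threads are still running, so a
-comparison like the following sketch proves nothing on a busy system
-(illustrative only, not actual scrub code)::
-
-    icount = percpu_counter_sum(&mp->m_icount);
-    /*
-     * Another CPU may call percpu_counter_add() between the summation
-     * and this comparison, so a mismatch does not prove corruption.
-     */
-    if (icount != expected_icount)
-            xchk_set_incomplete(sc);        /* illustrative */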
- -To satisfy this requirement, online fsck must prevent other programs in the -system from initiating new writes to the filesystem, it must disable background -garbage collection threads, and it must wait for existing writer programs to -exit the kernel. -Once that has been established, scrub can walk the AG free space indexes, the -inode btrees, and the realtime bitmap to compute the correct value of all -four summary counters. -This is very similar to a filesystem freeze, though not all of the pieces are -necessary: - -- The final freeze state is set one higher than ``SB_FREEZE_COMPLETE`` to - prevent other threads from thawing the filesystem, or other scrub threads - from initiating another fscounters freeze. - -- It does not quiesce the log. - -With this code in place, it is now possible to pause the filesystem for just -long enough to check and correct the summary counters. - -+--------------------------------------------------------------------------+ -| **Historical Sidebar**: | -+--------------------------------------------------------------------------+ -| The initial implementation used the actual VFS filesystem freeze | -| mechanism to quiesce filesystem activity. | -| With the filesystem frozen, it is possible to resolve the counter values | -| with exact precision, but there are many problems with calling the VFS | -| methods directly: | -| | -| - Other programs can unfreeze the filesystem without our knowledge. | -| This leads to incorrect scan results and incorrect repairs. | -| | -| - Adding an extra lock to prevent others from thawing the filesystem | -| required the addition of a ``->freeze_super`` function to wrap | -| ``freeze_fs()``. | -| This in turn caused other subtle problems because it turns out that | -| the VFS ``freeze_super`` and ``thaw_super`` functions can drop the | -| last reference to the VFS superblock, and any subsequent access | -| becomes a UAF bug! | -| This can happen if the filesystem is unmounted while the underlying | -| block device has frozen the filesystem. | -| This problem could be solved by grabbing extra references to the | -| superblock, but it felt suboptimal given the other inadequacies of | -| this approach. | -| | -| - The log need not be quiesced to check the summary counters, but a VFS | -| freeze initiates one anyway. | -| This adds unnecessary runtime to live fscounter fsck operations. | -| | -| - Quiescing the log means that XFS flushes the (possibly incorrect) | -| counters to disk as part of cleaning the log. | -| | -| - A bug in the VFS meant that freeze could complete even when | -| sync_filesystem fails to flush the filesystem and returns an error. | -| This bug was fixed in Linux 5.17. | -+--------------------------------------------------------------------------+ - -The proposed patchset is the -`summary counter cleanup -`_ -series. - -Full Filesystem Scans ---------------------- - -Certain types of metadata can only be checked by walking every file in the -entire filesystem to record observations and comparing the observations against -what's recorded on disk. -Like every other type of online repair, repairs are made by writing those -observations to disk in a replacement structure and committing it atomically. -However, it is not practical to shut down the entire filesystem to examine -hundreds of billions of files because the downtime would be excessive. -Therefore, online fsck must build the infrastructure to manage a live scan of -all the files in the filesystem. 
-
-There are two questions that need to be solved to perform a live walk:
-
-- How does scrub manage the scan while it is collecting data?
-
-- How does the scan keep abreast of changes being made to the system by other
- threads?
-
-.. _iscan:
-
-Coordinated Inode Scans
-```````````````````````
-
-In the original Unix filesystems of the 1970s, each directory entry contained
-an index number (*inumber*) which was used as an index into an ondisk array
-(*itable*) of fixed-size records (*inodes*) describing a file's attributes and
-its data block mapping.
-This system is described by J. Lions, `"inode (5659)"
-`_ in *Lions' Commentary on
-UNIX, 6th Edition*, (Dept. of Computer Science, the University of New South
-Wales, November 1977), pp. 18-2; and later by D. Ritchie and K. Thompson,
-`"Implementation of the File System"
-`_, from *The UNIX
-Time-Sharing System*, (The Bell System Technical Journal, July 1978), pp.
-1913-4.
-
-XFS retains most of this design, except now inumbers are search keys over all
-the space in the data section filesystem.
-They form a continuous keyspace that can be expressed as a 64-bit integer,
-though the inodes themselves are sparsely distributed within the keyspace.
-Scans proceed in a linear fashion across the inumber keyspace, starting from
-``0x0`` and ending at ``0xFFFFFFFFFFFFFFFF``.
-Naturally, a scan through a keyspace requires a scan cursor object to track the
-scan progress.
-Because this keyspace is sparse, this cursor contains two parts.
-The first part of this scan cursor object tracks the inode that will be
-examined next; call this the examination cursor.
-Somewhat less obviously, the scan cursor object must also track which parts of
-the keyspace have already been visited, which is critical for deciding if a
-concurrent filesystem update needs to be incorporated into the scan data.
-Call this the visited inode cursor.
-
-Advancing the scan cursor is a multi-step process encapsulated in
-``xchk_iscan_iter``:
-
-1. Lock the AGI buffer of the AG containing the inode pointed to by the visited
- inode cursor.
- This guarantees that inodes in this AG cannot be allocated or freed while
- advancing the cursor.
-
-2. Use the per-AG inode btree to look up the next inumber after the one that
- was just visited, since it may not be keyspace adjacent.
-
-3. If there are no more inodes left in this AG:
-
- a. Move the examination cursor to the point of the inumber keyspace that
- corresponds to the start of the next AG.
-
- b. Adjust the visited inode cursor to indicate that it has "visited" the
- last possible inode in the current AG's inode keyspace.
- XFS inumbers are segmented, so the cursor needs to be marked as having
- visited the entire keyspace up to just before the start of the next AG's
- inode keyspace.
-
- c. Unlock the AGI and return to step 1 if there are unexamined AGs in the
- filesystem.
-
- d. If there are no more AGs to examine, set both cursors to the end of the
- inumber keyspace.
- The scan is now complete.
-
-4. Otherwise, there is at least one more inode to scan in this AG:
-
- a. Move the examination cursor ahead to the next inode marked as allocated
- by the inode btree.
-
- b. Adjust the visited inode cursor to point to the inode just prior to where
- the examination cursor is now.
- Because the scanner holds the AGI buffer lock, no inodes could have been
- created in the part of the inode keyspace that the visited inode cursor
- just advanced.
-
-5. Get the incore inode for the inumber of the examination cursor.
- By maintaining the AGI buffer lock until this point, the scanner knows that
- it was safe to advance the examination cursor across the entire keyspace,
- and that it has stabilized this next inode so that it cannot disappear from
- the filesystem until the scan releases the incore inode.
-
-6. Drop the AGI lock and return the incore inode to the caller.
-
-Online fsck functions scan all files in the filesystem as follows:
-
-1. Start a scan by calling ``xchk_iscan_start``.
-
-2. Advance the scan cursor (``xchk_iscan_iter``) to get the next inode.
- If one is provided:
-
- a. Lock the inode to prevent updates during the scan.
-
- b. Scan the inode.
-
- c. While still holding the inode lock, adjust the visited inode cursor
- (``xchk_iscan_mark_visited``) to point to this inode.
-
- d. Unlock and release the inode.
-
-3. Call ``xchk_iscan_teardown`` to complete the scan.
-
-There are subtleties with the inode cache that complicate grabbing the incore
-inode for the caller.
-First, it is an absolute requirement that the inode metadata be consistent
-enough to load it into the inode cache.
-Second, if the incore inode is stuck in some intermediate state, the scan
-coordinator must release the AGI and push the main filesystem to get the inode
-back into a loadable state.
-
-The proposed patches are the
-`inode scanner
-`_
-series.
-The first user of the new functionality is the
-`online quotacheck
-`_
-series.
-
-Inode Management
-````````````````
-
-In regular filesystem code, references to allocated XFS incore inodes are
-always obtained (``xfs_iget``) outside of transaction context because the
-creation of the incore context for an existing file does not require metadata
-updates.
-However, it is important to note that acquiring references to incore inodes as
-part of file creation must be performed in transaction context because the
-filesystem must ensure the atomicity of the ondisk inode btree index updates
-and the initialization of the actual ondisk inode.
-
-References to incore inodes are always released (``xfs_irele``) outside of
-transaction context because there are a handful of activities that might
-require ondisk updates:
-
-- The VFS may decide to kick off writeback as part of a ``DONTCACHE`` inode
- release.
-
-- Speculative preallocations need to be unreserved.
-
-- An unlinked file may have lost its last reference, in which case the entire
- file must be inactivated, which involves releasing all of its resources in
- the ondisk metadata and freeing the inode.
-
-These activities are collectively called inode inactivation.
-Inactivation has two parts -- the VFS part, which initiates writeback on all
-dirty file pages, and the XFS part, which cleans up XFS-specific information
-and frees the inode if it was unlinked.
-If the inode is unlinked (or unconnected after a file handle operation), the
-kernel drops the inode into the inactivation machinery immediately.
-
-During normal operation, resource acquisition for an update follows this order
-to avoid deadlocks:
-
-1. Inode reference (``iget``).
-
-2. Filesystem freeze protection, if repairing (``mnt_want_write_file``).
-
-3. Inode ``IOLOCK`` (VFS ``i_rwsem``) lock to control file IO.
-
-4. Inode ``MMAPLOCK`` (page cache ``invalidate_lock``) lock for operations that
- can update page cache mappings.
-
-5. Log feature enablement.
-
-6. Transaction log space grant.
-
-7. Space on the data and realtime devices for the transaction.
-
-8. Incore dquot references, if a file is being repaired.
- Note that they are not locked, merely acquired. - -9. Inode ``ILOCK`` for file metadata updates. - -10. AG header buffer locks / Realtime metadata inode ILOCK. - -11. Realtime metadata buffer locks, if applicable. - -12. Extent mapping btree blocks, if applicable. - -Resources are often released in the reverse order, though this is not required. -However, online fsck differs from regular XFS operations because it may examine -an object that normally is acquired in a later stage of the locking order, and -then decide to cross-reference the object with an object that is acquired -earlier in the order. -The next few sections detail the specific ways in which online fsck takes care -to avoid deadlocks. - -iget and irele During a Scrub -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -An inode scan performed on behalf of a scrub operation runs in transaction -context, and possibly with resources already locked and bound to it. -This isn't much of a problem for ``iget`` since it can operate in the context -of an existing transaction, as long as all of the bound resources are acquired -before the inode reference in the regular filesystem. - -When the VFS ``iput`` function is given a linked inode with no other -references, it normally puts the inode on an LRU list in the hope that it can -save time if another process re-opens the file before the system runs out -of memory and frees it. -Filesystem callers can short-circuit the LRU process by setting a ``DONTCACHE`` -flag on the inode to cause the kernel to try to drop the inode into the -inactivation machinery immediately. - -In the past, inactivation was always done from the process that dropped the -inode, which was a problem for scrub because scrub may already hold a -transaction, and XFS does not support nesting transactions. -On the other hand, if there is no scrub transaction, it is desirable to drop -otherwise unused inodes immediately to avoid polluting caches. -To capture these nuances, the online fsck code has a separate ``xchk_irele`` -function to set or clear the ``DONTCACHE`` flag to get the required release -behavior. - -Proposed patchsets include fixing -`scrub iget usage -`_ and -`dir iget usage -`_. - -.. _ilocking: - -Locking Inodes -^^^^^^^^^^^^^^ - -In regular filesystem code, the VFS and XFS will acquire multiple IOLOCK locks -in a well-known order: parent → child when updating the directory tree, and -in numerical order of the addresses of their ``struct inode`` object otherwise. -For regular files, the MMAPLOCK can be acquired after the IOLOCK to stop page -faults. -If two MMAPLOCKs must be acquired, they are acquired in numerical order of -the addresses of their ``struct address_space`` objects. -Due to the structure of existing filesystem code, IOLOCKs and MMAPLOCKs must be -acquired before transactions are allocated. -If two ILOCKs must be acquired, they are acquired in inumber order. - -Inode lock acquisition must be done carefully during a coordinated inode scan. -Online fsck cannot abide these conventions, because for a directory tree -scanner, the scrub process holds the IOLOCK of the file being scanned and it -needs to take the IOLOCK of the file at the other end of the directory link. -If the directory tree is corrupt because it contains a cycle, ``xfs_scrub`` -cannot use the regular inode locking functions and avoid becoming trapped in an -ABBA deadlock. - -Solving both of these problems is straightforward -- any time online fsck -needs to take a second lock of the same class, it uses trylock to avoid an ABBA -deadlock. 
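-
-In code, the pattern looks roughly like this sketch, which drops the first
-lock and backs off whenever the second lock cannot be acquired immediately
-(real scrub code drops and re-acquires *all* held locks)::
-
-    while (!xfs_ilock_nowait(second_ip, XFS_ILOCK_EXCL)) {
-            /* Drop the lock already held so the other thread can progress. */
-            xfs_iunlock(first_ip, XFS_ILOCK_EXCL);
-            if (fatal_signal_pending(current))
-                    return -EINTR;
-            delay(1);
-            xfs_ilock(first_ip, XFS_ILOCK_EXCL);
-    }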
-
-If the trylock fails, scrub drops all inode locks and uses trylock loops to
-(re)acquire all necessary resources.
-Trylock loops enable scrub to check for pending fatal signals, which is how
-scrub avoids deadlocking the filesystem or becoming an unresponsive process.
-However, trylock loops mean that online fsck must be prepared to measure the
-resource being scrubbed before and after the lock cycle to detect changes and
-react accordingly.
-
-.. _dirparent:
-
-Case Study: Finding a Directory Parent
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-Consider the directory parent pointer repair code as an example.
-Online fsck must verify that the dotdot dirent of a directory points up to a
-parent directory, and that the parent directory contains exactly one dirent
-pointing down to the child directory.
-Fully validating this relationship (and repairing it if possible) requires a
-walk of every directory on the filesystem while holding the child locked, and
-while updates to the directory tree are being made.
-The coordinated inode scan provides a way to walk the filesystem without the
-possibility of missing an inode.
-The child directory is kept locked to prevent updates to the dotdot dirent, but
-if the scanner fails to lock a parent, it can drop and relock both the child
-and the prospective parent.
-If the dotdot entry changes while the directory is unlocked, then a move or
-rename operation must have changed the child's parentage, and the scan can
-exit early.
-
-The proposed patchset is the
-`directory repair
-`_
-series.
-
-.. _fshooks:
-
-Filesystem Hooks
-`````````````````
-
-The second piece of support that online fsck functions need during a full
-filesystem scan is the ability to stay informed about updates being made by
-other threads in the filesystem, since comparisons against the past are useless
-in a dynamic environment.
-Two pieces of Linux kernel infrastructure enable online fsck to monitor regular
-filesystem operations: filesystem hooks and :ref:`static keys`.
-
-Filesystem hooks convey information about an ongoing filesystem operation to
-a downstream consumer.
-In this case, the downstream consumer is always an online fsck function.
-Because multiple fsck functions can run in parallel, online fsck uses the Linux
-notifier call chain facility to dispatch updates to any number of interested
-fsck processes.
-Call chains are a dynamic list, which means that they can be configured at
-run time.
-Because these hooks are private to the XFS module, the information passed along
-contains exactly what the checking function needs to update its observations.
-
-The current implementation of XFS hooks uses SRCU notifier chains to reduce the
-impact to highly threaded workloads.
-Regular blocking notifier chains use an rwsem and seem to have a much lower
-overhead for single-threaded applications.
-However, it may turn out that blocking chains and static keys are a more
-performant combination; more study is needed here.
-
-The following pieces are necessary to hook a certain point in the filesystem:
-
-- A ``struct xfs_hooks`` object must be embedded in a convenient place such as
- a well-known incore filesystem object.
-
-- Each hook must define an action code and a structure containing more context
- about the action.
-
-- Hook providers should provide appropriate wrapper functions and structs
- around the ``xfs_hooks`` and ``xfs_hook`` objects to take advantage of type
- checking to ensure correct usage.
- -- A callsite in the regular filesystem code must be chosen to call - ``xfs_hooks_call`` with the action code and data structure. - This place should be adjacent to (and not earlier than) the place where - the filesystem update is committed to the transaction. - In general, when the filesystem calls a hook chain, it should be able to - handle sleeping and should not be vulnerable to memory reclaim or locking - recursion. - However, the exact requirements are very dependent on the context of the hook - caller and the callee. - -- The online fsck function should define a structure to hold scan data, a lock - to coordinate access to the scan data, and a ``struct xfs_hook`` object. - The scanner function and the regular filesystem code must acquire resources - in the same order; see the next section for details. - -- The online fsck code must contain a C function to catch the hook action code - and data structure. - If the object being updated has already been visited by the scan, then the - hook information must be applied to the scan data. - -- Prior to unlocking inodes to start the scan, online fsck must call - ``xfs_hooks_setup`` to initialize the ``struct xfs_hook``, and - ``xfs_hooks_add`` to enable the hook. - -- Online fsck must call ``xfs_hooks_del`` to disable the hook once the scan is - complete. - -The number of hooks should be kept to a minimum to reduce complexity. -Static keys are used to reduce the overhead of filesystem hooks to nearly -zero when online fsck is not running. - -.. _liveupdate: - -Live Updates During a Scan -`````````````````````````` - -The code paths of the online fsck scanning code and the :ref:`hooked` -filesystem code look like this:: - - other program - ↓ - inode lock ←────────────────────┐ - ↓ │ - AG header lock │ - ↓ │ - filesystem function │ - ↓ │ - notifier call chain │ same - ↓ ├─── inode - scrub hook function │ lock - ↓ │ - scan data mutex ←──┐ same │ - ↓ ├─── scan │ - update scan data │ lock │ - ↑ │ │ - scan data mutex ←──┘ │ - ↑ │ - inode lock ←────────────────────┘ - ↑ - scrub function - ↑ - inode scanner - ↑ - xfs_scrub - -These rules must be followed to ensure correct interactions between the -checking code and the code making an update to the filesystem: - -- Prior to invoking the notifier call chain, the filesystem function being - hooked must acquire the same lock that the scrub scanning function acquires - to scan the inode. - -- The scanning function and the scrub hook function must coordinate access to - the scan data by acquiring a lock on the scan data. - -- Scrub hook function must not add the live update information to the scan - observations unless the inode being updated has already been scanned. - The scan coordinator has a helper predicate (``xchk_iscan_want_live_update``) - for this. - -- Scrub hook functions must not change the caller's state, including the - transaction that it is running. - They must not acquire any resources that might conflict with the filesystem - function being hooked. - -- The hook function can abort the inode scan to avoid breaking the other rules. - -The inode scan APIs are pretty simple: - -- ``xchk_iscan_start`` starts a scan - -- ``xchk_iscan_iter`` grabs a reference to the next inode in the scan or - returns zero if there is nothing left to scan - -- ``xchk_iscan_want_live_update`` to decide if an inode has already been - visited in the scan. - This is critical for hook functions to decide if they need to update the - in-memory scan information. 
- -- ``xchk_iscan_mark_visited`` to mark an inode as having been visited in the - scan - -- ``xchk_iscan_teardown`` to finish the scan - -This functionality is also a part of the -`inode scanner -`_ -series. - -.. _quotacheck: - -Case Study: Quota Counter Checking -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -It is useful to compare the mount time quotacheck code to the online repair -quotacheck code. -Mount time quotacheck does not have to contend with concurrent operations, so -it does the following: - -1. Make sure the ondisk dquots are in good enough shape that all the incore - dquots will actually load, and zero the resource usage counters in the - ondisk buffer. - -2. Walk every inode in the filesystem. - Add each file's resource usage to the incore dquot. - -3. Walk each incore dquot. - If the incore dquot is not being flushed, add the ondisk buffer backing the - incore dquot to a delayed write (delwri) list. - -4. Write the buffer list to disk. - -Like most online fsck functions, online quotacheck can't write to regular -filesystem objects until the newly collected metadata reflect all filesystem -state. -Therefore, online quotacheck records file resource usage to a shadow dquot -index implemented with a sparse ``xfarray``, and only writes to the real dquots -once the scan is complete. -Handling transactional updates is tricky because quota resource usage updates -are handled in phases to minimize contention on dquots: - -1. The inodes involved are joined and locked to a transaction. - -2. For each dquot attached to the file: - - a. The dquot is locked. - - b. A quota reservation is added to the dquot's resource usage. - The reservation is recorded in the transaction. - - c. The dquot is unlocked. - -3. Changes in actual quota usage are tracked in the transaction. - -4. At transaction commit time, each dquot is examined again: - - a. The dquot is locked again. - - b. Quota usage changes are logged and unused reservation is given back to - the dquot. - - c. The dquot is unlocked. - -For online quotacheck, hooks are placed in steps 2 and 4. -The step 2 hook creates a shadow version of the transaction dquot context -(``dqtrx``) that operates in a similar manner to the regular code. -The step 4 hook commits the shadow ``dqtrx`` changes to the shadow dquots. -Notice that both hooks are called with the inode locked, which is how the -live update coordinates with the inode scanner. - -The quotacheck scan looks like this: - -1. Set up a coordinated inode scan. - -2. For each inode returned by the inode scan iterator: - - a. Grab and lock the inode. - - b. Determine that inode's resource usage (data blocks, inode counts, - realtime blocks) and add that to the shadow dquots for the user, group, - and project ids associated with the inode. - - c. Unlock and release the inode. - -3. For each dquot in the system: - - a. Grab and lock the dquot. - - b. Check the dquot against the shadow dquots created by the scan and updated - by the live hooks. - -Live updates are key to being able to walk every quota record without -needing to hold any locks for a long duration. -If repairs are desired, the real and shadow dquots are locked and their -resource counts are set to the values in the shadow dquot. - -The proposed patchset is the -`online quotacheck -`_ -series. - -.. _nlinks: - -Case Study: File Link Count Checking -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -File link count checking also uses live update hooks. 
-The coordinated inode scanner is used to visit all directories on the -filesystem, and per-file link count records are stored in a sparse ``xfarray`` -indexed by inumber. -During the scanning phase, each entry in a directory generates observation -data as follows: - -1. If the entry is a dotdot (``'..'``) entry of the root directory, the - directory's parent link count is bumped because the root directory's dotdot - entry is self referential. - -2. If the entry is a dotdot entry of a subdirectory, the parent's backref - count is bumped. - -3. If the entry is neither a dot nor a dotdot entry, the target file's parent - count is bumped. - -4. If the target is a subdirectory, the parent's child link count is bumped. - -A crucial point to understand about how the link count inode scanner interacts -with the live update hooks is that the scan cursor tracks which *parent* -directories have been scanned. -In other words, the live updates ignore any update about ``A → B`` when A has -not been scanned, even if B has been scanned. -Furthermore, a subdirectory A with a dotdot entry pointing back to B is -accounted as a backref counter in the shadow data for A, since child dotdot -entries affect the parent's link count. -Live update hooks are carefully placed in all parts of the filesystem that -create, change, or remove directory entries, since those operations involve -bumplink and droplink. - -For any file, the correct link count is the number of parents plus the number -of child subdirectories. -Non-directories never have children of any kind. -The backref information is used to detect inconsistencies in the number of -links pointing to child subdirectories and the number of dotdot entries -pointing back. - -After the scan completes, the link count of each file can be checked by locking -both the inode and the shadow data, and comparing the link counts. -A second coordinated inode scan cursor is used for comparisons. -Live updates are key to being able to walk every inode without needing to hold -any locks between inodes. -If repairs are desired, the inode's link count is set to the value in the -shadow information. -If no parents are found, the file must be :ref:`reparented ` to the -orphanage to prevent the file from being lost forever. - -The proposed patchset is the -`file link count repair -`_ -series. - -.. _rmap_repair: - -Case Study: Rebuilding Reverse Mapping Records -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Most repair functions follow the same pattern: lock filesystem resources, -walk the surviving ondisk metadata looking for replacement metadata records, -and use an :ref:`in-memory array ` to store the gathered observations. -The primary advantage of this approach is the simplicity and modularity of the -repair code -- code and data are entirely contained within the scrub module, -do not require hooks in the main filesystem, and are usually the most efficient -in memory use. -A secondary advantage of this repair approach is atomicity -- once the kernel -decides a structure is corrupt, no other threads can access the metadata until -the kernel finishes repairing and revalidating the metadata. - -For repairs going on within a shard of the filesystem, these advantages -outweigh the delays inherent in locking the shard while repairing parts of the -shard. -Unfortunately, repairs to the reverse mapping btree cannot use the "standard" -btree repair strategy because it must scan every space mapping of every fork of -every file in the filesystem, and the filesystem cannot stop. 
-Therefore, rmap repair foregoes atomicity between scrub and repair.
-It combines a :ref:`coordinated inode scanner `, :ref:`live update hooks
-`, and an :ref:`in-memory rmap btree ` to complete the
-scan for reverse mapping records.
-
-1. Set up an xfbtree to stage rmap records.
-
-2. While holding the locks on the AGI and AGF buffers acquired during the
-   scrub, generate reverse mappings for all AG metadata: inodes, btrees, CoW
-   staging extents, and the internal log.
-
-3. Set up an inode scanner.
-
-4. Hook into rmap updates for the AG being repaired so that the live scan data
-   can receive updates to the rmap btree from the rest of the filesystem during
-   the file scan.
-
-5. For each space mapping found in either fork of each file scanned,
-   decide if the mapping matches the AG of interest.
-   If so:
-
-   a. Create a btree cursor for the in-memory btree.
-
-   b. Use the rmap code to add the record to the in-memory btree.
-
-   c. Use the :ref:`special commit function ` to write the
-      xfbtree changes to the xfile.
-
-6. For each live update received via the hook, decide if the owner has already
-   been scanned.
-   If so, apply the live update into the scan data:
-
-   a. Create a btree cursor for the in-memory btree.
-
-   b. Replay the operation into the in-memory btree.
-
-   c. Use the :ref:`special commit function ` to write the
-      xfbtree changes to the xfile.
-      This is performed with an empty transaction to avoid changing the
-      caller's state.
-
-7. When the inode scan finishes, create a new scrub transaction and relock the
-   two AG headers.
-
-8. Compute the new btree geometry using the number of rmap records in the
-   shadow btree, like all other btree rebuilding functions.
-
-9. Allocate the number of blocks computed in the previous step.
-
-10. Perform the usual btree bulk loading and commit to install the new rmap
-    btree.
-
-11. Reap the old rmap btree blocks as discussed in the case study about how
-    to :ref:`reap after rmap btree repair `.
-
-12. Free the xfbtree now that it is not needed.
-
-The proposed patchset is the
-`rmap repair
-`_
-series.
-
-Staging Repairs with Temporary Files on Disk
---------------------------------------------
-
-XFS stores a substantial amount of metadata in file forks: directories,
-extended attributes, symbolic link targets, free space bitmaps and summary
-information for the realtime volume, and quota records.
-File forks map 64-bit logical file fork space extents to physical storage space
-extents, similar to how a memory management unit maps 64-bit virtual addresses
-to physical memory addresses.
-Therefore, file-based tree structures (such as directories and extended
-attributes) use blocks mapped in the file fork offset address space that point
-to other blocks mapped within that same address space, and file-based linear
-structures (such as bitmaps and quota records) compute array element offsets in
-the file fork offset address space.
-
-Because file forks can consume as much space as the entire filesystem, repairs
-cannot be staged in memory, even when a paging scheme is available.
-Therefore, online repair of file-based metadata creates a temporary file in
-the XFS filesystem, writes a new structure at the correct offsets into the
-temporary file, and atomically swaps the fork mappings (and hence the fork
-contents) to commit the repair.
-Once the repair is complete, the old fork can be reaped as necessary; if the
-system goes down during the reap, the iunlink code will delete the blocks
-during log recovery.
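-
-In kernel-style pseudocode, the overall staging flow might look like the
-sketch below.
-The helper names correspond to functions discussed in the sections that
-follow, but the exact signatures shown here and the
-``xrep_build_new_contents`` step are illustrative assumptions rather than
-the final API:
-
-.. code-block:: c
-
-    /* Sketch only: stage a fork repair through a temporary file. */
-    static int
-    xrep_stage_through_tempfile(
-        struct xfs_scrub        *sc)
-    {
-        struct xfs_swapext_req  req;
-        int                     error;
-
-        /* Create a private, unlinked temporary file. */
-        error = xrep_tempfile_create(sc, S_IFREG);
-        if (error)
-            return error;
-
-        /* Write the new structure into the temporary file's fork. */
-        error = xrep_build_new_contents(sc);    /* hypothetical */
-        if (error)
-            return error;
-
-        /*
-         * Reserve resources for the exchange, then atomically swap the
-         * fork mappings with the file being repaired.
-         */
-        error = xrep_tempswap_trans_alloc(sc, XFS_DATA_FORK, &req);
-        if (error)
-            return error;
-        return xrep_tempswap_contents(sc, &req);
-    }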
- -**Note**: All space usage and inode indices in the filesystem *must* be -consistent to use a temporary file safely! -This dependency is the reason why online repair can only use pageable kernel -memory to stage ondisk space usage information. - -Swapping metadata extents with a temporary file requires the owner field of the -block headers to match the file being repaired and not the temporary file. The -directory, extended attribute, and symbolic link functions were all modified to -allow callers to specify owner numbers explicitly. - -There is a downside to the reaping process -- if the system crashes during the -reap phase and the fork extents are crosslinked, the iunlink processing will -fail because freeing space will find the extra reverse mappings and abort. - -Temporary files created for repair are similar to ``O_TMPFILE`` files created -by userspace. -They are not linked into a directory and the entire file will be reaped when -the last reference to the file is lost. -The key differences are that these files must have no access permission outside -the kernel at all, they must be specially marked to prevent them from being -opened by handle, and they must never be linked into the directory tree. - -+--------------------------------------------------------------------------+ -| **Historical Sidebar**: | -+--------------------------------------------------------------------------+ -| In the initial iteration of file metadata repair, the damaged metadata | -| blocks would be scanned for salvageable data; the extents in the file | -| fork would be reaped; and then a new structure would be built in its | -| place. | -| This strategy did not survive the introduction of the atomic repair | -| requirement expressed earlier in this document. | -| | -| The second iteration explored building a second structure at a high | -| offset in the fork from the salvage data, reaping the old extents, and | -| using a ``COLLAPSE_RANGE`` operation to slide the new extents into | -| place. | -| | -| This had many drawbacks: | -| | -| - Array structures are linearly addressed, and the regular filesystem | -| codebase does not have the concept of a linear offset that could be | -| applied to the record offset computation to build an alternate copy. | -| | -| - Extended attributes are allowed to use the entire attr fork offset | -| address space. | -| | -| - Even if repair could build an alternate copy of a data structure in a | -| different part of the fork address space, the atomic repair commit | -| requirement means that online repair would have to be able to perform | -| a log assisted ``COLLAPSE_RANGE`` operation to ensure that the old | -| structure was completely replaced. | -| | -| - A crash after construction of the secondary tree but before the range | -| collapse would leave unreachable blocks in the file fork. | -| This would likely confuse things further. | -| | -| - Reaping blocks after a repair is not a simple operation, and | -| initiating a reap operation from a restarted range collapse operation | -| during log recovery is daunting. | -| | -| - Directory entry blocks and quota records record the file fork offset | -| in the header area of each block. | -| An atomic range collapse operation would have to rewrite this part of | -| each block header. | -| Rewriting a single field in block headers is not a huge problem, but | -| it's something to be aware of. | -| | -| - Each block in a directory or extended attributes btree index contains | -| sibling and child block pointers. 
| -| Were the atomic commit to use a range collapse operation, each block | -| would have to be rewritten very carefully to preserve the graph | -| structure. | -| Doing this as part of a range collapse means rewriting a large number | -| of blocks repeatedly, which is not conducive to quick repairs. | -| | -| This lead to the introduction of temporary file staging. | -+--------------------------------------------------------------------------+ - -Using a Temporary File -`````````````````````` - -Online repair code should use the ``xrep_tempfile_create`` function to create a -temporary file inside the filesystem. -This allocates an inode, marks the in-core inode private, and attaches it to -the scrub context. -These files are hidden from userspace, may not be added to the directory tree, -and must be kept private. - -Temporary files only use two inode locks: the IOLOCK and the ILOCK. -The MMAPLOCK is not needed here, because there must not be page faults from -userspace for data fork blocks. -The usage patterns of these two locks are the same as for any other XFS file -- -access to file data are controlled via the IOLOCK, and access to file metadata -are controlled via the ILOCK. -Locking helpers are provided so that the temporary file and its lock state can -be cleaned up by the scrub context. -To comply with the nested locking strategy laid out in the :ref:`inode -locking` section, it is recommended that scrub functions use the -xrep_tempfile_ilock*_nowait lock helpers. - -Data can be written to a temporary file by two means: - -1. ``xrep_tempfile_copyin`` can be used to set the contents of a regular - temporary file from an xfile. - -2. The regular directory, symbolic link, and extended attribute functions can - be used to write to the temporary file. - -Once a good copy of a data file has been constructed in a temporary file, it -must be conveyed to the file being repaired, which is the topic of the next -section. - -The proposed patches are in the -`repair temporary files -`_ -series. - -Atomic Extent Swapping ----------------------- - -Once repair builds a temporary file with a new data structure written into -it, it must commit the new changes into the existing file. -It is not possible to swap the inumbers of two files, so instead the new -metadata must replace the old. -This suggests the need for the ability to swap extents, but the existing extent -swapping code used by the file defragmenting tool ``xfs_fsr`` is not sufficient -for online repair because: - -a. When the reverse-mapping btree is enabled, the swap code must keep the - reverse mapping information up to date with every exchange of mappings. - Therefore, it can only exchange one mapping per transaction, and each - transaction is independent. - -b. Reverse-mapping is critical for the operation of online fsck, so the old - defragmentation code (which swapped entire extent forks in a single - operation) is not useful here. - -c. Defragmentation is assumed to occur between two files with identical - contents. - For this use case, an incomplete exchange will not result in a user-visible - change in file contents, even if the operation is interrupted. - -d. Online repair needs to swap the contents of two files that are by definition - *not* identical. - For directory and xattr repairs, the user-visible contents might be the - same, but the contents of individual blocks may be very different. - -e. Old blocks in the file may be cross-linked with another structure and must - not reappear if the system goes down mid-repair. 
- -These problems are overcome by creating a new deferred operation and a new type -of log intent item to track the progress of an operation to exchange two file -ranges. -The new deferred operation type chains together the same transactions used by -the reverse-mapping extent swap code. -The new log item records the progress of the exchange to ensure that once an -exchange begins, it will always run to completion, even there are -interruptions. -The new ``XFS_SB_FEAT_INCOMPAT_LOG_ATOMIC_SWAP`` log-incompatible feature flag -in the superblock protects these new log item records from being replayed on -old kernels. - -The proposed patchset is the -`atomic extent swap -`_ -series. - -+--------------------------------------------------------------------------+ -| **Sidebar: Using Log-Incompatible Feature Flags** | -+--------------------------------------------------------------------------+ -| Starting with XFS v5, the superblock contains a | -| ``sb_features_log_incompat`` field to indicate that the log contains | -| records that might not readable by all kernels that could mount this | -| filesystem. | -| In short, log incompat features protect the log contents against kernels | -| that will not understand the contents. | -| Unlike the other superblock feature bits, log incompat bits are | -| ephemeral because an empty (clean) log does not need protection. | -| The log cleans itself after its contents have been committed into the | -| filesystem, either as part of an unmount or because the system is | -| otherwise idle. | -| Because upper level code can be working on a transaction at the same | -| time that the log cleans itself, it is necessary for upper level code to | -| communicate to the log when it is going to use a log incompatible | -| feature. | -| | -| The log coordinates access to incompatible features through the use of | -| one ``struct rw_semaphore`` for each feature. | -| The log cleaning code tries to take this rwsem in exclusive mode to | -| clear the bit; if the lock attempt fails, the feature bit remains set. | -| Filesystem code signals its intention to use a log incompat feature in a | -| transaction by calling ``xlog_use_incompat_feat``, which takes the rwsem | -| in shared mode. | -| The code supporting a log incompat feature should create wrapper | -| functions to obtain the log feature and call | -| ``xfs_add_incompat_log_feature`` to set the feature bits in the primary | -| superblock. | -| The superblock update is performed transactionally, so the wrapper to | -| obtain log assistance must be called just prior to the creation of the | -| transaction that uses the functionality. | -| For a file operation, this step must happen after taking the IOLOCK | -| and the MMAPLOCK, but before allocating the transaction. | -| When the transaction is complete, the ``xlog_drop_incompat_feat`` | -| function is called to release the feature. | -| The feature bit will not be cleared from the superblock until the log | -| becomes clean. | -| | -| Log-assisted extended attribute updates and atomic extent swaps both use | -| log incompat features and provide convenience wrappers around the | -| functionality. | -+--------------------------------------------------------------------------+ - -Mechanics of an Atomic Extent Swap -`````````````````````````````````` - -Swapping entire file forks is a complex task. -The goal is to exchange all file fork mappings between two file fork offset -ranges. 
-There are likely to be many extent mappings in each fork, and the edges of -the mappings aren't necessarily aligned. -Furthermore, there may be other updates that need to happen after the swap, -such as exchanging file sizes, inode flags, or conversion of fork data to local -format. -This is roughly the format of the new deferred extent swap work item: - -.. code-block:: c - - struct xfs_swapext_intent { - /* Inodes participating in the operation. */ - struct xfs_inode *sxi_ip1; - struct xfs_inode *sxi_ip2; - - /* File offset range information. */ - xfs_fileoff_t sxi_startoff1; - xfs_fileoff_t sxi_startoff2; - xfs_filblks_t sxi_blockcount; - - /* Set these file sizes after the operation, unless negative. */ - xfs_fsize_t sxi_isize1; - xfs_fsize_t sxi_isize2; - - /* XFS_SWAP_EXT_* log operation flags */ - uint64_t sxi_flags; - }; - -The new log intent item contains enough information to track two logical fork -offset ranges: ``(inode1, startoff1, blockcount)`` and ``(inode2, startoff2, -blockcount)``. -Each step of a swap operation exchanges the largest file range mapping possible -from one file to the other. -After each step in the swap operation, the two startoff fields are incremented -and the blockcount field is decremented to reflect the progress made. -The flags field captures behavioral parameters such as swapping the attr fork -instead of the data fork and other work to be done after the extent swap. -The two isize fields are used to swap the file size at the end of the operation -if the file data fork is the target of the swap operation. - -When the extent swap is initiated, the sequence of operations is as follows: - -1. Create a deferred work item for the extent swap. - At the start, it should contain the entirety of the file ranges to be - swapped. - -2. Call ``xfs_defer_finish`` to process the exchange. - This is encapsulated in ``xrep_tempswap_contents`` for scrub operations. - This will log an extent swap intent item to the transaction for the deferred - extent swap work item. - -3. Until ``sxi_blockcount`` of the deferred extent swap work item is zero, - - a. Read the block maps of both file ranges starting at ``sxi_startoff1`` and - ``sxi_startoff2``, respectively, and compute the longest extent that can - be swapped in a single step. - This is the minimum of the two ``br_blockcount`` s in the mappings. - Keep advancing through the file forks until at least one of the mappings - contains written blocks. - Mutual holes, unwritten extents, and extent mappings to the same physical - space are not exchanged. - - For the next few steps, this document will refer to the mapping that came - from file 1 as "map1", and the mapping that came from file 2 as "map2". - - b. Create a deferred block mapping update to unmap map1 from file 1. - - c. Create a deferred block mapping update to unmap map2 from file 2. - - d. Create a deferred block mapping update to map map1 into file 2. - - e. Create a deferred block mapping update to map map2 into file 1. - - f. Log the block, quota, and extent count updates for both files. - - g. Extend the ondisk size of either file if necessary. - - h. Log an extent swap done log item for the extent swap intent log item - that was read at the start of step 3. - - i. Compute the amount of file range that has just been covered. - This quantity is ``(map1.br_startoff + map1.br_blockcount - - sxi_startoff1)``, because step 3a could have skipped holes. - - j. 
Increase the starting offsets of ``sxi_startoff1`` and ``sxi_startoff2``
-      by the number of blocks computed in the previous step, and decrease
-      ``sxi_blockcount`` by the same quantity.
-      This advances the cursor.
-
-   k. Log a new extent swap intent log item reflecting the advanced state of
-      the work item.
-
-   l. Return the proper error code (EAGAIN) to the deferred operation manager
-      to inform it that there is more work to be done.
-      The operation manager completes the deferred work in steps 3b-3e before
-      moving back to the start of step 3.
-
-4. Perform any post-processing.
-   This will be discussed in more detail in subsequent sections.
-
-If the filesystem goes down in the middle of an operation, log recovery will
-find the most recent unfinished extent swap log intent item and restart from
-there.
-This is how extent swapping guarantees that an outside observer will either see
-the old broken structure or the new one, and never a mishmash of both.
-
-Preparation for Extent Swapping
-```````````````````````````````
-
-There are a few things that need to be taken care of before initiating an
-atomic extent swap operation.
-First, regular files require the page cache to be flushed to disk before the
-operation begins, and directio writes to be quiesced.
-Like any filesystem operation, extent swapping must determine the maximum
-amount of disk space and quota that can be consumed on behalf of both files in
-the operation, and reserve that quantity of resources to avoid an unrecoverable
-out of space failure once it starts dirtying metadata.
-The preparation step scans the ranges of both files to estimate:
-
-- Data device blocks needed to handle the repeated updates to the fork
-  mappings.
-- Change in data and realtime block counts for both files.
-- Increase in quota usage for both files, if the two files do not share the
-  same set of quota ids.
-- The number of extent mappings that will be added to each file.
-- Whether or not there are partially written realtime extents.
-  User programs must never be able to access a realtime file extent that maps
-  to different extents on the realtime volume, which could happen if the
-  operation fails to run to completion.
-
-The need for precise estimation increases the run time of the swap operation,
-but it is very important to maintain correct accounting.
-The filesystem must not run completely out of free space, nor can the extent
-swap ever add more extent mappings to a fork than it can support.
-Regular users are required to abide by the quota limits, though metadata
-repairs may exceed quota to resolve inconsistent metadata elsewhere.
-
-Special Features for Swapping Metadata File Extents
-```````````````````````````````````````````````````
-
-Extended attributes, symbolic links, and directories can set the fork format to
-"local" and treat the fork as a literal area for data storage.
-Metadata repairs must take extra steps to support these cases:
-
-- If both forks are in local format and the fork areas are large enough, the
-  swap is performed by copying the incore fork contents, logging both forks,
-  and committing.
-  The atomic extent swap mechanism is not necessary, since this can be done
-  with a single transaction.
-
-- If both forks map blocks, then the regular atomic extent swap is used.
-
-- Otherwise, only one fork is in local format.
-  The contents of the local format fork are converted to a block to perform the
-  swap.
- The conversion to block format must be done in the same transaction that - logs the initial extent swap intent log item. - The regular atomic extent swap is used to exchange the mappings. - Special flags are set on the swap operation so that the transaction can be - rolled one more time to convert the second file's fork back to local format - so that the second file will be ready to go as soon as the ILOCK is dropped. - -Extended attributes and directories stamp the owning inode into every block, -but the buffer verifiers do not actually check the inode number! -Although there is no verification, it is still important to maintain -referential integrity, so prior to performing the extent swap, online repair -builds every block in the new data structure with the owner field of the file -being repaired. - -After a successful swap operation, the repair operation must reap the old fork -blocks by processing each fork mapping through the standard :ref:`file extent -reaping ` mechanism that is done post-repair. -If the filesystem should go down during the reap part of the repair, the -iunlink processing at the end of recovery will free both the temporary file and -whatever blocks were not reaped. -However, this iunlink processing omits the cross-link detection of online -repair, and is not completely foolproof. - -Swapping Temporary File Extents -``````````````````````````````` - -To repair a metadata file, online repair proceeds as follows: - -1. Create a temporary repair file. - -2. Use the staging data to write out new contents into the temporary repair - file. - The same fork must be written to as is being repaired. - -3. Commit the scrub transaction, since the swap estimation step must be - completed before transaction reservations are made. - -4. Call ``xrep_tempswap_trans_alloc`` to allocate a new scrub transaction with - the appropriate resource reservations, locks, and fill out a ``struct - xfs_swapext_req`` with the details of the swap operation. - -5. Call ``xrep_tempswap_contents`` to swap the contents. - -6. Commit the transaction to complete the repair. - -.. _rtsummary: - -Case Study: Repairing the Realtime Summary File -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -In the "realtime" section of an XFS filesystem, free space is tracked via a -bitmap, similar to Unix FFS. -Each bit in the bitmap represents one realtime extent, which is a multiple of -the filesystem block size between 4KiB and 1GiB in size. -The realtime summary file indexes the number of free extents of a given size to -the offset of the block within the realtime free space bitmap where those free -extents begin. -In other words, the summary file helps the allocator find free extents by -length, similar to what the free space by count (cntbt) btree does for the data -section. - -The summary file itself is a flat file (with no block headers or checksums!) -partitioned into ``log2(total rt extents)`` sections containing enough 32-bit -counters to match the number of blocks in the rt bitmap. -Each counter records the number of free extents that start in that bitmap block -and can satisfy a power-of-two allocation request. - -To check the summary file against the bitmap: - -1. Take the ILOCK of both the realtime bitmap and summary files. - -2. For each free space extent recorded in the bitmap: - - a. Compute the position in the summary file that contains a counter that - represents this free extent. - - b. Read the counter from the xfile. - - c. Increment it, and write it back to the xfile. - -3. 
Compare the contents of the xfile against the ondisk file. - -To repair the summary file, write the xfile contents into the temporary file -and use atomic extent swap to commit the new contents. -The temporary file is then reaped. - -The proposed patchset is the -`realtime summary repair -`_ -series. - -Case Study: Salvaging Extended Attributes -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -In XFS, extended attributes are implemented as a namespaced name-value store. -Values are limited in size to 64KiB, but there is no limit in the number of -names. -The attribute fork is unpartitioned, which means that the root of the attribute -structure is always in logical block zero, but attribute leaf blocks, dabtree -index blocks, and remote value blocks are intermixed. -Attribute leaf blocks contain variable-sized records that associate -user-provided names with the user-provided values. -Values larger than a block are allocated separate extents and written there. -If the leaf information expands beyond a single block, a directory/attribute -btree (``dabtree``) is created to map hashes of attribute names to entries -for fast lookup. - -Salvaging extended attributes is done as follows: - -1. Walk the attr fork mappings of the file being repaired to find the attribute - leaf blocks. - When one is found, - - a. Walk the attr leaf block to find candidate keys. - When one is found, - - 1. Check the name for problems, and ignore the name if there are. - - 2. Retrieve the value. - If that succeeds, add the name and value to the staging xfarray and - xfblob. - -2. If the memory usage of the xfarray and xfblob exceed a certain amount of - memory or there are no more attr fork blocks to examine, unlock the file and - add the staged extended attributes to the temporary file. - -3. Use atomic extent swapping to exchange the new and old extended attribute - structures. - The old attribute blocks are now attached to the temporary file. - -4. Reap the temporary file. - -The proposed patchset is the -`extended attribute repair -`_ -series. - -Fixing Directories ------------------- - -Fixing directories is difficult with currently available filesystem features, -since directory entries are not redundant. -The offline repair tool scans all inodes to find files with nonzero link count, -and then it scans all directories to establish parentage of those linked files. -Damaged files and directories are zapped, and files with no parent are -moved to the ``/lost+found`` directory. -It does not try to salvage anything. - -The best that online repair can do at this time is to read directory data -blocks and salvage any dirents that look plausible, correct link counts, and -move orphans back into the directory tree. -The salvage process is discussed in the case study at the end of this section. -The :ref:`file link count fsck ` code takes care of fixing link counts -and moving orphans to the ``/lost+found`` directory. - -Case Study: Salvaging Directories -````````````````````````````````` - -Unlike extended attributes, directory blocks are all the same size, so -salvaging directories is straightforward: - -1. Find the parent of the directory. - If the dotdot entry is not unreadable, try to confirm that the alleged - parent has a child entry pointing back to the directory being repaired. - Otherwise, walk the filesystem to find it. - -2. Walk the first partition of data fork of the directory to find the directory - entry data blocks. - When one is found, - - a. Walk the directory data block to find candidate entries. 
- When an entry is found: - - i. Check the name for problems, and ignore the name if there are. - - ii. Retrieve the inumber and grab the inode. - If that succeeds, add the name, inode number, and file type to the - staging xfarray and xblob. - -3. If the memory usage of the xfarray and xfblob exceed a certain amount of - memory or there are no more directory data blocks to examine, unlock the - directory and add the staged dirents into the temporary directory. - Truncate the staging files. - -4. Use atomic extent swapping to exchange the new and old directory structures. - The old directory blocks are now attached to the temporary file. - -5. Reap the temporary file. - -**Future Work Question**: Should repair revalidate the dentry cache when -rebuilding a directory? - -*Answer*: Yes, it should. - -In theory it is necessary to scan all dentry cache entries for a directory to -ensure that one of the following apply: - -1. The cached dentry reflects an ondisk dirent in the new directory. - -2. The cached dentry no longer has a corresponding ondisk dirent in the new - directory and the dentry can be purged from the cache. - -3. The cached dentry no longer has an ondisk dirent but the dentry cannot be - purged. - This is the problem case. - -Unfortunately, the current dentry cache design doesn't provide a means to walk -every child dentry of a specific directory, which makes this a hard problem. -There is no known solution. - -The proposed patchset is the -`directory repair -`_ -series. - -Parent Pointers -``````````````` - -A parent pointer is a piece of file metadata that enables a user to locate the -file's parent directory without having to traverse the directory tree from the -root. -Without them, reconstruction of directory trees is hindered in much the same -way that the historic lack of reverse space mapping information once hindered -reconstruction of filesystem space metadata. -The parent pointer feature, however, makes total directory reconstruction -possible. - -XFS parent pointers include the dirent name and location of the entry within -the parent directory. -In other words, child files use extended attributes to store pointers to -parents in the form ``(parent_inum, parent_gen, dirent_pos) → (dirent_name)``. -The directory checking process can be strengthened to ensure that the target of -each dirent also contains a parent pointer pointing back to the dirent. -Likewise, each parent pointer can be checked by ensuring that the target of -each parent pointer is a directory and that it contains a dirent matching -the parent pointer. -Both online and offline repair can use this strategy. - -**Note**: The ondisk format of parent pointers is not yet finalized. - -+--------------------------------------------------------------------------+ -| **Historical Sidebar**: | -+--------------------------------------------------------------------------+ -| Directory parent pointers were first proposed as an XFS feature more | -| than a decade ago by SGI. | -| Each link from a parent directory to a child file is mirrored with an | -| extended attribute in the child that could be used to identify the | -| parent directory. | -| Unfortunately, this early implementation had major shortcomings and was | -| never merged into Linux XFS: | -| | -| 1. The XFS codebase of the late 2000s did not have the infrastructure to | -| enforce strong referential integrity in the directory tree. 
| -| It did not guarantee that a change in a forward link would always be | -| followed up with the corresponding change to the reverse links. | -| | -| 2. Referential integrity was not integrated into offline repair. | -| Checking and repairs were performed on mounted filesystems without | -| taking any kernel or inode locks to coordinate access. | -| It is not clear how this actually worked properly. | -| | -| 3. The extended attribute did not record the name of the directory entry | -| in the parent, so the SGI parent pointer implementation cannot be | -| used to reconnect the directory tree. | -| | -| 4. Extended attribute forks only support 65,536 extents, which means | -| that parent pointer attribute creation is likely to fail at some | -| point before the maximum file link count is achieved. | -| | -| The original parent pointer design was too unstable for something like | -| a file system repair to depend on. | -| Allison Henderson, Chandan Babu, and Catherine Hoang are working on a | -| second implementation that solves all shortcomings of the first. | -| During 2022, Allison introduced log intent items to track physical | -| manipulations of the extended attribute structures. | -| This solves the referential integrity problem by making it possible to | -| commit a dirent update and a parent pointer update in the same | -| transaction. | -| Chandan increased the maximum extent counts of both data and attribute | -| forks, thereby ensuring that the extended attribute structure can grow | -| to handle the maximum hardlink count of any file. | -+--------------------------------------------------------------------------+ - -Case Study: Repairing Directories with Parent Pointers -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Directory rebuilding uses a :ref:`coordinated inode scan ` and -a :ref:`directory entry live update hook ` as follows: - -1. Set up a temporary directory for generating the new directory structure, - an xfblob for storing entry names, and an xfarray for stashing directory - updates. - -2. Set up an inode scanner and hook into the directory entry code to receive - updates on directory operations. - -3. For each parent pointer found in each file scanned, decide if the parent - pointer references the directory of interest. - If so: - - a. Stash an addname entry for this dirent in the xfarray for later. - - b. When finished scanning that file, flush the stashed updates to the - temporary directory. - -4. For each live directory update received via the hook, decide if the child - has already been scanned. - If so: - - a. Stash an addname or removename entry for this dirent update in the - xfarray for later. - We cannot write directly to the temporary directory because hook - functions are not allowed to modify filesystem metadata. - Instead, we stash updates in the xfarray and rely on the scanner thread - to apply the stashed updates to the temporary directory. - -5. When the scan is complete, atomically swap the contents of the temporary - directory and the directory being repaired. - The temporary directory now contains the damaged directory structure. - -6. Reap the temporary directory. - -7. Update the dirent position field of parent pointers as necessary. - This may require the queuing of a substantial number of xattr log intent - items. - -The proposed patchset is the -`parent pointers directory repair -`_ -series. - -**Unresolved Question**: How will repair ensure that the ``dirent_pos`` fields -match in the reconstructed directory? 
- -*Answer*: There are a few ways to solve this problem: - -1. The field could be designated advisory, since the other three values are - sufficient to find the entry in the parent. - However, this makes indexed key lookup impossible while repairs are ongoing. - -2. We could allow creating directory entries at specified offsets, which solves - the referential integrity problem but runs the risk that dirent creation - will fail due to conflicts with the free space in the directory. - - These conflicts could be resolved by appending the directory entry and - amending the xattr code to support updating an xattr key and reindexing the - dabtree, though this would have to be performed with the parent directory - still locked. - -3. Same as above, but remove the old parent pointer entry and add a new one - atomically. - -4. Change the ondisk xattr format to ``(parent_inum, name) → (parent_gen)``, - which would provide the attr name uniqueness that we require, without - forcing repair code to update the dirent position. - Unfortunately, this requires changes to the xattr code to support attr - names as long as 263 bytes. - -5. Change the ondisk xattr format to ``(parent_inum, hash(name)) → - (name, parent_gen)``. - If the hash is sufficiently resistant to collisions (e.g. sha256) then - this should provide the attr name uniqueness that we require. - Names shorter than 247 bytes could be stored directly. - -Discussion is ongoing under the `parent pointers patch deluge -`_. - -Case Study: Repairing Parent Pointers -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Online reconstruction of a file's parent pointer information works similarly to -directory reconstruction: - -1. Set up a temporary file for generating a new extended attribute structure, - an `xfblob` for storing parent pointer names, and an xfarray for - stashing parent pointer updates. - -2. Set up an inode scanner and hook into the directory entry code to receive - updates on directory operations. - -3. For each directory entry found in each directory scanned, decide if the - dirent references the file of interest. - If so: - - a. Stash an addpptr entry for this parent pointer in the xfblob and xfarray - for later. - - b. When finished scanning the directory, flush the stashed updates to the - temporary directory. - -4. For each live directory update received via the hook, decide if the parent - has already been scanned. - If so: - - a. Stash an addpptr or removepptr entry for this dirent update in the - xfarray for later. - We cannot write parent pointers directly to the temporary file because - hook functions are not allowed to modify filesystem metadata. - Instead, we stash updates in the xfarray and rely on the scanner thread - to apply the stashed parent pointer updates to the temporary file. - -5. Copy all non-parent pointer extended attributes to the temporary file. - -6. When the scan is complete, atomically swap the attribute fork of the - temporary file and the file being repaired. - The temporary file now contains the damaged extended attribute structure. - -7. Reap the temporary file. - -The proposed patchset is the -`parent pointers repair -`_ -series. - -Digression: Offline Checking of Parent Pointers -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Examining parent pointers in offline repair works differently because corrupt -files are erased long before directory tree connectivity checks are performed. -Parent pointer checks are therefore a second pass to be added to the existing -connectivity checks: - -1. 
After the set of surviving files has been established (i.e. phase 6), - walk the surviving directories of each AG in the filesystem. - This is already performed as part of the connectivity checks. - -2. For each directory entry found, record the name in an xfblob, and store - ``(child_ag_inum, parent_inum, parent_gen, dirent_pos)`` tuples in a - per-AG in-memory slab. - -3. For each AG in the filesystem, - - a. Sort the per-AG tuples in order of child_ag_inum, parent_inum, and - dirent_pos. - - b. For each inode in the AG, - - 1. Scan the inode for parent pointers. - Record the names in a per-file xfblob, and store ``(parent_inum, - parent_gen, dirent_pos)`` tuples in a per-file slab. - - 2. Sort the per-file tuples in order of parent_inum, and dirent_pos. - - 3. Position one slab cursor at the start of the inode's records in the - per-AG tuple slab. - This should be trivial since the per-AG tuples are in child inumber - order. - - 4. Position a second slab cursor at the start of the per-file tuple slab. - - 5. Iterate the two cursors in lockstep, comparing the parent_ino and - dirent_pos fields of the records under each cursor. - - a. Tuples in the per-AG list but not the per-file list are missing and - need to be written to the inode. - - b. Tuples in the per-file list but not the per-AG list are dangling - and need to be removed from the inode. - - c. For tuples in both lists, update the parent_gen and name components - of the parent pointer if necessary. - -4. Move on to examining link counts, as we do today. - -The proposed patchset is the -`offline parent pointers repair -`_ -series. - -Rebuilding directories from parent pointers in offline repair is very -challenging because it currently uses a single-pass scan of the filesystem -during phase 3 to decide which files are corrupt enough to be zapped. -This scan would have to be converted into a multi-pass scan: - -1. The first pass of the scan zaps corrupt inodes, forks, and attributes - much as it does now. - Corrupt directories are noted but not zapped. - -2. The next pass records parent pointers pointing to the directories noted - as being corrupt in the first pass. - This second pass may have to happen after the phase 4 scan for duplicate - blocks, if phase 4 is also capable of zapping directories. - -3. The third pass resets corrupt directories to an empty shortform directory. - Free space metadata has not been ensured yet, so repair cannot yet use the - directory building code in libxfs. - -4. At the start of phase 6, space metadata have been rebuilt. - Use the parent pointer information recorded during step 2 to reconstruct - the dirents and add them to the now-empty directories. - -This code has not yet been constructed. - -.. _orphanage: - -The Orphanage -------------- - -Filesystems present files as a directed, and hopefully acyclic, graph. -In other words, a tree. -The root of the filesystem is a directory, and each entry in a directory points -downwards either to more subdirectories or to non-directory files. -Unfortunately, a disruption in the directory graph pointers result in a -disconnected graph, which makes files impossible to access via regular path -resolution. - -Without parent pointers, the directory parent pointer online scrub code can -detect a dotdot entry pointing to a parent directory that doesn't have a link -back to the child directory and the file link count checker can detect a file -that isn't pointed to by any directory in the filesystem. 
-If such a file has a positive link count, the file is an orphan. - -With parent pointers, directories can be rebuilt by scanning parent pointers -and parent pointers can be rebuilt by scanning directories. -This should reduce the incidence of files ending up in ``/lost+found``. - -When orphans are found, they should be reconnected to the directory tree. -Offline fsck solves the problem by creating a directory ``/lost+found`` to -serve as an orphanage, and linking orphan files into the orphanage by using the -inumber as the name. -Reparenting a file to the orphanage does not reset any of its permissions or -ACLs. - -This process is more involved in the kernel than it is in userspace. -The directory and file link count repair setup functions must use the regular -VFS mechanisms to create the orphanage directory with all the necessary -security attributes and dentry cache entries, just like a regular directory -tree modification. - -Orphaned files are adopted by the orphanage as follows: - -1. Call ``xrep_orphanage_try_create`` at the start of the scrub setup function - to try to ensure that the lost and found directory actually exists. - This also attaches the orphanage directory to the scrub context. - -2. If the decision is made to reconnect a file, take the IOLOCK of both the - orphanage and the file being reattached. - The ``xrep_orphanage_iolock_two`` function follows the inode locking - strategy discussed earlier. - -3. Call ``xrep_orphanage_compute_blkres`` and ``xrep_orphanage_compute_name`` - to compute the new name in the orphanage and the block reservation required. - -4. Use ``xrep_orphanage_adoption_prep`` to reserve resources to the repair - transaction. - -5. Call ``xrep_orphanage_adopt`` to reparent the orphaned file into the lost - and found, and update the kernel dentry cache. - -The proposed patches are in the -`orphanage adoption -`_ -series. - -6. Userspace Algorithms and Data Structures -=========================================== - -This section discusses the key algorithms and data structures of the userspace -program, ``xfs_scrub``, that provide the ability to drive metadata checks and -repairs in the kernel, verify file data, and look for other potential problems. - -.. _scrubcheck: - -Checking Metadata ------------------ - -Recall the :ref:`phases of fsck work` outlined earlier. -That structure follows naturally from the data dependencies designed into the -filesystem from its beginnings in 1993. -In XFS, there are several groups of metadata dependencies: - -a. Filesystem summary counts depend on consistency within the inode indices, - the allocation group space btrees, and the realtime volume space - information. - -b. Quota resource counts depend on consistency within the quota file data - forks, inode indices, inode records, and the forks of every file on the - system. - -c. The naming hierarchy depends on consistency within the directory and - extended attribute structures. - This includes file link counts. - -d. Directories, extended attributes, and file data depend on consistency within - the file forks that map directory and extended attribute data to physical - storage media. - -e. The file forks depends on consistency within inode records and the space - metadata indices of the allocation groups and the realtime volume. - This includes quota and realtime metadata files. - -f. Inode records depends on consistency within the inode metadata indices. - -g. Realtime space metadata depend on the inode records and data forks of the - realtime metadata inodes. 
- -h. The allocation group metadata indices (free space, inodes, reference count, - and reverse mapping btrees) depend on consistency within the AG headers and - between all the AG metadata btrees. - -i. ``xfs_scrub`` depends on the filesystem being mounted and kernel support - for online fsck functionality. - -Therefore, a metadata dependency graph is a convenient way to schedule checking -operations in the ``xfs_scrub`` program: - -- Phase 1 checks that the provided path maps to an XFS filesystem and detect - the kernel's scrubbing abilities, which validates group (i). - -- Phase 2 scrubs groups (g) and (h) in parallel using a threaded workqueue. - -- Phase 3 scans inodes in parallel. - For each inode, groups (f), (e), and (d) are checked, in that order. - -- Phase 4 repairs everything in groups (i) through (d) so that phases 5 and 6 - may run reliably. - -- Phase 5 starts by checking groups (b) and (c) in parallel before moving on - to checking names. - -- Phase 6 depends on groups (i) through (b) to find file data blocks to verify, - to read them, and to report which blocks of which files are affected. - -- Phase 7 checks group (a), having validated everything else. - -Notice that the data dependencies between groups are enforced by the structure -of the program flow. - -Parallel Inode Scans --------------------- - -An XFS filesystem can easily contain hundreds of millions of inodes. -Given that XFS targets installations with large high-performance storage, -it is desirable to scrub inodes in parallel to minimize runtime, particularly -if the program has been invoked manually from a command line. -This requires careful scheduling to keep the threads as evenly loaded as -possible. - -Early iterations of the ``xfs_scrub`` inode scanner naïvely created a single -workqueue and scheduled a single workqueue item per AG. -Each workqueue item walked the inode btree (with ``XFS_IOC_INUMBERS``) to find -inode chunks and then called bulkstat (``XFS_IOC_BULKSTAT``) to gather enough -information to construct file handles. -The file handle was then passed to a function to generate scrub items for each -metadata object of each inode. -This simple algorithm leads to thread balancing problems in phase 3 if the -filesystem contains one AG with a few large sparse files and the rest of the -AGs contain many smaller files. -The inode scan dispatch function was not sufficiently granular; it should have -been dispatching at the level of individual inodes, or, to constrain memory -consumption, inode btree records. - -Thanks to Dave Chinner, bounded workqueues in userspace enable ``xfs_scrub`` to -avoid this problem with ease by adding a second workqueue. -Just like before, the first workqueue is seeded with one workqueue item per AG, -and it uses INUMBERS to find inode btree chunks. -The second workqueue, however, is configured with an upper bound on the number -of items that can be waiting to be run. -Each inode btree chunk found by the first workqueue's workers are queued to the -second workqueue, and it is this second workqueue that queries BULKSTAT, -creates a file handle, and passes it to a function to generate scrub items for -each metadata object of each inode. -If the second workqueue is too full, the workqueue add function blocks the -first workqueue's workers until the backlog eases. -This doesn't completely solve the balancing problem, but reduces it enough to -move on to more pressing issues. 
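-
-The essence of the fix is a bounded producer/consumer queue in which the
-per-AG INUMBERS walkers block whenever the BULKSTAT consumers fall behind.
-The sketch below illustrates that blocking behavior with a plain pthreads
-condition variable; it is a simplified stand-in for the real workqueue code
-in ``xfs_scrub``, and the structure and function names are invented for this
-example:
-
-.. code-block:: c
-
-    #include <pthread.h>
-
-    struct bounded_queue {
-        pthread_mutex_t     lock;
-        pthread_cond_t      not_full;
-        pthread_cond_t      not_empty;
-        unsigned int        nr_items;       /* inode btree chunks queued */
-        unsigned int        max_items;      /* upper bound on the backlog */
-        /* ...list of queued inode btree records elided... */
-    };
-
-    /* Called by the first (one per AG) workqueue's workers. */
-    void bounded_queue_add(struct bounded_queue *bq)
-    {
-        pthread_mutex_lock(&bq->lock);
-        /* Throttle the producer until the BULKSTAT workers catch up. */
-        while (bq->nr_items >= bq->max_items)
-            pthread_cond_wait(&bq->not_full, &bq->lock);
-        /* ...append the inode btree record to the list... */
-        bq->nr_items++;
-        pthread_cond_signal(&bq->not_empty);
-        pthread_mutex_unlock(&bq->lock);
-    }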
- -The proposed patchsets are the scrub -`performance tweaks -`_ -and the -`inode scan rebalance -`_ -series. - -.. _scrubrepair: - -Scheduling Repairs ------------------- - -During phase 2, corruptions and inconsistencies reported in any AGI header or -inode btree are repaired immediately, because phase 3 relies on proper -functioning of the inode indices to find inodes to scan. -Failed repairs are rescheduled to phase 4. -Problems reported in any other space metadata are deferred to phase 4. -Optimization opportunities are always deferred to phase 4, no matter their -origin. - -During phase 3, corruptions and inconsistencies reported in any part of a -file's metadata are repaired immediately if all space metadata were validated -during phase 2. -Repairs that fail or cannot be repaired immediately are scheduled for phase 4. - -In the original design of ``xfs_scrub``, it was thought that repairs would be -so infrequent that the ``struct xfs_scrub_metadata`` objects used to -communicate with the kernel could also be used as the primary object to -schedule repairs. -With recent increases in the number of optimizations possible for a given -filesystem object, it became much more memory-efficient to track all eligible -repairs for a given filesystem object with a single repair item. -Each repair item represents a single lockable object -- AGs, metadata files, -individual inodes, or a class of summary information. - -Phase 4 is responsible for scheduling a lot of repair work in as quick a -manner as is practical. -The :ref:`data dependencies ` outlined earlier still apply, which -means that ``xfs_scrub`` must try to complete the repair work scheduled by -phase 2 before trying repair work scheduled by phase 3. -The repair process is as follows: - -1. Start a round of repair with a workqueue and enough workers to keep the CPUs - as busy as the user desires. - - a. For each repair item queued by phase 2, - - i. Ask the kernel to repair everything listed in the repair item for a - given filesystem object. - - ii. Make a note if the kernel made any progress in reducing the number - of repairs needed for this object. - - iii. If the object no longer requires repairs, revalidate all metadata - associated with this object. - If the revalidation succeeds, drop the repair item. - If not, requeue the item for more repairs. - - b. If any repairs were made, jump back to 1a to retry all the phase 2 items. - - c. For each repair item queued by phase 3, - - i. Ask the kernel to repair everything listed in the repair item for a - given filesystem object. - - ii. Make a note if the kernel made any progress in reducing the number - of repairs needed for this object. - - iii. If the object no longer requires repairs, revalidate all metadata - associated with this object. - If the revalidation succeeds, drop the repair item. - If not, requeue the item for more repairs. - - d. If any repairs were made, jump back to 1c to retry all the phase 3 items. - -2. If step 1 made any repair progress of any kind, jump back to step 1 to start - another round of repair. - -3. If there are items left to repair, run them all serially one more time. - Complain if the repairs were not successful, since this is the last chance - to repair anything. - -Corruptions and inconsistencies encountered during phases 5 and 7 are repaired -immediately. -Corrupt file data blocks reported by phase 6 cannot be recovered by the -filesystem. 
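-
-In outline form, the scheduling algorithm above reduces to a simple
-fixed-point loop: keep retrying as long as any pass makes progress, then
-make one final serial pass.
-The sketch below is illustrative only; the structure and helper names are
-invented for this example and do not match ``xfs_scrub``'s sources:
-
-.. code-block:: c
-
-    #include <stdbool.h>
-
-    struct repair_list;
-
-    /* Run one parallel pass; return true if any repair made progress. */
-    bool repair_list_pass(struct repair_list *list);
-
-    /* Run the remaining items serially and complain about failures. */
-    void repair_list_final_pass(struct repair_list *list);
-
-    void phase4(struct repair_list *from_phase2,
-                struct repair_list *from_phase3)
-    {
-        bool progress;
-
-        do {
-            /* Phase 2 repairs must be retried before phase 3 repairs. */
-            progress = false;
-            while (repair_list_pass(from_phase2))
-                progress = true;
-            while (repair_list_pass(from_phase3))
-                progress = true;
-        } while (progress);
-
-        /* Last chance: run the leftovers one more time, serially. */
-        repair_list_final_pass(from_phase2);
-        repair_list_final_pass(from_phase3);
-    }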
- -The proposed patchsets are the -`repair warning improvements -`_, -refactoring of the -`repair data dependency -`_ -and -`object tracking -`_, -and the -`repair scheduling -`_ -improvement series. - -Checking Names for Confusable Unicode Sequences ------------------------------------------------ - -If ``xfs_scrub`` succeeds in validating the filesystem metadata by the end of -phase 4, it moves on to phase 5, which checks for suspicious looking names in -the filesystem. -These names consist of the filesystem label, names in directory entries, and -the names of extended attributes. -Like most Unix filesystems, XFS imposes the sparest of constraints on the -contents of a name: - -- Slashes and null bytes are not allowed in directory entries. - -- Null bytes are not allowed in userspace-visible extended attributes. - -- Null bytes are not allowed in the filesystem label. - -Directory entries and attribute keys store the length of the name explicitly -ondisk, which means that nulls are not name terminators. -For this section, the term "naming domain" refers to any place where names are -presented together -- all the names in a directory, or all the attributes of a -file. - -Although the Unix naming constraints are very permissive, the reality of most -modern-day Linux systems is that programs work with Unicode character code -points to support international languages. -These programs typically encode those code points in UTF-8 when interfacing -with the C library because the kernel expects null-terminated names. -In the common case, therefore, names found in an XFS filesystem are actually -UTF-8 encoded Unicode data. - -To maximize its expressiveness, the Unicode standard defines separate control -points for various characters that render similarly or identically in writing -systems around the world. -For example, the character "Cyrillic Small Letter A" U+0430 "а" often renders -identically to "Latin Small Letter A" U+0061 "a". - -The standard also permits characters to be constructed in multiple ways -- -either by using a defined code point, or by combining one code point with -various combining marks. -For example, the character "Angstrom Sign U+212B "Å" can also be expressed -as "Latin Capital Letter A" U+0041 "A" followed by "Combining Ring Above" -U+030A "◌̊". -Both sequences render identically. - -Like the standards that preceded it, Unicode also defines various control -characters to alter the presentation of text. -For example, the character "Right-to-Left Override" U+202E can trick some -programs into rendering "moo\\xe2\\x80\\xaegnp.txt" as "mootxt.png". -A second category of rendering problems involves whitespace characters. -If the character "Zero Width Space" U+200B is encountered in a file name, the -name will render identically to a name that does not have the zero width -space. - -If two names within a naming domain have different byte sequences but render -identically, a user may be confused by it. -The kernel, in its indifference to upper level encoding schemes, permits this. -Most filesystem drivers persist the byte sequence names that are given to them -by the VFS. - -Techniques for detecting confusable names are explained in great detail in -sections 4 and 5 of the -`Unicode Security Mechanisms `_ -document. 
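-
-As an illustration of the kind of confusability check involved, the ICU
-library that ``xfs_scrub`` relies on (see below) exposes a spoof-checker API
-in C.
-The following standalone sketch is not ``xfs_scrub``'s phase 5 code (the
-real checker also normalizes names and compares only within a single naming
-domain), but it shows the underlying primitive:
-
-.. code-block:: c
-
-    /* Build with: cc confuse.c -licuuc -licui18n */
-    #include <stdio.h>
-    #include <unicode/uspoof.h>
-
-    int main(void)
-    {
-        UErrorCode      status = U_ZERO_ERROR;
-        USpoofChecker   *sc = uspoof_open(&status);
-        /* Latin "paypal" vs. a lookalike spelled with Cyrillic letters. */
-        const char      *a = "paypal";
-        const char      *b = "pаypаl";
-
-        if (U_FAILURE(status))
-            return 1;
-        if (uspoof_areConfusableUTF8(sc, a, -1, b, -1, &status) != 0)
-            printf("%s and %s are confusable\n", a, b);
-        uspoof_close(sc);
-        return 0;
-    }
-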
-When ``xfs_scrub`` detects UTF-8 encoding in use on a system, it uses the
-Unicode normalization form NFD in conjunction with the confusable name
-detection component of
-`libicu `_
-to identify names within a directory or within a file's extended attributes
-that could be confused for each other.
-Names are also checked for control characters, non-rendering characters, and
-mixing of bidirectional characters.
-All of these potential issues are reported to the system administrator during
-phase 5.
-
-Media Verification of File Data Extents
----------------------------------------
-
-The system administrator can elect to initiate a media scan of all file data
-blocks.
-This scan runs as phase 6, after validation of all filesystem metadata
-(except for the summary counters).
-The scan starts by calling ``FS_IOC_GETFSMAP`` to scan the filesystem space map
-to find areas that are allocated to file data fork extents.
-Gaps between data fork extents that are smaller than 64k are treated as if
-they were data fork extents to reduce the command setup overhead.
-When the space map scan accumulates a region larger than 32MB, a media
-verification request is sent to the disk as a directio read of the raw block
-device.
-
-If the verification read fails, ``xfs_scrub`` retries with single-block reads
-to narrow the failure down to the specific region of the media, and records
-the results.
-When it has finished issuing verification requests, it again uses the space
-mapping ioctl to map the recorded media errors back to metadata structures
-and report what has been lost.
-For media errors in blocks owned by files, parent pointers can be used to
-construct file paths from inode numbers for user-friendly reporting.
-
-7. Conclusion and Future Work
-=============================
-
-It is hoped that the reader has followed the designs laid out in this document
-and now has some familiarity with how XFS performs online rebuilding of its
-metadata indices, and how filesystem users can interact with that
-functionality.
-Although the scope of this work is daunting, it is hoped that this guide will
-make it easier for code readers to understand what has been built, for whom it
-has been built, and why.
-Please feel free to contact the XFS mailing list with questions.
-
-FIEXCHANGE_RANGE
-----------------
-
-As discussed earlier, a second frontend to the atomic extent swap mechanism is
-a new ioctl call that userspace programs can use to commit updates to files
-atomically.
-This frontend has been out for review for several years now, though the
-necessary refinements to online repair and lack of customer demand mean that
-the proposal has not been pushed very hard.
-
-Extent Swapping with Regular User Files
-```````````````````````````````````````
-
-As mentioned earlier, XFS has long had the ability to swap extents between
-files, which is used almost exclusively by ``xfs_fsr`` to defragment files.
-The earliest form of this was the fork swap mechanism, where the entire
-contents of data forks could be exchanged between two files by exchanging the
-raw bytes in each inode fork's immediate area.
-When XFS v5 came along with self-describing metadata, this old mechanism grew
-some log support to continue rewriting the owner fields of BMBT blocks during
-log recovery.
-When the reverse mapping btree was later added to XFS, the only way to maintain -the consistency of the fork mappings with the reverse mapping index was to -develop an iterative mechanism that used deferred bmap and rmap operations to -swap mappings one at a time. -This mechanism is identical to steps 2-3 from the procedure above except for -the new tracking items, because the atomic extent swap mechanism is an -iteration of an existing mechanism and not something totally novel. -For the narrow case of file defragmentation, the file contents must be -identical, so the recovery guarantees are not much of a gain. - -Atomic extent swapping is much more flexible than the existing swapext -implementations because it can guarantee that the caller never sees a mix of -old and new contents even after a crash, and it can operate on two arbitrary -file fork ranges. -The extra flexibility enables several new use cases: - -- **Atomic commit of file writes**: A userspace process opens a file that it - wants to update. - Next, it opens a temporary file and calls the file clone operation to reflink - the first file's contents into the temporary file. - Writes to the original file should instead be written to the temporary file. - Finally, the process calls the atomic extent swap system call - (``FIEXCHANGE_RANGE``) to exchange the file contents, thereby committing all - of the updates to the original file, or none of them. - -.. _swapext_if_unchanged: - -- **Transactional file updates**: The same mechanism as above, but the caller - only wants the commit to occur if the original file's contents have not - changed. - To make this happen, the calling process snapshots the file modification and - change timestamps of the original file before reflinking its data to the - temporary file. - When the program is ready to commit the changes, it passes the timestamps - into the kernel as arguments to the atomic extent swap system call. - The kernel only commits the changes if the provided timestamps match the - original file. - -- **Emulation of atomic block device writes**: Export a block device with a - logical sector size matching the filesystem block size to force all writes - to be aligned to the filesystem block size. - Stage all writes to a temporary file, and when that is complete, call the - atomic extent swap system call with a flag to indicate that holes in the - temporary file should be ignored. - This emulates an atomic device write in software, and can support arbitrary - scattered writes. - -Vectorized Scrub ----------------- - -As it turns out, the :ref:`refactoring ` of repair items mentioned -earlier was a catalyst for enabling a vectorized scrub system call. -Since 2018, the cost of making a kernel call has increased considerably on some -systems to mitigate the effects of speculative execution attacks. -This incentivizes program authors to make as few system calls as possible to -reduce the number of times an execution path crosses a security boundary. - -With vectorized scrub, userspace pushes to the kernel the identity of a -filesystem object, a list of scrub types to run against that object, and a -simple representation of the data dependencies between the selected scrub -types. -The kernel executes as much of the caller's plan as it can until it hits a -dependency that cannot be satisfied due to a corruption, and tells userspace -how much was accomplished. 
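-
-The sketch below illustrates only the idea of a vectorized scrub request: an
-object identity plus an ordered list of scrub types with dependencies, which
-the kernel walks until a dependency cannot be satisfied. The structure
-layout, names and outcome codes are invented for this document and do not
-correspond to the proposed ioctl::
-
-    #include <stdint.h>
-    #include <stdio.h>
-
-    enum outcome { UNTOUCHED, CLEAN, CORRUPT, SKIPPED };
-
-    struct scrub_vec {
-        uint32_t type;        /* which metadata check to run */
-        int32_t  depends_on;  /* index of a vector that must be CLEAN, or -1 */
-        uint32_t outcome;     /* filled in by the kernel */
-    };
-
-    /* Run the caller's plan in order, skipping work whose dependency failed. */
-    static unsigned int run_scrub_plan(struct scrub_vec *v, unsigned int nr)
-    {
-        unsigned int done = 0;
-
-        for (unsigned int i = 0; i < nr; i++) {
-            if (v[i].depends_on >= 0 &&
-                v[v[i].depends_on].outcome != CLEAN) {
-                v[i].outcome = SKIPPED;
-                continue;
-            }
-            /* pretend the second vector finds a corruption */
-            v[i].outcome = (i == 1) ? CORRUPT : CLEAN;
-            done++;
-        }
-        return done;
-    }
-
-    int main(void)
-    {
-        struct scrub_vec plan[] = {
-            { .type = 0, .depends_on = -1 },  /* e.g. inode core */
-            { .type = 1, .depends_on = 0 },   /* e.g. inode data fork */
-            { .type = 2, .depends_on = 1 },   /* e.g. block map counts */
-        };
-
-        /* Reports how much of the plan was accomplished, here 2 of 3. */
-        printf("completed %u of 3 vectors\n", run_scrub_plan(plan, 3));
-        return 0;
-    }
-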
-It is hoped that ``io_uring`` will pick up enough of this functionality that
-online fsck can use that instead of adding a separate vectored scrub system
-call to XFS.
-
-The relevant patchsets are the
-`kernel vectorized scrub
-`_
-and
-`userspace vectorized scrub
-`_
-series.
-
-Quality of Service Targets for Scrub
-------------------------------------
-
-One serious shortcoming of the online fsck code is that the amount of time that
-it can spend in the kernel holding resource locks is basically unbounded.
-Userspace is allowed to send a fatal signal to the process which will cause
-``xfs_scrub`` to exit when it reaches a good stopping point, but there's no way
-for userspace to provide a time budget to the kernel.
-Given that the scrub codebase has helpers to detect fatal signals, it shouldn't
-be too much work to allow userspace to specify a timeout for a scrub/repair
-operation and abort the operation if it exceeds budget.
-However, most repair functions have the property that once they begin to touch
-ondisk metadata, the operation cannot be cancelled cleanly, after which a QoS
-timeout is no longer useful.
-
-Defragmenting Free Space
-------------------------
-
-Over the years, many XFS users have requested the creation of a program to
-clear a portion of the physical storage underlying a filesystem so that it
-becomes a contiguous chunk of free space.
-Call this free space defragmenter ``clearspace`` for short.
-
-The first piece the ``clearspace`` program needs is the ability to read the
-reverse mapping index from userspace.
-This already exists in the form of the ``FS_IOC_GETFSMAP`` ioctl.
-The second piece it needs is a new fallocate mode
-(``FALLOC_FL_MAP_FREE_SPACE``) that allocates the free space in a region and
-maps it to a file.
-Call this file the "space collector" file.
-The third piece is the ability to force an online repair.
-
-To clear all the metadata out of a portion of physical storage, clearspace
-uses the new fallocate map-freespace call to map any free space in that region
-to the space collector file.
-Next, clearspace finds all metadata blocks in that region by way of
-``GETFSMAP`` and issues forced repair requests on the data structure.
-This often results in the metadata being rebuilt somewhere that is not being
-cleared.
-After each relocation, clearspace calls the "map free space" function again to
-collect any newly freed space in the region being cleared.
-
-To clear all the file data out of a portion of the physical storage, clearspace
-uses the FSMAP information to find relevant file data blocks.
-Having identified a good target, it uses the ``FICLONERANGE`` call on that part
-of the file to try to share the physical space with a dummy file.
-Cloning the extent means that the original owners cannot overwrite the
-contents; any changes will be written somewhere else via copy-on-write.
-Clearspace makes its own copy of the frozen extent in an area that is not being
-cleared, and uses ``FIDEDUPERANGE`` (or the :ref:`atomic extent swap
-` feature) to change the target file's data extent
-mapping away from the area being cleared.
-When all other mappings have been moved, clearspace reflinks the space into the
-space collector file so that it becomes unavailable.
-
-There are further optimizations that could apply to the above algorithm.
-To clear a piece of physical storage that has a high sharing factor, it is
-strongly desirable to retain this sharing factor.
-In fact, these extents should be moved first to maximize sharing factor after -the operation completes. -To make this work smoothly, clearspace needs a new ioctl -(``FS_IOC_GETREFCOUNTS``) to report reference count information to userspace. -With the refcount information exposed, clearspace can quickly find the longest, -most shared data extents in the filesystem, and target them first. - -**Future Work Question**: How might the filesystem move inode chunks? - -*Answer*: To move inode chunks, Dave Chinner constructed a prototype program -that creates a new file with the old contents and then locklessly runs around -the filesystem updating directory entries. -The operation cannot complete if the filesystem goes down. -That problem isn't totally insurmountable: create an inode remapping table -hidden behind a jump label, and a log item that tracks the kernel walking the -filesystem to update directory entries. -The trouble is, the kernel can't do anything about open files, since it cannot -revoke them. - -**Future Work Question**: Can static keys be used to minimize the cost of -supporting ``revoke()`` on XFS files? - -*Answer*: Yes. -Until the first revocation, the bailout code need not be in the call path at -all. - -The relevant patchsets are the -`kernel freespace defrag -`_ -and -`userspace freespace defrag -`_ -series. - -Shrinking Filesystems ---------------------- - -Removing the end of the filesystem ought to be a simple matter of evacuating -the data and metadata at the end of the filesystem, and handing the freed space -to the shrink code. -That requires an evacuation of the space at end of the filesystem, which is a -use of free space defragmentation! diff --git a/Documentation/filesystems/xfs-self-describing-metadata.rst b/Documentation/filesystems/xfs-self-describing-metadata.rst deleted file mode 100644 index a10c4ae6955e..000000000000 --- a/Documentation/filesystems/xfs-self-describing-metadata.rst +++ /dev/null @@ -1,353 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 -.. _xfs_self_describing_metadata: - -============================ -XFS Self Describing Metadata -============================ - -Introduction -============ - -The largest scalability problem facing XFS is not one of algorithmic -scalability, but of verification of the filesystem structure. Scalabilty of the -structures and indexes on disk and the algorithms for iterating them are -adequate for supporting PB scale filesystems with billions of inodes, however it -is this very scalability that causes the verification problem. - -Almost all metadata on XFS is dynamically allocated. The only fixed location -metadata is the allocation group headers (SB, AGF, AGFL and AGI), while all -other metadata structures need to be discovered by walking the filesystem -structure in different ways. While this is already done by userspace tools for -validating and repairing the structure, there are limits to what they can -verify, and this in turn limits the supportable size of an XFS filesystem. - -For example, it is entirely possible to manually use xfs_db and a bit of -scripting to analyse the structure of a 100TB filesystem when trying to -determine the root cause of a corruption problem, but it is still mainly a -manual task of verifying that things like single bit errors or misplaced writes -weren't the ultimate cause of a corruption event. It may take a few hours to a -few days to perform such forensic analysis, so for at this scale root cause -analysis is entirely possible. 
- -However, if we scale the filesystem up to 1PB, we now have 10x as much metadata -to analyse and so that analysis blows out towards weeks/months of forensic work. -Most of the analysis work is slow and tedious, so as the amount of analysis goes -up, the more likely that the cause will be lost in the noise. Hence the primary -concern for supporting PB scale filesystems is minimising the time and effort -required for basic forensic analysis of the filesystem structure. - - -Self Describing Metadata -======================== - -One of the problems with the current metadata format is that apart from the -magic number in the metadata block, we have no other way of identifying what it -is supposed to be. We can't even identify if it is the right place. Put simply, -you can't look at a single metadata block in isolation and say "yes, it is -supposed to be there and the contents are valid". - -Hence most of the time spent on forensic analysis is spent doing basic -verification of metadata values, looking for values that are in range (and hence -not detected by automated verification checks) but are not correct. Finding and -understanding how things like cross linked block lists (e.g. sibling -pointers in a btree end up with loops in them) are the key to understanding what -went wrong, but it is impossible to tell what order the blocks were linked into -each other or written to disk after the fact. - -Hence we need to record more information into the metadata to allow us to -quickly determine if the metadata is intact and can be ignored for the purpose -of analysis. We can't protect against every possible type of error, but we can -ensure that common types of errors are easily detectable. Hence the concept of -self describing metadata. - -The first, fundamental requirement of self describing metadata is that the -metadata object contains some form of unique identifier in a well known -location. This allows us to identify the expected contents of the block and -hence parse and verify the metadata object. IF we can't independently identify -the type of metadata in the object, then the metadata doesn't describe itself -very well at all! - -Luckily, almost all XFS metadata has magic numbers embedded already - only the -AGFL, remote symlinks and remote attribute blocks do not contain identifying -magic numbers. Hence we can change the on-disk format of all these objects to -add more identifying information and detect this simply by changing the magic -numbers in the metadata objects. That is, if it has the current magic number, -the metadata isn't self identifying. If it contains a new magic number, it is -self identifying and we can do much more expansive automated verification of the -metadata object at runtime, during forensic analysis or repair. - -As a primary concern, self describing metadata needs some form of overall -integrity checking. We cannot trust the metadata if we cannot verify that it has -not been changed as a result of external influences. Hence we need some form of -integrity check, and this is done by adding CRC32c validation to the metadata -block. If we can verify the block contains the metadata it was intended to -contain, a large amount of the manual verification work can be skipped. - -CRC32c was selected as metadata cannot be more than 64k in length in XFS and -hence a 32 bit CRC is more than sufficient to detect multi-bit errors in -metadata blocks. CRC32c is also now hardware accelerated on common CPUs so it is -fast. 
So while CRC32c is not the strongest of possible integrity checks that
-could be used, it is more than sufficient for our needs and has relatively
-little overhead. Adding support for larger integrity fields and/or algorithms
-does not really provide any extra value over CRC32c, but it does add a lot of
-complexity and so there is no provision for changing the integrity checking
-mechanism.
-
-Self describing metadata needs to contain enough information so that the
-metadata block can be verified as being in the correct place without needing to
-look at any other metadata. This means it needs to contain location information.
-Just adding a block number to the metadata is not sufficient to protect against
-mis-directed writes - a write might be misdirected to the wrong LUN and so be
-written to the "correct block" of the wrong filesystem. Hence location
-information must contain a filesystem identifier as well as a block number.
-
-Another key information point in forensic analysis is knowing who the metadata
-block belongs to. We already know the type, the location, that it is valid
-and/or corrupted, and how long ago it was last modified. Knowing the owner
-of the block is important as it allows us to find other related metadata to
-determine the scope of the corruption. For example, if we have an extent btree
-object, we don't know what inode it belongs to and hence have to walk the entire
-filesystem to find the owner of the block. Worse, the corruption could mean that
-no owner can be found (i.e. it's an orphan block), and so without an owner field
-in the metadata we have no idea of the scope of the corruption. If we have an
-owner field in the metadata object, we can immediately do top down validation to
-determine the scope of the problem.
-
-Different types of metadata have different owner identifiers. For example,
-directory, attribute and extent tree blocks are all owned by an inode, while
-freespace btree blocks are owned by an allocation group. Hence the size and
-contents of the owner field are determined by the type of metadata object we are
-looking at. The owner information can also identify misplaced writes (e.g.
-freespace btree block written to the wrong AG).
-
-Self describing metadata also needs to contain some indication of when it was
-written to the filesystem. One of the key information points when doing forensic
-analysis is how recently the block was modified. Correlation of a set of
-corrupted metadata blocks based on modification times is important as it can
-indicate whether the corruptions are related, whether there have been multiple
-corruption events that led to the eventual failure, and even whether there are
-corruptions present that the run-time verification is not detecting.
-
-For example, we can determine whether a metadata object is supposed to be free
-space or still allocated if it is still referenced by its owner by looking at
-when the free space btree block that contains the block was last written
-compared to when the metadata object itself was last written. If the free space
-block is more recent than the object and the object's owner, then there is a
-very good chance that the block should have been removed from the owner.
-
-To provide this "written timestamp", each metadata block gets the Log Sequence
-Number (LSN) of the most recent transaction it was modified on written into it.
-This number will always increase over the life of the filesystem, and the only
-thing that resets it is running xfs_repair on the filesystem.
Further, by use of -the LSN we can tell if the corrupted metadata all belonged to the same log -checkpoint and hence have some idea of how much modification occurred between -the first and last instance of corrupt metadata on disk and, further, how much -modification occurred between the corruption being written and when it was -detected. - -Runtime Validation -================== - -Validation of self-describing metadata takes place at runtime in two places: - - - immediately after a successful read from disk - - immediately prior to write IO submission - -The verification is completely stateless - it is done independently of the -modification process, and seeks only to check that the metadata is what it says -it is and that the metadata fields are within bounds and internally consistent. -As such, we cannot catch all types of corruption that can occur within a block -as there may be certain limitations that operational state enforces of the -metadata, or there may be corruption of interblock relationships (e.g. corrupted -sibling pointer lists). Hence we still need stateful checking in the main code -body, but in general most of the per-field validation is handled by the -verifiers. - -For read verification, the caller needs to specify the expected type of metadata -that it should see, and the IO completion process verifies that the metadata -object matches what was expected. If the verification process fails, then it -marks the object being read as EFSCORRUPTED. The caller needs to catch this -error (same as for IO errors), and if it needs to take special action due to a -verification error it can do so by catching the EFSCORRUPTED error value. If we -need more discrimination of error type at higher levels, we can define new -error numbers for different errors as necessary. - -The first step in read verification is checking the magic number and determining -whether CRC validating is necessary. If it is, the CRC32c is calculated and -compared against the value stored in the object itself. Once this is validated, -further checks are made against the location information, followed by extensive -object specific metadata validation. If any of these checks fail, then the -buffer is considered corrupt and the EFSCORRUPTED error is set appropriately. - -Write verification is the opposite of the read verification - first the object -is extensively verified and if it is OK we then update the LSN from the last -modification made to the object, After this, we calculate the CRC and insert it -into the object. Once this is done the write IO is allowed to continue. If any -error occurs during this process, the buffer is again marked with a EFSCORRUPTED -error for the higher layers to catch. - -Structures -========== - -A typical on-disk structure needs to contain the following information:: - - struct xfs_ondisk_hdr { - __be32 magic; /* magic number */ - __be32 crc; /* CRC, not logged */ - uuid_t uuid; /* filesystem identifier */ - __be64 owner; /* parent object */ - __be64 blkno; /* location on disk */ - __be64 lsn; /* last modification in log, not logged */ - }; - -Depending on the metadata, this information may be part of a header structure -separate to the metadata contents, or may be distributed through an existing -structure. The latter occurs with metadata that already contains some of this -information, such as the superblock and AG headers. - -Other metadata may have different formats for the information, but the same -level of information is generally provided. 
For example: - - - short btree blocks have a 32 bit owner (ag number) and a 32 bit block - number for location. The two of these combined provide the same - information as @owner and @blkno in eh above structure, but using 8 - bytes less space on disk. - - - directory/attribute node blocks have a 16 bit magic number, and the - header that contains the magic number has other information in it as - well. hence the additional metadata headers change the overall format - of the metadata. - -A typical buffer read verifier is structured as follows:: - - #define XFS_FOO_CRC_OFF offsetof(struct xfs_ondisk_hdr, crc) - - static void - xfs_foo_read_verify( - struct xfs_buf *bp) - { - struct xfs_mount *mp = bp->b_mount; - - if ((xfs_sb_version_hascrc(&mp->m_sb) && - !xfs_verify_cksum(bp->b_addr, BBTOB(bp->b_length), - XFS_FOO_CRC_OFF)) || - !xfs_foo_verify(bp)) { - XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr); - xfs_buf_ioerror(bp, EFSCORRUPTED); - } - } - -The code ensures that the CRC is only checked if the filesystem has CRCs enabled -by checking the superblock of the feature bit, and then if the CRC verifies OK -(or is not needed) it verifies the actual contents of the block. - -The verifier function will take a couple of different forms, depending on -whether the magic number can be used to determine the format of the block. In -the case it can't, the code is structured as follows:: - - static bool - xfs_foo_verify( - struct xfs_buf *bp) - { - struct xfs_mount *mp = bp->b_mount; - struct xfs_ondisk_hdr *hdr = bp->b_addr; - - if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC)) - return false; - - if (!xfs_sb_version_hascrc(&mp->m_sb)) { - if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid)) - return false; - if (bp->b_bn != be64_to_cpu(hdr->blkno)) - return false; - if (hdr->owner == 0) - return false; - } - - /* object specific verification checks here */ - - return true; - } - -If there are different magic numbers for the different formats, the verifier -will look like:: - - static bool - xfs_foo_verify( - struct xfs_buf *bp) - { - struct xfs_mount *mp = bp->b_mount; - struct xfs_ondisk_hdr *hdr = bp->b_addr; - - if (hdr->magic == cpu_to_be32(XFS_FOO_CRC_MAGIC)) { - if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid)) - return false; - if (bp->b_bn != be64_to_cpu(hdr->blkno)) - return false; - if (hdr->owner == 0) - return false; - } else if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC)) - return false; - - /* object specific verification checks here */ - - return true; - } - -Write verifiers are very similar to the read verifiers, they just do things in -the opposite order to the read verifiers. A typical write verifier:: - - static void - xfs_foo_write_verify( - struct xfs_buf *bp) - { - struct xfs_mount *mp = bp->b_mount; - struct xfs_buf_log_item *bip = bp->b_fspriv; - - if (!xfs_foo_verify(bp)) { - XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr); - xfs_buf_ioerror(bp, EFSCORRUPTED); - return; - } - - if (!xfs_sb_version_hascrc(&mp->m_sb)) - return; - - - if (bip) { - struct xfs_ondisk_hdr *hdr = bp->b_addr; - hdr->lsn = cpu_to_be64(bip->bli_item.li_lsn); - } - xfs_update_cksum(bp->b_addr, BBTOB(bp->b_length), XFS_FOO_CRC_OFF); - } - -This will verify the internal structure of the metadata before we go any -further, detecting corruptions that have occurred as the metadata has been -modified in memory. If the metadata verifies OK, and CRCs are enabled, we then -update the LSN field (when it was last modified) and calculate the CRC on the -metadata. 
Once this is done, we can issue the IO. - -Inodes and Dquots -================= - -Inodes and dquots are special snowflakes. They have per-object CRC and -self-identifiers, but they are packed so that there are multiple objects per -buffer. Hence we do not use per-buffer verifiers to do the work of per-object -verification and CRC calculations. The per-buffer verifiers simply perform basic -identification of the buffer - that they contain inodes or dquots, and that -there are magic numbers in all the expected spots. All further CRC and -verification checks are done when each inode is read from or written back to the -buffer. - -The structure of the verifiers and the identifiers checks is very similar to the -buffer code described above. The only difference is where they are called. For -example, inode read verification is done in xfs_inode_from_disk() when the inode -is first read out of the buffer and the struct xfs_inode is instantiated. The -inode is already extensively verified during writeback in xfs_iflush_int, so the -only addition here is to add the LSN and CRC to the inode as it is copied back -into the buffer. - -XXX: inode unlinked list modification doesn't recalculate the inode CRC! None of -the unlinked list modifications check or update CRCs, neither during unlink nor -log recovery. So, it's gone unnoticed until now. This won't matter immediately - -repair will probably complain about it - but it needs to be fixed. diff --git a/Documentation/filesystems/xfs/index.rst b/Documentation/filesystems/xfs/index.rst new file mode 100644 index 000000000000..ab66c57a5d18 --- /dev/null +++ b/Documentation/filesystems/xfs/index.rst @@ -0,0 +1,14 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============================ +XFS Filesystem Documentation +============================ + +.. toctree:: + :maxdepth: 2 + :numbered: + + xfs-delayed-logging-design + xfs-maintainer-entry-profile + xfs-self-describing-metadata + xfs-online-fsck-design diff --git a/Documentation/filesystems/xfs/xfs-delayed-logging-design.rst b/Documentation/filesystems/xfs/xfs-delayed-logging-design.rst new file mode 100644 index 000000000000..6402ab8e370c --- /dev/null +++ b/Documentation/filesystems/xfs/xfs-delayed-logging-design.rst @@ -0,0 +1,1087 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================== +XFS Logging Design +================== + +Preamble +======== + +This document describes the design and algorithms that the XFS journalling +subsystem is based on. This document describes the design and algorithms that +the XFS journalling subsystem is based on so that readers may familiarize +themselves with the general concepts of how transaction processing in XFS works. + +We begin with an overview of transactions in XFS, followed by describing how +transaction reservations are structured and accounted, and then move into how we +guarantee forwards progress for long running transactions with finite initial +reservations bounds. At this point we need to explain how relogging works. With +the basic concepts covered, the design of the delayed logging mechanism is +documented. + + +Introduction +============ + +XFS uses Write Ahead Logging for ensuring changes to the filesystem metadata +are atomic and recoverable. For reasons of space and time efficiency, the +logging mechanisms are varied and complex, combining intents, logical and +physical logging mechanisms to provide the necessary recovery guarantees the +filesystem requires. 
+ +Some objects, such as inodes and dquots, are logged in logical format where the +details logged are made up of the changes to in-core structures rather than +on-disk structures. Other objects - typically buffers - have their physical +changes logged. Long running atomic modifications have individual changes +chained together by intents, ensuring that journal recovery can restart and +finish an operation that was only partially done when the system stopped +functioning. + +The reason for these differences is to keep the amount of log space and CPU time +required to process objects being modified as small as possible and hence the +logging overhead as low as possible. Some items are very frequently modified, +and some parts of objects are more frequently modified than others, so keeping +the overhead of metadata logging low is of prime importance. + +The method used to log an item or chain modifications together isn't +particularly important in the scope of this document. It suffices to know that +the method used for logging a particular object or chaining modifications +together are different and are dependent on the object and/or modification being +performed. The logging subsystem only cares that certain specific rules are +followed to guarantee forwards progress and prevent deadlocks. + + +Transactions in XFS +=================== + +XFS has two types of high level transactions, defined by the type of log space +reservation they take. These are known as "one shot" and "permanent" +transactions. Permanent transaction reservations can take reservations that span +commit boundaries, whilst "one shot" transactions are for a single atomic +modification. + +The type and size of reservation must be matched to the modification taking +place. This means that permanent transactions can be used for one-shot +modifications, but one-shot reservations cannot be used for permanent +transactions. + +In the code, a one-shot transaction pattern looks somewhat like this:: + + tp = xfs_trans_alloc() + + + + xfs_trans_commit(tp); + +As items are modified in the transaction, the dirty regions in those items are +tracked via the transaction handle. Once the transaction is committed, all +resources joined to it are released, along with the remaining unused reservation +space that was taken at the transaction allocation time. + +In contrast, a permanent transaction is made up of multiple linked individual +transactions, and the pattern looks like this:: + + tp = xfs_trans_alloc() + xfs_ilock(ip, XFS_ILOCK_EXCL) + + loop { + xfs_trans_ijoin(tp, 0); + + xfs_trans_log_inode(tp, ip); + xfs_trans_roll(&tp); + } + + xfs_trans_commit(tp); + xfs_iunlock(ip, XFS_ILOCK_EXCL); + +While this might look similar to a one-shot transaction, there is an important +difference: xfs_trans_roll() performs a specific operation that links two +transactions together:: + + ntp = xfs_trans_dup(tp); + xfs_trans_commit(tp); + xfs_trans_reserve(ntp); + +This results in a series of "rolling transactions" where the inode is locked +across the entire chain of transactions. Hence while this series of rolling +transactions is running, nothing else can read from or write to the inode and +this provides a mechanism for complex changes to appear atomic from an external +observer's point of view. + +It is important to note that a series of rolling transactions in a permanent +transaction does not form an atomic change in the journal. While each +individual modification is atomic, the chain is *not atomic*. 
If we crash half
+way through, then recovery will only replay up to the last transactional
+modification the loop made that was committed to the journal.
+
+This affects long running permanent transactions in that it is not possible to
+predict how much of a long running operation will actually be recovered because
+there is no guarantee of how much of the operation reached stable storage. Hence
+if a long running operation requires multiple transactions to fully complete,
+the high level operation must use intents and deferred operations to guarantee
+recovery can complete the operation once the first transaction is persisted in
+the on-disk journal.
+
+
+Transactions are Asynchronous
+=============================
+
+In XFS, all high level transactions are asynchronous by default. This means that
+xfs_trans_commit() does not guarantee that the modification has been committed
+to stable storage when it returns. Hence when a system crashes, not all the
+completed transactions will be replayed during recovery.
+
+However, the logging subsystem does provide global ordering guarantees, such
+that if a specific change is seen after recovery, all metadata modifications
+that were committed prior to that change will also be seen.
+
+For single shot operations that need to reach stable storage immediately, or to
+ensure that a long running permanent transaction is fully committed once it is
+complete, we can explicitly tag a transaction as synchronous. This will trigger
+a "log force" to flush the outstanding committed transactions to stable storage
+in the journal and wait for that to complete.
+
+Synchronous transactions are rarely used, however, because they limit logging
+throughput to the IO latency limitations of the underlying storage. Instead, we
+tend to use log forces to ensure modifications are on stable storage only when
+a user operation requires a synchronisation point to occur (e.g. fsync).
+
+
+Transaction Reservations
+========================
+
+It has been mentioned a number of times now that the logging subsystem needs to
+provide a forwards progress guarantee so that no modification ever stalls
+because it can't be written to the journal due to a lack of space in the
+journal. This is achieved by the transaction reservations that are made when
+a transaction is first allocated. For permanent transactions, these reservations
+are maintained as part of the transaction rolling mechanism.
+
+A transaction reservation provides a guarantee that there is physical log space
+available to write the modification into the journal before we start making
+modifications to objects and items. As such, the reservation needs to be large
+enough to take into account the amount of metadata that the change might need to
+log in the worst case. This means that if we are modifying a btree in the
+transaction, we have to reserve enough space to record a full leaf-to-root split
+of the btree. As such, the reservations are quite complex because we have to
+take into account all the hidden changes that might occur.
+
+For example, a user data extent allocation involves allocating an extent from
+free space, which modifies the free space trees. That's two btrees. Inserting
+the extent into the inode's extent map might require a split of the extent map
+btree, which requires another allocation that can modify the free space trees
+again. Then we might have to update reverse mappings, which modifies yet
+another btree which might require more space. And so on.
Hence the amount of +metadata that a "simple" operation can modify can be quite large. + +This "worst case" calculation provides us with the static "unit reservation" +for the transaction that is calculated at mount time. We must guarantee that the +log has this much space available before the transaction is allowed to proceed +so that when we come to write the dirty metadata into the log we don't run out +of log space half way through the write. + +For one-shot transactions, a single unit space reservation is all that is +required for the transaction to proceed. For permanent transactions, however, we +also have a "log count" that affects the size of the reservation that is to be +made. + +While a permanent transaction can get by with a single unit of space +reservation, it is somewhat inefficient to do this as it requires the +transaction rolling mechanism to re-reserve space on every transaction roll. We +know from the implementation of the permanent transactions how many transaction +rolls are likely for the common modifications that need to be made. + +For example, an inode allocation is typically two transactions - one to +physically allocate a free inode chunk on disk, and another to allocate an inode +from an inode chunk that has free inodes in it. Hence for an inode allocation +transaction, we might set the reservation log count to a value of 2 to indicate +that the common/fast path transaction will commit two linked transactions in a +chain. Each time a permanent transaction rolls, it consumes an entire unit +reservation. + +Hence when the permanent transaction is first allocated, the log space +reservation is increased from a single unit reservation to multiple unit +reservations. That multiple is defined by the reservation log count, and this +means we can roll the transaction multiple times before we have to re-reserve +log space when we roll the transaction. This ensures that the common +modifications we make only need to reserve log space once. + +If the log count for a permanent transaction reaches zero, then it needs to +re-reserve physical space in the log. This is somewhat complex, and requires +an understanding of how the log accounts for space that has been reserved. + + +Log Space Accounting +==================== + +The position in the log is typically referred to as a Log Sequence Number (LSN). +The log is circular, so the positions in the log are defined by the combination +of a cycle number - the number of times the log has been overwritten - and the +offset into the log. A LSN carries the cycle in the upper 32 bits and the +offset in the lower 32 bits. The offset is in units of "basic blocks" (512 +bytes). Hence we can do realtively simple LSN based math to keep track of +available space in the log. + +Log space accounting is done via a pair of constructs called "grant heads". The +position of the grant heads is an absolute value, so the amount of space +available in the log is defined by the distance between the position of the +grant head and the current log tail. That is, how much space can be +reserved/consumed before the grant heads would fully wrap the log and overtake +the tail position. + +The first grant head is the "reserve" head. This tracks the byte count of the +reservations currently held by active transactions. It is a purely in-memory +accounting of the space reservation and, as such, actually tracks byte offsets +into the log rather than basic blocks. 
Hence it technically isn't using LSNs to
+represent the log position, but it is still treated like a split {cycle,offset}
+tuple for the purposes of tracking reservation space.
+
+The reserve grant head is used to accurately account for exact transaction
+reservation amounts and the exact byte count that modifications actually make
+and need to write into the log. The reserve head is used to prevent new
+transactions from taking new reservations when the head reaches the current
+tail. It will block new reservations in a FIFO queue and as the log tail moves
+forward it will wake them in order once sufficient space is available. This FIFO
+mechanism ensures no transaction is starved of resources when log space
+shortages occur.
+
+The other grant head is the "write" head. Unlike the reserve head, this grant
+head contains an LSN and it tracks the physical space usage in the log. While
+this might sound like it is accounting the same state as the reserve grant head
+- and it mostly does track exactly the same location as the reserve grant head -
+there are critical differences in behaviour between them that provide the
+forwards progress guarantees that rolling permanent transactions require.
+
+These differences become apparent when a permanent transaction is rolled and
+the internal "log count" reaches zero and the initial set of unit reservations
+has been exhausted. At this point, we still require a log space reservation to
+continue the next transaction in the sequence, but we have none remaining. We
+cannot sleep during the transaction commit process waiting for new log space to
+become available, as we may end up on the end of the FIFO queue and the items
+we have locked while we sleep could end up pinning the tail of the log before
+there is enough free space in the log to fulfill all of the pending
+reservations and then wake up the transaction commit in progress.
+
+To take a new reservation without sleeping requires us to be able to take a
+reservation even if there is no reservation space currently available. That is,
+we need to be able to *overcommit* the log reservation space. As has already
+been detailed, we cannot overcommit physical log space. However, the reserve
+grant head does not track physical space - it only accounts for the amount of
+reservations we currently have outstanding. Hence if the reserve head passes
+over the tail of the log all it means is that new reservations will be throttled
+immediately and remain throttled until the log tail is moved forward far enough
+to remove the overcommit and start taking new reservations. In other words, we
+can overcommit the reserve head without violating the physical log head and tail
+rules.
+
+As a result, permanent transactions only "regrant" reservation space during
+xfs_trans_commit() calls, while the physical log space reservation - tracked by
+the write head - is then reserved separately by a call to xfs_log_reserve()
+after the commit completes. Once the commit completes, we can sleep waiting for
+physical log space to be reserved from the write grant head, but only if one
+critical rule has been observed::
+
+        Code using permanent reservations must always log the items they hold
+        locked across each transaction they roll in the chain.
+
+"Re-logging" the locked items on every transaction roll ensures that the items
+attached to the transaction chain being rolled are always relocated to the
+physical head of the log and so do not pin the tail of the log.
If a locked item +pins the tail of the log when we sleep on the write reservation, then we will +deadlock the log as we cannot take the locks needed to write back that item and +move the tail of the log forwards to free up write grant space. Re-logging the +locked items avoids this deadlock and guarantees that the log reservation we are +making cannot self-deadlock. + +If all rolling transactions obey this rule, then they can all make forwards +progress independently because nothing will block the progress of the log +tail moving forwards and hence ensuring that write grant space is always +(eventually) made available to permanent transactions no matter how many times +they roll. + + +Re-logging Explained +==================== + +XFS allows multiple separate modifications to a single object to be carried in +the log at any given time. This allows the log to avoid needing to flush each +change to disk before recording a new change to the object. XFS does this via a +method called "re-logging". Conceptually, this is quite simple - all it requires +is that any new change to the object is recorded with a *new copy* of all the +existing changes in the new transaction that is written to the log. + +That is, if we have a sequence of changes A through to F, and the object was +written to disk after change D, we would see in the log the following series +of transactions, their contents and the log sequence number (LSN) of the +transaction:: + + Transaction Contents LSN + A A X + B A+B X+n + C A+B+C X+n+m + D A+B+C+D X+n+m+o + + E E Y (> X+n+m+o) + F E+F Y+p + +In other words, each time an object is relogged, the new transaction contains +the aggregation of all the previous changes currently held only in the log. + +This relogging technique allows objects to be moved forward in the log so that +an object being relogged does not prevent the tail of the log from ever moving +forward. This can be seen in the table above by the changing (increasing) LSN +of each subsequent transaction, and it's the technique that allows us to +implement long-running, multiple-commit permanent transactions. + +A typical example of a rolling transaction is the removal of extents from an +inode which can only be done at a rate of two extents per transaction because +of reservation size limitations. Hence a rolling extent removal transaction +keeps relogging the inode and btree buffers as they get modified in each +removal operation. This keeps them moving forward in the log as the operation +progresses, ensuring that current operation never gets blocked by itself if the +log wraps around. + +Hence it can be seen that the relogging operation is fundamental to the correct +working of the XFS journalling subsystem. From the above description, most +people should be able to see why the XFS metadata operations writes so much to +the log - repeated operations to the same objects write the same changes to +the log over and over again. Worse is the fact that objects tend to get +dirtier as they get relogged, so each subsequent transaction is writing more +metadata into the log. + +It should now also be obvious how relogging and asynchronous transactions go +hand in hand. That is, transactions don't get written to the physical journal +until either a log buffer is filled (a log buffer can hold multiple +transactions) or a synchronous operation forces the log buffers holding the +transactions to disk. 
This means that XFS is doing aggregation of transactions
+in memory - batching them, if you like - to minimise the impact of the log IO on
+transaction throughput.
+
+The limitation on asynchronous transaction throughput is the number and size of
+log buffers made available by the log manager. By default there are 8 log
+buffers available and the size of each is 32kB - the size can be increased up
+to 256kB by use of a mount option.
+
+Effectively, this gives us the maximum bound of outstanding metadata changes
+that can be made to the filesystem at any point in time - if all the log
+buffers are full and under IO, then no more transactions can be committed until
+the current batch completes. It is now common for a single current CPU core to
+be able to issue enough transactions to keep the log buffers full and under
+IO permanently. Hence the XFS journalling subsystem can be considered to be IO
+bound.
+
+Delayed Logging: Concepts
+=========================
+
+The key thing to note about the asynchronous logging combined with the
+relogging technique XFS uses is that we can be relogging changed objects
+multiple times before they are committed to disk in the log buffers. If we
+return to the previous relogging example, it is entirely possible that
+transactions A through D are committed to disk in the same log buffer.
+
+That is, a single log buffer may contain multiple copies of the same object,
+but only one of those copies needs to be there - the last one "D", as it
+contains all of the previous changes. In other words, we have one
+necessary copy in the log buffer, and three stale copies that are simply
+wasting space. When we are doing repeated operations on the same set of
+objects, these "stale objects" can be over 90% of the space used in the log
+buffers. It is clear that reducing the number of stale objects written to the
+log would greatly reduce the amount of metadata we write to the log, and this
+is the fundamental goal of delayed logging.
+
+From a conceptual point of view, XFS is already doing relogging in memory (where
+memory == log buffer), only it is doing it extremely inefficiently. It is using
+logical to physical formatting to do the relogging because there is no
+infrastructure to keep track of logical changes in memory prior to physically
+formatting the changes in a transaction to the log buffer. Hence we cannot avoid
+accumulating stale objects in the log buffers.
+
+Delayed logging is the name we've given to keeping and tracking transactional
+changes to objects in memory outside the log buffer infrastructure. Because of
+the relogging concept fundamental to the XFS journalling subsystem, this is
+actually relatively easy to do - all the changes to logged items are already
+tracked in the current infrastructure. The big problem is how to accumulate
+them and get them to the log in a consistent, recoverable manner.
+Describing the problems and how they have been solved is the focus of this
+document.
+
+One of the key changes that delayed logging makes to the operation of the
+journalling subsystem is that it disassociates the amount of outstanding
+metadata changes from the size and number of log buffers available. In other
+words, instead of there only being a maximum of 2MB of transaction changes not
+written to the log at any point in time, there may be a much greater amount
+being accumulated in memory. Hence the potential for loss of metadata on a
+crash is much greater than for the existing logging mechanism.
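+
+To put a rough number on the stale copy problem described above, the following
+standalone userspace sketch compares the bytes written when every relogged copy
+of an item reaches the log buffers against keeping only the latest copy. The
+item's starting size and growth rate are invented purely for illustration::
+
+    #include <stdio.h>
+
+    int main(void)
+    {
+        unsigned long physical = 0;     /* every relogged copy is written */
+        unsigned long latest = 0;       /* only the final copy is written */
+        unsigned long size = 512;       /* assumed initial formatted size */
+
+        /* assume 100 relogs, with the item growing by 128 bytes each time */
+        for (int i = 0; i < 100; i++) {
+            physical += size;
+            latest = size;
+            size += 128;
+        }
+
+        printf("all copies: %lu bytes, latest copy only: %lu bytes\n",
+               physical, latest);
+        printf("stale copies: %.0f%% of the physically logged bytes\n",
+               100.0 * (physical - latest) / physical);
+        return 0;
+    }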
+
+It should be noted that this does not change the guarantee that log recovery
+will result in a consistent filesystem. What it does mean is that as far as the
+recovered filesystem is concerned, there may be many thousands of transactions
+that simply did not occur as a result of the crash. This makes it even more
+important that applications that care about their data use fsync() where they
+need to ensure application level data integrity is maintained.
+
+It should be noted that delayed logging is not an innovative new concept that
+warrants rigorous proofs to determine whether it is correct or not. The method
+of accumulating changes in memory for some period before writing them to the
+log is used effectively in many filesystems including ext3 and ext4. Hence
+no time is spent in this document trying to convince the reader that the
+concept is sound. Instead it is simply considered a "solved problem" and as
+such implementing it in XFS is purely an exercise in software engineering.
+
+The fundamental requirements for delayed logging in XFS are simple:
+
+  1. Reduce the amount of metadata written to the log by at least
+     an order of magnitude.
+  2. Supply sufficient statistics to validate Requirement #1.
+  3. Supply sufficient new tracing infrastructure to be able to debug
+     problems with the new code.
+  4. No on-disk format change (metadata or log format).
+  5. Enable and disable with a mount option.
+  6. No performance regressions for synchronous transaction workloads.
+
+Delayed Logging: Design
+=======================
+
+Storing Changes
+---------------
+
+The problem with accumulating changes at a logical level (i.e. just using the
+existing log item dirty region tracking) is that when it comes to writing the
+changes to the log buffers, we need to ensure that the object we are formatting
+is not changing while we do this. This requires locking the object to prevent
+concurrent modification. Hence flushing the logical changes to the log would
+require us to lock every object, format them, and then unlock them again.
+
+This introduces lots of scope for deadlocks with transactions that are already
+running. For example, a transaction has object A locked and modified, but needs
+the delayed logging tracking lock to commit the transaction. However, the
+flushing thread has the delayed logging tracking lock already held, and is
+trying to get the lock on object A to flush it to the log buffer. This appears
+to be an unsolvable deadlock condition, and it was solving this problem that
+was the barrier to implementing delayed logging for so long.
+
+The solution is relatively simple - it just took a long time to recognise it.
+Put simply, the current logging code formats the changes to each item into a
+vector array that points to the changed regions in the item. The log write code
+simply copies the memory these vectors point to into the log buffer during
+transaction commit while the item is locked in the transaction. Instead of
+using the log buffer as the destination of the formatting code, we can use an
+allocated memory buffer big enough to fit the formatted vector.
+
+If we then copy the vector into the memory buffer and rewrite the vector to
+point to the memory buffer rather than the object itself, we now have a copy of
+the changes in a format that is compatible with the log buffer writing code
+that does not require us to lock the item to access.
This formatting and +rewriting can all be done while the object is locked during transaction commit, +resulting in a vector that is transactionally consistent and can be accessed +without needing to lock the owning item. + +Hence we avoid the need to lock items when we need to flush outstanding +asynchronous transactions to the log. The differences between the existing +formatting method and the delayed logging formatting can be seen in the +diagram below. + +Current format log vector:: + + Object +---------------------------------------------+ + Vector 1 +----+ + Vector 2 +----+ + Vector 3 +----------+ + +After formatting:: + + Log Buffer +-V1-+-V2-+----V3----+ + +Delayed logging vector:: + + Object +---------------------------------------------+ + Vector 1 +----+ + Vector 2 +----+ + Vector 3 +----------+ + +After formatting:: + + Memory Buffer +-V1-+-V2-+----V3----+ + Vector 1 +----+ + Vector 2 +----+ + Vector 3 +----------+ + +The memory buffer and associated vector need to be passed as a single object, +but still need to be associated with the parent object so if the object is +relogged we can replace the current memory buffer with a new memory buffer that +contains the latest changes. + +The reason for keeping the vector around after we've formatted the memory +buffer is to support splitting vectors across log buffer boundaries correctly. +If we don't keep the vector around, we do not know where the region boundaries +are in the item, so we'd need a new encapsulation method for regions in the log +buffer writing (i.e. double encapsulation). This would be an on-disk format +change and as such is not desirable. It also means we'd have to write the log +region headers in the formatting stage, which is problematic as there is per +region state that needs to be placed into the headers during the log write. + +Hence we need to keep the vector, but by attaching the memory buffer to it and +rewriting the vector addresses to point at the memory buffer we end up with a +self-describing object that can be passed to the log buffer write code to be +handled in exactly the same manner as the existing log vectors are handled. +Hence we avoid needing a new on-disk format to handle items that have been +relogged in memory. + + +Tracking Changes +---------------- + +Now that we can record transactional changes in memory in a form that allows +them to be used without limitations, we need to be able to track and accumulate +them so that they can be written to the log at some later point in time. The +log item is the natural place to store this vector and buffer, and also makes sense +to be the object that is used to track committed objects as it will always +exist once the object has been included in a transaction. + +The log item is already used to track the log items that have been written to +the log but not yet written to disk. Such log items are considered "active" +and as such are stored in the Active Item List (AIL) which is a LSN-ordered +double linked list. Items are inserted into this list during log buffer IO +completion, after which they are unpinned and can be written to disk. An object +that is in the AIL can be relogged, which causes the object to be pinned again +and then moved forward in the AIL when the log buffer IO completes for that +transaction. + +Essentially, this shows that an item that is in the AIL can still be modified +and relogged, so any tracking must be separate to the AIL infrastructure. 
As +such, we cannot reuse the AIL list pointers for tracking committed items, nor +can we store state in any field that is protected by the AIL lock. Hence the +committed item tracking needs its own locks, lists and state fields in the log +item. + +Similar to the AIL, tracking of committed items is done through a new list +called the Committed Item List (CIL). The list tracks log items that have been +committed and have formatted memory buffers attached to them. It tracks objects +in transaction commit order, so when an object is relogged it is removed from +its place in the list and re-inserted at the tail. This is entirely arbitrary +and done to make it easy for debugging - the last items in the list are the +ones that are most recently modified. Ordering of the CIL is not necessary for +transactional integrity (as discussed in the next section) so the ordering is +done for convenience/sanity of the developers. + + +Delayed Logging: Checkpoints +---------------------------- + +When we have a log synchronisation event, commonly known as a "log force", +all the items in the CIL must be written into the log via the log buffers. +We need to write these items in the order that they exist in the CIL, and they +need to be written as an atomic transaction. The need for all the objects to be +written as an atomic transaction comes from the requirements of relogging and +log replay - all the changes in all the objects in a given transaction must +either be completely replayed during log recovery, or not replayed at all. If +a transaction is not replayed because it is not complete in the log, then +no later transactions should be replayed, either. + +To fulfill this requirement, we need to write the entire CIL in a single log +transaction. Fortunately, the XFS log code has no fixed limit on the size of a +transaction, nor does the log replay code. The only fundamental limit is that +the transaction cannot be larger than just under half the size of the log. The +reason for this limit is that to find the head and tail of the log, there must +be at least one complete transaction in the log at any given time. If a +transaction is larger than half the log, then there is the possibility that a +crash during the write of a such a transaction could partially overwrite the +only complete previous transaction in the log. This will result in a recovery +failure and an inconsistent filesystem and hence we must enforce the maximum +size of a checkpoint to be slightly less than a half the log. + +Apart from this size requirement, a checkpoint transaction looks no different +to any other transaction - it contains a transaction header, a series of +formatted log items and a commit record at the tail. From a recovery +perspective, the checkpoint transaction is also no different - just a lot +bigger with a lot more items in it. The worst case effect of this is that we +might need to tune the recovery transaction object hash size. + +Because the checkpoint is just another transaction and all the changes to log +items are stored as log vectors, we can use the existing log buffer writing +code to write the changes into the log. To do this efficiently, we need to +minimise the time we hold the CIL locked while writing the checkpoint +transaction. 
The current log write code enables us to do this easily with the +way it separates the writing of the transaction contents (the log vectors) from +the transaction commit record, but tracking this requires us to have a +per-checkpoint context that travels through the log write process through to +checkpoint completion. + +Hence a checkpoint has a context that tracks the state of the current +checkpoint from initiation to checkpoint completion. A new context is initiated +at the same time a checkpoint transaction is started. That is, when we remove +all the current items from the CIL during a checkpoint operation, we move all +those changes into the current checkpoint context. We then initialise a new +context and attach that to the CIL for aggregation of new transactions. + +This allows us to unlock the CIL immediately after transfer of all the +committed items and effectively allows new transactions to be issued while we +are formatting the checkpoint into the log. It also allows concurrent +checkpoints to be written into the log buffers in the case of log force heavy +workloads, just like the existing transaction commit code does. This, however, +requires that we strictly order the commit records in the log so that +checkpoint sequence order is maintained during log replay. + +To ensure that we can be writing an item into a checkpoint transaction at +the same time another transaction modifies the item and inserts the log item +into the new CIL, then checkpoint transaction commit code cannot use log items +to store the list of log vectors that need to be written into the transaction. +Hence log vectors need to be able to be chained together to allow them to be +detached from the log items. That is, when the CIL is flushed the memory +buffer and log vector attached to each log item needs to be attached to the +checkpoint context so that the log item can be released. In diagrammatic form, +the CIL would look like this before the flush:: + + CIL Head + | + V + Log Item <-> log vector 1 -> memory buffer + | -> vector array + V + Log Item <-> log vector 2 -> memory buffer + | -> vector array + V + ...... + | + V + Log Item <-> log vector N-1 -> memory buffer + | -> vector array + V + Log Item <-> log vector N -> memory buffer + -> vector array + +And after the flush the CIL head is empty, and the checkpoint context log +vector list would look like:: + + Checkpoint Context + | + V + log vector 1 -> memory buffer + | -> vector array + | -> Log Item + V + log vector 2 -> memory buffer + | -> vector array + | -> Log Item + V + ...... + | + V + log vector N-1 -> memory buffer + | -> vector array + | -> Log Item + V + log vector N -> memory buffer + -> vector array + -> Log Item + +Once this transfer is done, the CIL can be unlocked and new transactions can +start, while the checkpoint flush code works over the log vector chain to +commit the checkpoint. + +Once the checkpoint is written into the log buffers, the checkpoint context is +attached to the log buffer that the commit record was written to along with a +completion callback. Log IO completion will call that callback, which can then +run transaction committed processing for the log items (i.e. insert into AIL +and unpin) in the log vector chain and then free the log vector chain and +checkpoint context. + +Discussion Point: I am uncertain as to whether the log item is the most +efficient way to track vectors, even though it seems like the natural way to do +it. 
The fact that we walk the log items (in the CIL) just to chain the log +vectors and break the link between the log item and the log vector means that +we take a cache line hit for the log item list modification, then another for +the log vector chaining. If we track by the log vectors, then we only need to +break the link between the log item and the log vector, which means we should +dirty only the log item cachelines. Normally I wouldn't be concerned about one +vs two dirty cachelines except for the fact I've seen upwards of 80,000 log +vectors in one checkpoint transaction. I'd guess this is a "measure and +compare" situation that can be done after a working and reviewed implementation +is in the dev tree.... + +Delayed Logging: Checkpoint Sequencing +-------------------------------------- + +One of the key aspects of the XFS transaction subsystem is that it tags +committed transactions with the log sequence number of the transaction commit. +This allows transactions to be issued asynchronously even though there may be +future operations that cannot be completed until that transaction is fully +committed to the log. In the rare case that a dependent operation occurs (e.g. +re-using a freed metadata extent for a data extent), a special, optimised log +force can be issued to force the dependent transaction to disk immediately. + +To do this, transactions need to record the LSN of the commit record of the +transaction. This LSN comes directly from the log buffer the transaction is +written into. While this works just fine for the existing transaction +mechanism, it does not work for delayed logging because transactions are not +written directly into the log buffers. Hence some other method of sequencing +transactions is required. + +As discussed in the checkpoint section, delayed logging uses per-checkpoint +contexts, and as such it is simple to assign a sequence number to each +checkpoint. Because the switching of checkpoint contexts must be done +atomically, it is simple to ensure that each new context has a monotonically +increasing sequence number assigned to it without the need for an external +atomic counter - we can just take the current context sequence number and add +one to it for the new context. + +Then, instead of assigning a log buffer LSN to the transaction commit LSN +during the commit, we can assign the current checkpoint sequence. This allows +operations that track transactions that have not yet completed know what +checkpoint sequence needs to be committed before they can continue. As a +result, the code that forces the log to a specific LSN now needs to ensure that +the log forces to a specific checkpoint. + +To ensure that we can do this, we need to track all the checkpoint contexts +that are currently committing to the log. When we flush a checkpoint, the +context gets added to a "committing" list which can be searched. When a +checkpoint commit completes, it is removed from the committing list. Because +the checkpoint context records the LSN of the commit record for the checkpoint, +we can also wait on the log buffer that contains the commit record, thereby +using the existing log force mechanisms to execute synchronous forces. + +It should be noted that the synchronous forces may need to be extended with +mitigation algorithms similar to the current log buffer code to allow +aggregation of multiple synchronous transactions if there are already +synchronous transactions being flushed. 
Investigation of the performance of the +current design is needed before making any decisions here. + +The main concern with log forces is to ensure that all the previous checkpoints +are also committed to disk before the one we need to wait for. Therefore we +need to check that all the prior contexts in the committing list are also +complete before waiting on the one we need to complete. We do this +synchronisation in the log force code so that we don't need to wait anywhere +else for such serialisation - it only matters when we do a log force. + +The only remaining complexity is that a log force now also has to handle the +case where the forcing sequence number is the same as the current context. That +is, we need to flush the CIL and potentially wait for it to complete. This is a +simple addition to the existing log forcing code to check the sequence numbers +and push if required. Indeed, placing the current sequence checkpoint flush in +the log force code enables the current mechanism for issuing synchronous +transactions to remain untouched (i.e. commit an asynchronous transaction, then +force the log at the LSN of that transaction) and so the higher level code +behaves the same regardless of whether delayed logging is being used or not. + +Delayed Logging: Checkpoint Log Space Accounting +------------------------------------------------ + +The big issue for a checkpoint transaction is the log space reservation for the +transaction. We don't know how big a checkpoint transaction is going to be +ahead of time, nor how many log buffers it will take to write out, nor the +number of split log vector regions are going to be used. We can track the +amount of log space required as we add items to the commit item list, but we +still need to reserve the space in the log for the checkpoint. + +A typical transaction reserves enough space in the log for the worst case space +usage of the transaction. The reservation accounts for log record headers, +transaction and region headers, headers for split regions, buffer tail padding, +etc. as well as the actual space for all the changed metadata in the +transaction. While some of this is fixed overhead, much of it is dependent on +the size of the transaction and the number of regions being logged (the number +of log vectors in the transaction). + +An example of the differences would be logging directory changes versus logging +inode changes. If you modify lots of inode cores (e.g. ``chmod -R g+w *``), then +there are lots of transactions that only contain an inode core and an inode log +format structure. That is, two vectors totaling roughly 150 bytes. If we modify +10,000 inodes, we have about 1.5MB of metadata to write in 20,000 vectors. Each +vector is 12 bytes, so the total to be logged is approximately 1.75MB. In +comparison, if we are logging full directory buffers, they are typically 4KB +each, so we in 1.5MB of directory buffers we'd have roughly 400 buffers and a +buffer format structure for each buffer - roughly 800 vectors or 1.51MB total +space. From this, it should be obvious that a static log space reservation is +not particularly flexible and is difficult to select the "optimal value" for +all workloads. + +Further, if we are going to use a static reservation, which bit of the entire +reservation does it cover? We account for space used by the transaction +reservation by tracking the space currently used by the object in the CIL and +then calculating the increase or decrease in space used as the object is +relogged. 
This allows for a checkpoint reservation to only have to account for
+log buffer metadata used such as log header records.
+
+However, even using a static reservation for just the log metadata is
+problematic. Typically log record headers use at least 16KB of log space per
+1MB of log space consumed (512 bytes per 32k) and the reservation needs to be
+large enough to handle arbitrary sized checkpoint transactions. This
+reservation needs to be made before the checkpoint is started, and we need to
+be able to reserve the space without sleeping. For an 8MB checkpoint, we need
+a reservation of around 150KB, which is a non-trivial amount of space.
+
+A static reservation needs to manipulate the log grant counters - we can take
+a permanent reservation on the space, but we still need to make sure we
+refresh the write reservation (the actual space available to the transaction)
+after every checkpoint transaction completion. Unfortunately, if this space is
+not available when required, then the regrant code will sleep waiting for it.
+
+The problem with this is that it can lead to deadlocks as we may need to
+commit checkpoints to be able to free up log space (refer back to the
+description of rolling transactions for an example of this). Hence we *must*
+always have space available in the log if we are to use static reservations,
+and that is very difficult and complex to arrange. It is possible to do, but
+there is a simpler way.
+
+The simpler way of doing this is tracking the entire log space used by the
+items in the CIL and using this to dynamically calculate the amount of log
+space required by the log metadata. If this log metadata space changes as a
+result of a transaction commit inserting a new memory buffer into the CIL,
+then the difference in space required is removed from the transaction that
+causes the change. Transactions at this level will *always* have enough space
+available in their reservation for this as they have already reserved the
+maximal amount of log metadata space they require, and such a delta
+reservation will always be less than or equal to the maximal amount in the
+reservation.
+
+Hence we can grow the checkpoint transaction reservation dynamically as items
+are added to the CIL and avoid the need for reserving and regranting log space
+up front. This avoids deadlocks and removes a blocking point from the
+checkpoint flush code. A simplified sketch of this accounting is given at the
+end of this section.
+
+As mentioned earlier, transactions can't grow to more than half the size of
+the log. Hence as part of the reservation growing, we need to also check the
+size of the reservation against the maximum allowed transaction size. If we
+reach the maximum threshold, we need to push the CIL to the log. This is
+effectively a "background flush" and is done on demand. This is identical to
+a CIL push triggered by a log force, except that there is no waiting for the
+checkpoint commit to complete. This background push is checked and executed by
+transaction commit code.
+
+If the transaction subsystem goes idle while we still have items in the CIL,
+they will be flushed by the periodic log force issued by the xfssyncd. This
+log force will push the CIL to disk, and if the transaction subsystem stays
+idle, allow the idle log to be covered (effectively marked clean) in exactly
+the same manner that is done for the existing logging method. A discussion
+point is whether this log force needs to be done more frequently than the
+current rate, which is once every 30s.
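+
+To make this concrete, the following sketch shows the general shape of the
+"steal the difference from the committing transaction" accounting described
+above. It is an illustration only - the structure and function names
+(``cil_context``, ``cil_item``, ``cil_account_item()``,
+``cil_push_background()``) and the push threshold parameter are hypothetical
+and greatly simplified, and items that shrink are ignored::
+
+    /*
+     * Hypothetical, simplified model of CIL space accounting.  None of
+     * these names correspond to the actual kernel implementation.
+     */
+    struct cil_context { long space_used; };
+    struct cil_item { int formatted_len; int accounted_len; };
+    struct transaction { long reservation; };
+
+    /* asynchronous "background flush" of the CIL */
+    void cil_push_background(struct cil_context *ctx);
+
+    void cil_account_item(struct cil_context *ctx, struct cil_item *item,
+                          struct transaction *tp, long push_limit)
+    {
+            int diff = item->formatted_len - item->accounted_len;
+
+            if (diff > 0) {
+                    /* steal the growth from the commit's reservation */
+                    tp->reservation -= diff;
+                    ctx->space_used += diff;
+                    item->accounted_len = item->formatted_len;
+            }
+
+            /* background flush once the pending checkpoint grows too big */
+            if (ctx->space_used > push_limit)
+                    cil_push_background(ctx);
+    }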
+ + +Delayed Logging: Log Item Pinning +--------------------------------- + +Currently log items are pinned during transaction commit while the items are +still locked. This happens just after the items are formatted, though it could +be done any time before the items are unlocked. The result of this mechanism is +that items get pinned once for every transaction that is committed to the log +buffers. Hence items that are relogged in the log buffers will have a pin count +for every outstanding transaction they were dirtied in. When each of these +transactions is completed, they will unpin the item once. As a result, the item +only becomes unpinned when all the transactions complete and there are no +pending transactions. Thus the pinning and unpinning of a log item is symmetric +as there is a 1:1 relationship with transaction commit and log item completion. + +For delayed logging, however, we have an asymmetric transaction commit to +completion relationship. Every time an object is relogged in the CIL it goes +through the commit process without a corresponding completion being registered. +That is, we now have a many-to-one relationship between transaction commit and +log item completion. The result of this is that pinning and unpinning of the +log items becomes unbalanced if we retain the "pin on transaction commit, unpin +on transaction completion" model. + +To keep pin/unpin symmetry, the algorithm needs to change to a "pin on +insertion into the CIL, unpin on checkpoint completion". In other words, the +pinning and unpinning becomes symmetric around a checkpoint context. We have to +pin the object the first time it is inserted into the CIL - if it is already in +the CIL during a transaction commit, then we do not pin it again. Because there +can be multiple outstanding checkpoint contexts, we can still see elevated pin +counts, but as each checkpoint completes the pin count will retain the correct +value according to its context. + +Just to make matters slightly more complex, this checkpoint level context +for the pin count means that the pinning of an item must take place under the +CIL commit/flush lock. If we pin the object outside this lock, we cannot +guarantee which context the pin count is associated with. This is because of +the fact pinning the item is dependent on whether the item is present in the +current CIL or not. If we don't pin the CIL first before we check and pin the +object, we have a race with CIL being flushed between the check and the pin +(or not pinning, as the case may be). Hence we must hold the CIL flush/commit +lock to guarantee that we pin the items correctly. + +Delayed Logging: Concurrent Scalability +--------------------------------------- + +A fundamental requirement for the CIL is that accesses through transaction +commits must scale to many concurrent commits. The current transaction commit +code does not break down even when there are transactions coming from 2048 +processors at once. The current transaction code does not go any faster than if +there was only one CPU using it, but it does not slow down either. + +As a result, the delayed logging transaction commit code needs to be designed +for concurrency from the ground up. It is obvious that there are serialisation +points in the design - the three important ones are: + + 1. Locking out new transaction commits while flushing the CIL + 2. Adding items to the CIL and updating item space accounting + 3. 
Checkpoint commit ordering + +Looking at the transaction commit and CIL flushing interactions, it is clear +that we have a many-to-one interaction here. That is, the only restriction on +the number of concurrent transactions that can be trying to commit at once is +the amount of space available in the log for their reservations. The practical +limit here is in the order of several hundred concurrent transactions for a +128MB log, which means that it is generally one per CPU in a machine. + +The amount of time a transaction commit needs to hold out a flush is a +relatively long period of time - the pinning of log items needs to be done +while we are holding out a CIL flush, so at the moment that means it is held +across the formatting of the objects into memory buffers (i.e. while memcpy()s +are in progress). Ultimately a two pass algorithm where the formatting is done +separately to the pinning of objects could be used to reduce the hold time of +the transaction commit side. + +Because of the number of potential transaction commit side holders, the lock +really needs to be a sleeping lock - if the CIL flush takes the lock, we do not +want every other CPU in the machine spinning on the CIL lock. Given that +flushing the CIL could involve walking a list of tens of thousands of log +items, it will get held for a significant time and so spin contention is a +significant concern. Preventing lots of CPUs spinning doing nothing is the +main reason for choosing a sleeping lock even though nothing in either the +transaction commit or CIL flush side sleeps with the lock held. + +It should also be noted that CIL flushing is also a relatively rare operation +compared to transaction commit for asynchronous transaction workloads - only +time will tell if using a read-write semaphore for exclusion will limit +transaction commit concurrency due to cache line bouncing of the lock on the +read side. + +The second serialisation point is on the transaction commit side where items +are inserted into the CIL. Because transactions can enter this code +concurrently, the CIL needs to be protected separately from the above +commit/flush exclusion. It also needs to be an exclusive lock but it is only +held for a very short time and so a spin lock is appropriate here. It is +possible that this lock will become a contention point, but given the short +hold time once per transaction I think that contention is unlikely. + +The final serialisation point is the checkpoint commit record ordering code +that is run as part of the checkpoint commit and log force sequencing. The code +path that triggers a CIL flush (i.e. whatever triggers the log force) will enter +an ordering loop after writing all the log vectors into the log buffers but +before writing the commit record. This loop walks the list of committing +checkpoints and needs to block waiting for checkpoints to complete their commit +record write. As a result it needs a lock and a wait variable. Log force +sequencing also requires the same lock, list walk, and blocking mechanism to +ensure completion of checkpoints. + +These two sequencing operations can use the mechanism even though the +events they are waiting for are different. The checkpoint commit record +sequencing needs to wait until checkpoint contexts contain a commit LSN +(obtained through completion of a commit record write) while log force +sequencing needs to wait until previous checkpoint contexts are removed from +the committing list (i.e. they've completed). 
A simple wait variable and +broadcast wakeups (thundering herds) has been used to implement these two +serialisation queues. They use the same lock as the CIL, too. If we see too +much contention on the CIL lock, or too many context switches as a result of +the broadcast wakeups these operations can be put under a new spinlock and +given separate wait lists to reduce lock contention and the number of processes +woken by the wrong event. + + +Lifecycle Changes +----------------- + +The existing log item life cycle is as follows:: + + 1. Transaction allocate + 2. Transaction reserve + 3. Lock item + 4. Join item to transaction + If not already attached, + Allocate log item + Attach log item to owner item + Attach log item to transaction + 5. Modify item + Record modifications in log item + 6. Transaction commit + Pin item in memory + Format item into log buffer + Write commit LSN into transaction + Unlock item + Attach transaction to log buffer + + + + + 7. Transaction completion + Mark log item committed + Insert log item into AIL + Write commit LSN into log item + Unpin log item + 8. AIL traversal + Lock item + Mark log item clean + Flush item to disk + + + + 9. Log item removed from AIL + Moves log tail + Item unlocked + +Essentially, steps 1-6 operate independently from step 7, which is also +independent of steps 8-9. An item can be locked in steps 1-6 or steps 8-9 +at the same time step 7 is occurring, but only steps 1-6 or 8-9 can occur +at the same time. If the log item is in the AIL or between steps 6 and 7 +and steps 1-6 are re-entered, then the item is relogged. Only when steps 8-9 +are entered and completed is the object considered clean. + +With delayed logging, there are new steps inserted into the life cycle:: + + 1. Transaction allocate + 2. Transaction reserve + 3. Lock item + 4. Join item to transaction + If not already attached, + Allocate log item + Attach log item to owner item + Attach log item to transaction + 5. Modify item + Record modifications in log item + 6. Transaction commit + Pin item in memory if not pinned in CIL + Format item into log vector + buffer + Attach log vector and buffer to log item + Insert log item into CIL + Write CIL context sequence into transaction + Unlock item + + + + 7. CIL push + lock CIL flush + Chain log vectors and buffers together + Remove items from CIL + unlock CIL flush + write log vectors into log + sequence commit records + attach checkpoint context to log buffer + + + + + 8. Checkpoint completion + Mark log item committed + Insert item into AIL + Write commit LSN into log item + Unpin log item + 9. AIL traversal + Lock item + Mark log item clean + Flush item to disk + + 10. Log item removed from AIL + Moves log tail + Item unlocked + +From this, it can be seen that the only life cycle differences between the two +logging methods are in the middle of the life cycle - they still have the same +beginning and end and execution constraints. The only differences are in the +committing of the log items to the log itself and the completion processing. +Hence delayed logging should not introduce any constraints on log item +behaviour, allocation or freeing that don't already exist. + +As a result of this zero-impact "insertion" of delayed logging infrastructure +and the design of the internal structures to avoid on disk format changes, we +can basically switch between delayed logging and the existing mechanism with a +mount option. 
Fundamentally, there is no reason why the log manager would not +be able to swap methods automatically and transparently depending on load +characteristics, but this should not be necessary if delayed logging works as +designed. diff --git a/Documentation/filesystems/xfs/xfs-maintainer-entry-profile.rst b/Documentation/filesystems/xfs/xfs-maintainer-entry-profile.rst new file mode 100644 index 000000000000..32b6ac4ca9d6 --- /dev/null +++ b/Documentation/filesystems/xfs/xfs-maintainer-entry-profile.rst @@ -0,0 +1,194 @@ +XFS Maintainer Entry Profile +============================ + +Overview +-------- +XFS is a well known high-performance filesystem in the Linux kernel. +The aim of this project is to provide and maintain a robust and +performant filesystem. + +Patches are generally merged to the for-next branch of the appropriate +git repository. +After a testing period, the for-next branch is merged to the master +branch. + +Kernel code are merged to the xfs-linux tree[0]. +Userspace code are merged to the xfsprogs tree[1]. +Test cases are merged to the xfstests tree[2]. +Ondisk format documentation are merged to the xfs-documentation tree[3]. + +All patchsets involving XFS *must* be cc'd in their entirety to the mailing +list linux-xfs@vger.kernel.org. + +Roles +----- +There are eight key roles in the XFS project. +A person can take on multiple roles, and a role can be filled by +multiple people. +Anyone taking on a role is advised to check in with themselves and +others on a regular basis about burnout. + +- **Outside Contributor**: Anyone who sends a patch but is not involved + in the XFS project on a regular basis. + These folks are usually people who work on other filesystems or + elsewhere in the kernel community. + +- **Developer**: Someone who is familiar with the XFS codebase enough to + write new code, documentation, and tests. + + Developers can often be found in the IRC channel mentioned by the ``C:`` + entry in the kernel MAINTAINERS file. + +- **Senior Developer**: A developer who is very familiar with at least + some part of the XFS codebase and/or other subsystems in the kernel. + These people collectively decide the long term goals of the project + and nudge the community in that direction. + They should help prioritize development and review work for each release + cycle. + + Senior developers tend to be more active participants in the IRC channel. + +- **Reviewer**: Someone (most likely also a developer) who reads code + submissions to decide: + + 0. Is the idea behind the contribution sound? + 1. Does the idea fit the goals of the project? + 2. Is the contribution designed correctly? + 3. Is the contribution polished? + 4. Can the contribution be tested effectively? + + Reviewers should identify themselves with an ``R:`` entry in the kernel + and fstests MAINTAINERS files. + +- **Testing Lead**: This person is responsible for setting the test + coverage goals of the project, negotiating with developers to decide + on new tests for new features, and making sure that developers and + release managers execute on the testing. + + The testing lead should identify themselves with an ``M:`` entry in + the XFS section of the fstests MAINTAINERS file. + +- **Bug Triager**: Someone who examines incoming bug reports in just + enough detail to identify the person to whom the report should be + forwarded. + + The bug triagers should identify themselves with a ``B:`` entry in + the kernel MAINTAINERS file. 
+ +- **Release Manager**: This person merges reviewed patchsets into an + integration branch, tests the result locally, pushes the branch to a + public git repository, and sends pull requests further upstream. + The release manager is not expected to work on new feature patchsets. + If a developer and a reviewer fail to reach a resolution on some point, + the release manager must have the ability to intervene to try to drive a + resolution. + + The release manager should identify themselves with an ``M:`` entry in + the kernel MAINTAINERS file. + +- **Community Manager**: This person calls and moderates meetings of as many + XFS participants as they can get when mailing list discussions prove + insufficient for collective decisionmaking. + They may also serve as liaison between managers of the organizations + sponsoring work on any part of XFS. + +- **LTS Maintainer**: Someone who backports and tests bug fixes from + uptream to the LTS kernels. + There tend to be six separate LTS trees at any given time. + + The maintainer for a given LTS release should identify themselves with an + ``M:`` entry in the MAINTAINERS file for that LTS tree. + Unmaintained LTS kernels should be marked with status ``S: Orphan`` in that + same file. + +Submission Checklist Addendum +----------------------------- +Please follow these additional rules when submitting to XFS: + +- Patches affecting only the filesystem itself should be based against + the latest -rc or the for-next branch. + These patches will be merged back to the for-next branch. + +- Authors of patches touching other subsystems need to coordinate with + the maintainers of XFS and the relevant subsystems to decide how to + proceed with a merge. + +- Any patchset changing XFS should be cc'd in its entirety to linux-xfs. + Do not send partial patchsets; that makes analysis of the broader + context of the changes unnecessarily difficult. + +- Anyone making kernel changes that have corresponding changes to the + userspace utilities should send the userspace changes as separate + patchsets immediately after the kernel patchsets. + +- Authors of bug fix patches are expected to use fstests[2] to perform + an A/B test of the patch to determine that there are no regressions. + When possible, a new regression test case should be written for + fstests. + +- Authors of new feature patchsets must ensure that fstests will have + appropriate functional and input corner-case test cases for the new + feature. + +- When implementing a new feature, it is strongly suggested that the + developers write a design document to answer the following questions: + + * **What** problem is this trying to solve? + + * **Who** will benefit from this solution, and **where** will they + access it? + + * **How** will this new feature work? This should touch on major data + structures and algorithms supporting the solution at a higher level + than code comments. + + * **What** userspace interfaces are necessary to build off of the new + features? + + * **How** will this work be tested to ensure that it solves the + problems laid out in the design document without causing new + problems? + + The design document should be committed in the kernel documentation + directory. + It may be omitted if the feature is already well known to the + community. + +- Patchsets for the new tests should be submitted as separate patchsets + immediately after the kernel and userspace code patchsets. 
+ +- Changes to the on-disk format of XFS must be described in the ondisk + format document[3] and submitted as a patchset after the fstests + patchsets. + +- Patchsets implementing bug fixes and further code cleanups should put + the bug fixes at the beginning of the series to ease backporting. + +Key Release Cycle Dates +----------------------- +Bug fixes may be sent at any time, though the release manager may decide to +defer a patch when the next merge window is close. + +Code submissions targeting the next merge window should be sent between +-rc1 and -rc6. +This gives the community time to review the changes, to suggest other changes, +and for the author to retest those changes. + +Code submissions also requiring changes to fs/iomap and targeting the +next merge window should be sent between -rc1 and -rc4. +This allows the broader kernel community adequate time to test the +infrastructure changes. + +Review Cadence +-------------- +In general, please wait at least one week before pinging for feedback. +To find reviewers, either consult the MAINTAINERS file, or ask +developers that have Reviewed-by tags for XFS changes to take a look and +offer their opinion. + +References +---------- +| [0] https://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git/ +| [1] https://git.kernel.org/pub/scm/fs/xfs/xfsprogs-dev.git/ +| [2] https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/ +| [3] https://git.kernel.org/pub/scm/fs/xfs/xfs-documentation.git/ diff --git a/Documentation/filesystems/xfs/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs/xfs-online-fsck-design.rst new file mode 100644 index 000000000000..352516feef6f --- /dev/null +++ b/Documentation/filesystems/xfs/xfs-online-fsck-design.rst @@ -0,0 +1,5315 @@ +.. SPDX-License-Identifier: GPL-2.0 +.. _xfs_online_fsck_design: + +.. + Mapping of heading styles within this document: + Heading 1 uses "====" above and below + Heading 2 uses "====" + Heading 3 uses "----" + Heading 4 uses "````" + Heading 5 uses "^^^^" + Heading 6 uses "~~~~" + Heading 7 uses "...." + + Sections are manually numbered because apparently that's what everyone + does in the kernel. + +====================== +XFS Online Fsck Design +====================== + +This document captures the design of the online filesystem check feature for +XFS. +The purpose of this document is threefold: + +- To help kernel distributors understand exactly what the XFS online fsck + feature is, and issues about which they should be aware. + +- To help people reading the code to familiarize themselves with the relevant + concepts and design points before they start digging into the code. + +- To help developers maintaining the system by capturing the reasons + supporting higher level decision making. + +As the online fsck code is merged, the links in this document to topic branches +will be replaced with links to code. + +This document is licensed under the terms of the GNU Public License, v2. +The primary author is Darrick J. Wong. + +This design document is split into seven parts. +Part 1 defines what fsck tools are and the motivations for writing a new one. +Parts 2 and 3 present a high level overview of how online fsck process works +and how it is tested to ensure correct functionality. +Part 4 discusses the user interface and the intended usage modes of the new +program. +Parts 5 and 6 show off the high level components and how they fit together, and +then present case studies of how each repair function actually works. 
+Part 7 sums up what has been discussed so far and speculates about what else +might be built atop online fsck. + +.. contents:: Table of Contents + :local: + +1. What is a Filesystem Check? +============================== + +A Unix filesystem has four main responsibilities: + +- Provide a hierarchy of names through which application programs can associate + arbitrary blobs of data for any length of time, + +- Virtualize physical storage media across those names, and + +- Retrieve the named data blobs at any time. + +- Examine resource usage. + +Metadata directly supporting these functions (e.g. files, directories, space +mappings) are sometimes called primary metadata. +Secondary metadata (e.g. reverse mapping and directory parent pointers) support +operations internal to the filesystem, such as internal consistency checking +and reorganization. +Summary metadata, as the name implies, condense information contained in +primary metadata for performance reasons. + +The filesystem check (fsck) tool examines all the metadata in a filesystem +to look for errors. +In addition to looking for obvious metadata corruptions, fsck also +cross-references different types of metadata records with each other to look +for inconsistencies. +People do not like losing data, so most fsck tools also contains some ability +to correct any problems found. +As a word of caution -- the primary goal of most Linux fsck tools is to restore +the filesystem metadata to a consistent state, not to maximize the data +recovered. +That precedent will not be challenged here. + +Filesystems of the 20th century generally lacked any redundancy in the ondisk +format, which means that fsck can only respond to errors by erasing files until +errors are no longer detected. +More recent filesystem designs contain enough redundancy in their metadata that +it is now possible to regenerate data structures when non-catastrophic errors +occur; this capability aids both strategies. + ++--------------------------------------------------------------------------+ +| **Note**: | ++--------------------------------------------------------------------------+ +| System administrators avoid data loss by increasing the number of | +| separate storage systems through the creation of backups; and they avoid | +| downtime by increasing the redundancy of each storage system through the | +| creation of RAID arrays. | +| fsck tools address only the first problem. | ++--------------------------------------------------------------------------+ + +TLDR; Show Me the Code! +----------------------- + +Code is posted to the kernel.org git trees as follows: +`kernel changes `_, +`userspace changes `_, and +`QA test changes `_. +Each kernel patchset adding an online repair function will use the same branch +name across the kernel, xfsprogs, and fstests git repos. + +Existing Tools +-------------- + +The online fsck tool described here will be the third tool in the history of +XFS (on Linux) to check and repair filesystems. +Two programs precede it: + +The first program, ``xfs_check``, was created as part of the XFS debugger +(``xfs_db``) and can only be used with unmounted filesystems. +It walks all metadata in the filesystem looking for inconsistencies in the +metadata, though it lacks any ability to repair what it finds. +Due to its high memory requirements and inability to repair things, this +program is now deprecated and will not be discussed further. + +The second program, ``xfs_repair``, was created to be faster and more robust +than the first program. 
+Like its predecessor, it can only be used with unmounted filesystems. +It uses extent-based in-memory data structures to reduce memory consumption, +and tries to schedule readahead IO appropriately to reduce I/O waiting time +while it scans the metadata of the entire filesystem. +The most important feature of this tool is its ability to respond to +inconsistencies in file metadata and directory tree by erasing things as needed +to eliminate problems. +Space usage metadata are rebuilt from the observed file metadata. + +Problem Statement +----------------- + +The current XFS tools leave several problems unsolved: + +1. **User programs** suddenly **lose access** to the filesystem when unexpected + shutdowns occur as a result of silent corruptions in the metadata. + These occur **unpredictably** and often without warning. + +2. **Users** experience a **total loss of service** during the recovery period + after an **unexpected shutdown** occurs. + +3. **Users** experience a **total loss of service** if the filesystem is taken + offline to **look for problems** proactively. + +4. **Data owners** cannot **check the integrity** of their stored data without + reading all of it. + This may expose them to substantial billing costs when a linear media scan + performed by the storage system administrator might suffice. + +5. **System administrators** cannot **schedule** a maintenance window to deal + with corruptions if they **lack the means** to assess filesystem health + while the filesystem is online. + +6. **Fleet monitoring tools** cannot **automate periodic checks** of filesystem + health when doing so requires **manual intervention** and downtime. + +7. **Users** can be tricked into **doing things they do not desire** when + malicious actors **exploit quirks of Unicode** to place misleading names + in directories. + +Given this definition of the problems to be solved and the actors who would +benefit, the proposed solution is a third fsck tool that acts on a running +filesystem. + +This new third program has three components: an in-kernel facility to check +metadata, an in-kernel facility to repair metadata, and a userspace driver +program to drive fsck activity on a live filesystem. +``xfs_scrub`` is the name of the driver program. +The rest of this document presents the goals and use cases of the new fsck +tool, describes its major design points in connection to those goals, and +discusses the similarities and differences with existing tools. + ++--------------------------------------------------------------------------+ +| **Note**: | ++--------------------------------------------------------------------------+ +| Throughout this document, the existing offline fsck tool can also be | +| referred to by its current name "``xfs_repair``". | +| The userspace driver program for the new online fsck tool can be | +| referred to as "``xfs_scrub``". | +| The kernel portion of online fsck that validates metadata is called | +| "online scrub", and portion of the kernel that fixes metadata is called | +| "online repair". | ++--------------------------------------------------------------------------+ + +The naming hierarchy is broken up into objects known as directories and files +and the physical space is split into pieces known as allocation groups. +Sharding enables better performance on highly parallel systems and helps to +contain the damage when corruptions occur. 
+The division of the filesystem into principal objects (allocation groups and
+inodes) means that there are ample opportunities to perform targeted checks
+and repairs on a subset of the filesystem.
+
+While this is going on, other parts of the filesystem continue processing IO
+requests.
+Even if a piece of filesystem metadata can only be regenerated by scanning the
+entire system, the scan can still be done in the background while other file
+operations continue.
+
+In summary, online fsck takes advantage of resource sharding and redundant
+metadata to enable targeted checking and repair operations while the system
+is running.
+This capability will be coupled to automatic system management so that
+autonomous self-healing of XFS maximizes service availability.
+
+2. Theory of Operation
+======================
+
+Because it is necessary for online fsck to lock and scan live metadata
+objects, online fsck consists of three separate code components.
+The first is the userspace driver program ``xfs_scrub``, which is responsible
+for identifying individual metadata items, scheduling work items for them,
+reacting to the outcomes appropriately, and reporting results to the system
+administrator.
+The second and third are in the kernel, which implements functions to check
+and repair each type of online fsck work item.
+
++------------------------------------------------------------------+
+| **Note**:                                                         |
++------------------------------------------------------------------+
+| For brevity, this document shortens the phrase "online fsck work  |
+| item" to "scrub item".                                            |
++------------------------------------------------------------------+
+
+Scrub item types are delineated in a manner consistent with the Unix design
+philosophy, which is to say that each item should handle one aspect of a
+metadata structure, and handle it well.
+
+Scope
+-----
+
+In principle, online fsck should be able to check and to repair everything
+that the offline fsck program can handle.
+However, online fsck cannot be running 100% of the time, which means that
+latent errors may creep in after a scrub completes.
+If these errors cause the next mount to fail, offline fsck is the only
+solution.
+This limitation means that maintenance of the offline fsck tool will continue.
+A second limitation of online fsck is that it must follow the same resource
+sharing and lock acquisition rules as the regular filesystem.
+This means that scrub cannot take *any* shortcuts to save time, because doing
+so could lead to concurrency problems.
+In other words, online fsck is not a complete replacement for offline fsck,
+and a complete run of online fsck may take longer than a run of offline fsck.
+However, both of these limitations are acceptable tradeoffs to satisfy the
+different motivations of online fsck, which are to **minimize system downtime**
+and to **increase predictability of operation**.
+
+.. _scrubphases:
+
+Phases of Work
+--------------
+
+The userspace driver program ``xfs_scrub`` splits the work of checking and
+repairing an entire filesystem into seven phases.
+Each phase concentrates on checking specific types of scrub items and depends
+on the success of all previous phases.
+The seven phases are as follows:
+
+1. Collect geometry information about the mounted filesystem and computer,
+   discover the online fsck capabilities of the kernel, and open the
+   underlying storage devices.
+
+2. Check allocation group metadata, all realtime volume metadata, and all
+   quota files.
+   Each metadata structure is scheduled as a separate scrub item.
+   If corruption is found in the inode header or inode btree and ``xfs_scrub``
+   is permitted to perform repairs, then those scrub items are repaired to
+   prepare for phase 3.
+   Repairs are implemented by using the information in the scrub item to
+   resubmit the kernel scrub call with the repair flag enabled; this is
+   discussed in the next section.
+   Optimizations and all other repairs are deferred to phase 4.
+
+3. Check all metadata of every file in the filesystem.
+   Each metadata structure is also scheduled as a separate scrub item.
+   If repairs are needed and ``xfs_scrub`` is permitted to perform repairs,
+   and there were no problems detected during phase 2, then those scrub items
+   are repaired immediately.
+   Optimizations, deferred repairs, and unsuccessful repairs are deferred to
+   phase 4.
+
+4. All remaining repairs and scheduled optimizations are performed during this
+   phase, if the caller permits them.
+   Before starting repairs, the summary counters are checked and any necessary
+   repairs are performed so that subsequent repairs will not fail the resource
+   reservation step due to wildly incorrect summary counters.
+   Unsuccessful repairs are requeued as long as forward progress on repairs is
+   made somewhere in the filesystem.
+   Free space in the filesystem is trimmed at the end of phase 4 if the
+   filesystem is clean.
+
+5. By the start of this phase, all primary and secondary filesystem metadata
+   must be correct.
+   Summary counters such as the free space counts and quota resource counts
+   are checked and corrected.
+   Directory entry names and extended attribute names are checked for
+   suspicious entries such as control characters or confusing Unicode
+   sequences appearing in names.
+
+6. If the caller asks for a media scan, read all allocated and written data
+   file extents in the filesystem.
+   The ability to use hardware-assisted data file integrity checking is new
+   to online fsck; neither of the previous tools has this capability.
+   If media errors occur, they will be mapped to the owning files and reported.
+
+7. Re-check the summary counters and present the caller with a summary of
+   space usage and file counts.
+
+This allocation of responsibilities will be :ref:`revisited `
+later in this document.
+
+Steps for Each Scrub Item
+-------------------------
+
+The kernel scrub code uses a three-step strategy for checking and repairing
+the one aspect of a metadata object represented by a scrub item (a simplified
+sketch of this flow follows the list):
+
+1. The scrub item of interest is checked for corruptions; opportunities for
+   optimization; and for values that are directly controlled by the system
+   administrator but look suspicious.
+   If the item is not corrupt or does not need optimization, resources are
+   released and the positive scan results are returned to userspace.
+   If the item is corrupt or could be optimized but the caller does not permit
+   this, resources are released and the negative scan results are returned to
+   userspace.
+   Otherwise, the kernel moves on to the second step.
+
+2. The repair function is called to rebuild the data structure.
+   Repair functions generally choose to rebuild a structure from other
+   metadata rather than try to salvage the existing structure.
+   If the repair fails, the scan results from the first step are returned to
+   userspace.
+   Otherwise, the kernel moves on to the third step.
+
+3. In the third step, the kernel runs the same checks over the new metadata
+   item to assess the efficacy of the repairs.
+   The results of the reassessment are returned to userspace.
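+
+The decision logic above can be summarized with a short sketch.
+Everything in it (the item type, the helper functions, and the outcome codes)
+is hypothetical and serves only to illustrate the flow; it is not the kernel's
+actual scrub entry point::
+
+    /* Hypothetical sketch of the three-step strategy; not kernel code. */
+    #include <stdbool.h>
+
+    struct scrub_item;
+
+    /* returns true if the item is corrupt or could be optimized */
+    bool check_item(struct scrub_item *sc);
+    /* returns true if rebuilding the structure from other metadata worked */
+    bool repair_item(struct scrub_item *sc);
+
+    enum scrub_outcome { SCRUB_CLEAN, SCRUB_DIRTY, SCRUB_FIXED };
+
+    enum scrub_outcome scrub_one_item(struct scrub_item *sc,
+                                      bool repairs_allowed)
+    {
+            /* Step 1: look for corruption or optimization opportunities. */
+            if (!check_item(sc))
+                    return SCRUB_CLEAN;
+            if (!repairs_allowed)
+                    return SCRUB_DIRTY;
+
+            /* Step 2: rebuild the structure rather than salvage it. */
+            if (!repair_item(sc))
+                    return SCRUB_DIRTY;
+
+            /* Step 3: re-run the same checks to assess the repair. */
+            return check_item(sc) ? SCRUB_DIRTY : SCRUB_FIXED;
+    }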
+ +Classification of Metadata +-------------------------- + +Each type of metadata object (and therefore each type of scrub item) is +classified as follows: + +Primary Metadata +```````````````` + +Metadata structures in this category should be most familiar to filesystem +users either because they are directly created by the user or they index +objects created by the user +Most filesystem objects fall into this class: + +- Free space and reference count information + +- Inode records and indexes + +- Storage mapping information for file data + +- Directories + +- Extended attributes + +- Symbolic links + +- Quota limits + +Scrub obeys the same rules as regular filesystem accesses for resource and lock +acquisition. + +Primary metadata objects are the simplest for scrub to process. +The principal filesystem object (either an allocation group or an inode) that +owns the item being scrubbed is locked to guard against concurrent updates. +The check function examines every record associated with the type for obvious +errors and cross-references healthy records against other metadata to look for +inconsistencies. +Repairs for this class of scrub item are simple, since the repair function +starts by holding all the resources acquired in the previous step. +The repair function scans available metadata as needed to record all the +observations needed to complete the structure. +Next, it stages the observations in a new ondisk structure and commits it +atomically to complete the repair. +Finally, the storage from the old data structure are carefully reaped. + +Because ``xfs_scrub`` locks a primary object for the duration of the repair, +this is effectively an offline repair operation performed on a subset of the +filesystem. +This minimizes the complexity of the repair code because it is not necessary to +handle concurrent updates from other threads, nor is it necessary to access +any other part of the filesystem. +As a result, indexed structures can be rebuilt very quickly, and programs +trying to access the damaged structure will be blocked until repairs complete. +The only infrastructure needed by the repair code are the staging area for +observations and a means to write new structures to disk. +Despite these limitations, the advantage that online repair holds is clear: +targeted work on individual shards of the filesystem avoids total loss of +service. + +This mechanism is described in section 2.1 ("Off-Line Algorithm") of +V. Srinivasan and M. J. Carey, `"Performance of On-Line Index Construction +Algorithms" `_, +*Extending Database Technology*, pp. 293-309, 1992. + +Most primary metadata repair functions stage their intermediate results in an +in-memory array prior to formatting the new ondisk structure, which is very +similar to the list-based algorithm discussed in section 2.3 ("List-Based +Algorithms") of Srinivasan. +However, any data structure builder that maintains a resource lock for the +duration of the repair is *always* an offline algorithm. + +.. _secondary_metadata: + +Secondary Metadata +`````````````````` + +Metadata structures in this category reflect records found in primary metadata, +but are only needed for online fsck or for reorganization of the filesystem. + +Secondary metadata include: + +- Reverse mapping information + +- Directory parent pointers + +This class of metadata is difficult for scrub to process because scrub attaches +to the secondary object but needs to check primary metadata, which runs counter +to the usual order of resource acquisition. 
+Frequently, this means that full filesystem scans are necessary to rebuild the
+metadata.
+Check functions can be limited in scope to reduce runtime.
+Repairs, however, require a full scan of primary metadata, which can take a
+long time to complete.
+Under these conditions, ``xfs_scrub`` cannot lock resources for the entire
+duration of the repair.
+
+Instead, repair functions set up an in-memory staging structure to store
+observations.
+Depending on the requirements of the specific repair function, the staging
+index will either have the same format as the ondisk structure or a design
+specific to that repair function.
+The next step is to release all locks and start the filesystem scan.
+When the repair scanner needs to record an observation, the staging data are
+locked long enough to apply the update.
+While the filesystem scan is in progress, the repair function hooks the
+filesystem so that it can apply pending filesystem updates to the staging
+information.
+Once the scan is done, the owning object is re-locked, the live data is used
+to write a new ondisk structure, and the repairs are committed atomically.
+The hooks are disabled and the staging area is freed.
+Finally, the storage from the old data structure is carefully reaped.
+
+Introducing concurrency helps online repair avoid various locking problems,
+but comes at a high cost to code complexity.
+Live filesystem code has to be hooked so that the repair function can observe
+updates in progress.
+The staging area has to become a fully functional parallel structure so that
+updates can be merged from the hooks.
+Finally, the hook, the filesystem scan, and the inode locking model must be
+sufficiently well integrated that a hook event can decide if a given update
+should be applied to the staging structure.
+
+In theory, the scrub implementation could apply these same techniques for
+primary metadata, but doing so would make it massively more complex and less
+performant.
+Programs attempting to access the damaged structures are not blocked from
+operation, which may cause application failure or an unplanned filesystem
+shutdown.
+
+Inspiration for the secondary metadata repair strategy was drawn from section
+2.4 of Srinivasan above, and sections 2 ("NSF: Index Build Without Side-File")
+and 3.1.1 ("Duplicate Key Insert Problem") in C. Mohan, `"Algorithms for
+Creating Indexes for Very Large Tables Without Quiescing Updates"
+`_, 1992.
+
+The sidecar index mentioned above bears some resemblance to the side file
+method mentioned in Srinivasan and Mohan.
+Their method consists of an index builder that extracts relevant record data
+to build the new structure as quickly as possible; and an auxiliary structure
+that captures all updates that would be committed to the index by other
+threads were the new index already online.
+After the index building scan finishes, the updates recorded in the side file
+are applied to the new index.
+To avoid conflicts between the index builder and other writer threads, the
+builder maintains a publicly visible cursor that tracks the progress of the
+scan through the record space.
+To avoid duplication of work between the side file and the index builder, side
+file updates are elided when the record ID for the update is greater than the
+cursor position within the record ID space.
+
+To minimize changes to the rest of the codebase, XFS online repair keeps the
+replacement index hidden until it's completely ready to go.
+In other words, there is no attempt to expose the keyspace of the new index +while repair is running. +The complexity of such an approach would be very high and perhaps more +appropriate to building *new* indices. + +**Future Work Question**: Can the full scan and live update code used to +facilitate a repair also be used to implement a comprehensive check? + +*Answer*: In theory, yes. Check would be much stronger if each scrub function +employed these live scans to build a shadow copy of the metadata and then +compared the shadow records to the ondisk records. +However, doing that is a fair amount more work than what the checking functions +do now. +The live scans and hooks were developed much later. +That in turn increases the runtime of those scrub functions. + +Summary Information +``````````````````` + +Metadata structures in this last category summarize the contents of primary +metadata records. +These are often used to speed up resource usage queries, and are many times +smaller than the primary metadata which they represent. + +Examples of summary information include: + +- Summary counts of free space and inodes + +- File link counts from directories + +- Quota resource usage counts + +Check and repair require full filesystem scans, but resource and lock +acquisition follow the same paths as regular filesystem accesses. + +The superblock summary counters have special requirements due to the underlying +implementation of the incore counters, and will be treated separately. +Check and repair of the other types of summary counters (quota resource counts +and file link counts) employ the same filesystem scanning and hooking +techniques as outlined above, but because the underlying data are sets of +integer counters, the staging data need not be a fully functional mirror of the +ondisk structure. + +Inspiration for quota and file link count repair strategies were drawn from +sections 2.12 ("Online Index Operations") through 2.14 ("Incremental View +Maintenance") of G. Graefe, `"Concurrent Queries and Updates in Summary Views +and Their Indexes" +`_, 2011. + +Since quotas are non-negative integer counts of resource usage, online +quotacheck can use the incremental view deltas described in section 2.14 to +track pending changes to the block and inode usage counts in each transaction, +and commit those changes to a dquot side file when the transaction commits. +Delta tracking is necessary for dquots because the index builder scans inodes, +whereas the data structure being rebuilt is an index of dquots. +Link count checking combines the view deltas and commit step into one because +it sets attributes of the objects being scanned instead of writing them to a +separate data structure. +Each online fsck function will be discussed as case studies later in this +document. + +Risk Management +--------------- + +During the development of online fsck, several risk factors were identified +that may make the feature unsuitable for certain distributors and users. +Steps can be taken to mitigate or eliminate those risks, though at a cost to +functionality. + +- **Decreased performance**: Adding metadata indices to the filesystem + increases the time cost of persisting changes to disk, and the reverse space + mapping and directory parent pointers are no exception. + System administrators who require the maximum performance can disable the + reverse mapping features at format time, though this choice dramatically + reduces the ability of online fsck to find inconsistencies and repair them. 
+ +- **Incorrect repairs**: As with all software, there might be defects in the + software that result in incorrect repairs being written to the filesystem. + Systematic fuzz testing (detailed in the next section) is employed by the + authors to find bugs early, but it might not catch everything. + The kernel build system provides Kconfig options (``CONFIG_XFS_ONLINE_SCRUB`` + and ``CONFIG_XFS_ONLINE_REPAIR``) to enable distributors to choose not to + accept this risk. + The xfsprogs build system has a configure option (``--enable-scrub=no``) that + disables building of the ``xfs_scrub`` binary, though this is not a risk + mitigation if the kernel functionality remains enabled. + +- **Inability to repair**: Sometimes, a filesystem is too badly damaged to be + repairable. + If the keyspaces of several metadata indices overlap in some manner but a + coherent narrative cannot be formed from records collected, then the repair + fails. + To reduce the chance that a repair will fail with a dirty transaction and + render the filesystem unusable, the online repair functions have been + designed to stage and validate all new records before committing the new + structure. + +- **Misbehavior**: Online fsck requires many privileges -- raw IO to block + devices, opening files by handle, ignoring Unix discretionary access control, + and the ability to perform administrative changes. + Running this automatically in the background scares people, so the systemd + background service is configured to run with only the privileges required. + Obviously, this cannot address certain problems like the kernel crashing or + deadlocking, but it should be sufficient to prevent the scrub process from + escaping and reconfiguring the system. + The cron job does not have this protection. + +- **Fuzz Kiddiez**: There are many people now who seem to think that running + automated fuzz testing of ondisk artifacts to find mischievous behavior and + spraying exploit code onto the public mailing list for instant zero-day + disclosure is somehow of some social benefit. + In the view of this author, the benefit is realized only when the fuzz + operators help to **fix** the flaws, but this opinion apparently is not + widely shared among security "researchers". + The XFS maintainers' continuing ability to manage these events presents an + ongoing risk to the stability of the development process. + Automated testing should front-load some of the risk while the feature is + considered EXPERIMENTAL. + +Many of these risks are inherent to software programming. +Despite this, it is hoped that this new functionality will prove useful in +reducing unexpected downtime. + +3. Testing Plan +=============== + +As stated before, fsck tools have three main goals: + +1. Detect inconsistencies in the metadata; + +2. Eliminate those inconsistencies; and + +3. Minimize further loss of data. + +Demonstrations of correct operation are necessary to build users' confidence +that the software behaves within expectations. +Unfortunately, it was not really feasible to perform regular exhaustive testing +of every aspect of a fsck tool until the introduction of low-cost virtual +machines with high-IOPS storage. +With ample hardware availability in mind, the testing strategy for the online +fsck project involves differential analysis against the existing fsck tools and +systematic testing of every attribute of every type of metadata object. +Testing can be split into four major categories, as discussed below. 
+
+Integrated Testing with fstests
+-------------------------------
+
+The primary goal of any free software QA effort is to make testing as
+inexpensive and widespread as possible to maximize the scaling advantages of
+community.
+In other words, testing should maximize the breadth of filesystem configuration
+scenarios and hardware setups.
+This improves code quality by enabling the authors of online fsck to find and
+fix bugs early, and helps developers of new features to find integration
+issues earlier in their development effort.
+
+The Linux filesystem community shares a common QA testing suite,
+`fstests `_, for
+functional and regression testing.
+Even before development work began on online fsck, fstests (when run on XFS)
+would run both the ``xfs_check`` and ``xfs_repair -n`` commands on the test and
+scratch filesystems between each test.
+This provides a level of assurance that the kernel and the fsck tools stay in
+alignment about what constitutes consistent metadata.
+During development of the online checking code, fstests was modified to run
+``xfs_scrub -n`` between each test to ensure that the new checking code
+produces the same results as the two existing fsck tools.
+
+To start development of online repair, fstests was modified to run
+``xfs_repair`` to rebuild the filesystem's metadata indices between tests.
+This ensures that offline repair does not crash, leave a corrupt filesystem
+after it exits, or trigger complaints from the online check.
+This also established a baseline for what can and cannot be repaired offline.
+To complete the first phase of development of online repair, fstests was
+modified to be able to run ``xfs_scrub`` in a "force rebuild" mode.
+This enables a comparison of the effectiveness of online repair against the
+existing offline repair tools.
+
+General Fuzz Testing of Metadata Blocks
+---------------------------------------
+
+XFS benefits greatly from having a very robust debugging tool, ``xfs_db``.
+
+Before development of online fsck even began, a set of fstests was created
+to test the rather common fault that entire metadata blocks get corrupted.
+This required the creation of fstests library code that can create a filesystem
+containing every possible type of metadata object.
+Next, individual test cases were created to create a test filesystem, identify
+a single block of a specific type of metadata object, trash it with the
+existing ``blocktrash`` command in ``xfs_db``, and test the reaction of a
+particular metadata validation strategy.
+
+This earlier test suite enabled XFS developers to test the ability of the
+in-kernel validation functions and the ability of the offline fsck tool to
+detect and eliminate the inconsistent metadata.
+This part of the test suite was extended to cover online fsck in exactly the
+same manner.
+
+In other words, for a given fstests filesystem configuration:
+
+* For each metadata object existing on the filesystem:
+
+  * Write garbage to it
+
+  * Test the reactions of:
+
+    1. The kernel verifiers to stop obviously bad metadata
+    2. Offline repair (``xfs_repair``) to detect and fix
+    3. Online repair (``xfs_scrub``) to detect and fix
+
+Targeted Fuzz Testing of Metadata Records
+-----------------------------------------
+
+The testing plan for online fsck includes extending the existing fs testing
+infrastructure to provide a much more powerful facility: targeted fuzz testing
+of every metadata field of every metadata object in the filesystem.
+``xfs_db`` can modify every field of every metadata structure in every +block in the filesystem to simulate the effects of memory corruption and +software bugs. +Given that fstests already contains the ability to create a filesystem +containing every metadata format known to the filesystem, ``xfs_db`` can be +used to perform exhaustive fuzz testing! + +For a given fstests filesystem configuration: + +* For each metadata object existing on the filesystem... + + * For each record inside that metadata object... + + * For each field inside that record... + + * For each conceivable type of transformation that can be applied to a bit field... + + 1. Clear all bits + 2. Set all bits + 3. Toggle the most significant bit + 4. Toggle the middle bit + 5. Toggle the least significant bit + 6. Add a small quantity + 7. Subtract a small quantity + 8. Randomize the contents + + * ...test the reactions of: + + 1. The kernel verifiers to stop obviously bad metadata + 2. Offline checking (``xfs_repair -n``) + 3. Offline repair (``xfs_repair``) + 4. Online checking (``xfs_scrub -n``) + 5. Online repair (``xfs_scrub``) + 6. Both repair tools (``xfs_scrub`` and then ``xfs_repair`` if online repair doesn't succeed) + +This is quite the combinatoric explosion! + +Fortunately, having this much test coverage makes it easy for XFS developers to +check the responses of XFS' fsck tools. +Since the introduction of the fuzz testing framework, these tests have been +used to discover incorrect repair code and missing functionality for entire +classes of metadata objects in ``xfs_repair``. +The enhanced testing was used to finalize the deprecation of ``xfs_check`` by +confirming that ``xfs_repair`` could detect at least as many corruptions as +the older tool. + +These tests have been very valuable for ``xfs_scrub`` in the same ways -- they +allow the online fsck developers to compare online fsck against offline fsck, +and they enable XFS developers to find deficiencies in the code base. + +Proposed patchsets include +`general fuzzer improvements +`_, +`fuzzing baselines +`_, +and `improvements in fuzz testing comprehensiveness +`_. + +Stress Testing +-------------- + +A unique requirement to online fsck is the ability to operate on a filesystem +concurrently with regular workloads. +Although it is of course impossible to run ``xfs_scrub`` with *zero* observable +impact on the running system, the online repair code should never introduce +inconsistencies into the filesystem metadata, and regular workloads should +never notice resource starvation. +To verify that these conditions are being met, fstests has been enhanced in +the following ways: + +* For each scrub item type, create a test to exercise checking that item type + while running ``fsstress``. +* For each scrub item type, create a test to exercise repairing that item type + while running ``fsstress``. +* Race ``fsstress`` and ``xfs_scrub -n`` to ensure that checking the whole + filesystem doesn't cause problems. +* Race ``fsstress`` and ``xfs_scrub`` in force-rebuild mode to ensure that + force-repairing the whole filesystem doesn't cause problems. +* Race ``xfs_scrub`` in check and force-repair mode against ``fsstress`` while + freezing and thawing the filesystem. +* Race ``xfs_scrub`` in check and force-repair mode against ``fsstress`` while + remounting the filesystem read-only and read-write. +* The same, but running ``fsx`` instead of ``fsstress``. (Not done yet?) 
+ +Success is defined by the ability to run all of these tests without observing +any unexpected filesystem shutdowns due to corrupted metadata, kernel hang +check warnings, or any other sort of mischief. + +Proposed patchsets include `general stress testing +`_ +and the `evolution of existing per-function stress testing +`_. + +4. User Interface +================= + +The primary user of online fsck is the system administrator, just like offline +repair. +Online fsck presents two modes of operation to administrators: +A foreground CLI process for online fsck on demand, and a background service +that performs autonomous checking and repair. + +Checking on Demand +------------------ + +For administrators who want the absolute freshest information about the +metadata in a filesystem, ``xfs_scrub`` can be run as a foreground process on +a command line. +The program checks every piece of metadata in the filesystem while the +administrator waits for the results to be reported, just like the existing +``xfs_repair`` tool. +Both tools share a ``-n`` option to perform a read-only scan, and a ``-v`` +option to increase the verbosity of the information reported. + +A new feature of ``xfs_scrub`` is the ``-x`` option, which employs the error +correction capabilities of the hardware to check data file contents. +The media scan is not enabled by default because it may dramatically increase +program runtime and consume a lot of bandwidth on older storage hardware. + +The output of a foreground invocation is captured in the system log. + +The ``xfs_scrub_all`` program walks the list of mounted filesystems and +initiates ``xfs_scrub`` for each of them in parallel. +It serializes scans for any filesystems that resolve to the same top level +kernel block device to prevent resource overconsumption. + +Background Service +------------------ + +To reduce the workload of system administrators, the ``xfs_scrub`` package +provides a suite of `systemd `_ timers and services that +run online fsck automatically on weekends by default. +The background service configures scrub to run with as little privilege as +possible, the lowest CPU and IO priority, and in a CPU-constrained single +threaded mode. +This can be tuned by the systemd administrator at any time to suit the latency +and throughput requirements of customer workloads. + +The output of the background service is also captured in the system log. +If desired, reports of failures (either due to inconsistencies or mere runtime +errors) can be emailed automatically by setting the ``EMAIL_ADDR`` environment +variable in the following service files: + +* ``xfs_scrub_fail@.service`` +* ``xfs_scrub_media_fail@.service`` +* ``xfs_scrub_all_fail.service`` + +The decision to enable the background scan is left to the system administrator. +This can be done by enabling either of the following services: + +* ``xfs_scrub_all.timer`` on systemd systems +* ``xfs_scrub_all.cron`` on non-systemd systems + +This automatic weekly scan is configured out of the box to perform an +additional media scan of all file data once per month. +This is less foolproof than, say, storing file data block checksums, but much +more performant if application software provides its own integrity checking, +redundancy can be provided elsewhere above the filesystem, or the storage +device's integrity guarantees are deemed sufficient. 
+
+The systemd unit file definitions have been subjected to a security audit
+(as of systemd 249) to ensure that the xfs_scrub processes have as little
+access to the rest of the system as possible.
+This was performed via ``systemd-analyze security``, after which privileges
+were restricted to the minimum required, sandboxing and system call filtering
+were set up to the maximal extent possible, and access to the filesystem tree
+was restricted to the minimum needed to start the program and access the
+filesystem being scanned.
+The service definition files restrict CPU usage to 80% of one CPU core, and
+apply as low an IO and CPU scheduling priority as possible.
+This measure was taken to minimize delays in the rest of the filesystem.
+No such hardening has been performed for the cron job.
+
+Proposed patchset:
+`Enabling the xfs_scrub background service
+`_.
+
+Health Reporting
+----------------
+
+XFS caches a summary of each filesystem's health status in memory.
+The information is updated whenever ``xfs_scrub`` is run, or whenever
+inconsistencies are detected in the filesystem metadata during regular
+operations.
+System administrators should use the ``health`` command of ``xfs_spaceman`` to
+retrieve this information in a human-readable format.
+If problems have been observed, the administrator can schedule a reduced
+service window to run the online repair tool to correct the problem.
+Failing that, the administrator can decide to schedule a maintenance window to
+run the traditional offline repair tool to correct the problem.
+
+**Future Work Question**: Should the health reporting integrate with the new
+inotify fs error notification system?
+Would it be helpful for sysadmins to have a daemon to listen for corruption
+notifications and initiate a repair?
+
+*Answer*: These questions remain unanswered, but should be a part of the
+conversation with early adopters and potential downstream users of XFS.
+
+Proposed patchsets include
+`wiring up health reports to correction returns
+`_
+and
+`preservation of sickness info during memory reclaim
+`_.
+
+5. Kernel Algorithms and Data Structures
+========================================
+
+This section discusses the key algorithms and data structures of the kernel
+code that provide the ability to check and repair metadata while the system
+is running.
+The first chapters in this section reveal the pieces that provide the
+foundation for checking metadata.
+The remainder of this section presents the mechanisms through which XFS
+regenerates itself.
+
+Self Describing Metadata
+------------------------
+
+Starting with XFS version 5 in 2012, XFS updated the format of nearly every
+ondisk block header to record a magic number, a checksum, a universally
+"unique" identifier (UUID), an owner code, the ondisk address of the block,
+and a log sequence number.
+When loading a block buffer from disk, the magic number, UUID, owner, and
+ondisk address confirm that the retrieved block matches the specific owner of
+the current filesystem, and that the information contained in the block is
+supposed to be found at the ondisk address.
+The first three components enable checking tools to disregard alleged metadata
+that doesn't belong to the filesystem, and the fourth component enables the
+filesystem to detect lost writes.
+
+Whenever a file system operation modifies a block, the change is submitted
+to the log as part of a transaction.
+The log then processes these transactions marking them done once they are +safely persisted to storage. +The logging code maintains the checksum and the log sequence number of the last +transactional update. +Checksums are useful for detecting torn writes and other discrepancies that can +be introduced between the computer and its storage devices. +Sequence number tracking enables log recovery to avoid applying out of date +log updates to the filesystem. + +These two features improve overall runtime resiliency by providing a means for +the filesystem to detect obvious corruption when reading metadata blocks from +disk, but these buffer verifiers cannot provide any consistency checking +between metadata structures. + +For more information, please see the documentation for +Documentation/filesystems/xfs/xfs-self-describing-metadata.rst + +Reverse Mapping +--------------- + +The original design of XFS (circa 1993) is an improvement upon 1980s Unix +filesystem design. +In those days, storage density was expensive, CPU time was scarce, and +excessive seek time could kill performance. +For performance reasons, filesystem authors were reluctant to add redundancy to +the filesystem, even at the cost of data integrity. +Filesystems designers in the early 21st century choose different strategies to +increase internal redundancy -- either storing nearly identical copies of +metadata, or more space-efficient encoding techniques. + +For XFS, a different redundancy strategy was chosen to modernize the design: +a secondary space usage index that maps allocated disk extents back to their +owners. +By adding a new index, the filesystem retains most of its ability to scale +well to heavily threaded workloads involving large datasets, since the primary +file metadata (the directory tree, the file block map, and the allocation +groups) remain unchanged. +Like any system that improves redundancy, the reverse-mapping feature increases +overhead costs for space mapping activities. +However, it has two critical advantages: first, the reverse index is key to +enabling online fsck and other requested functionality such as free space +defragmentation, better media failure reporting, and filesystem shrinking. +Second, the different ondisk storage format of the reverse mapping btree +defeats device-level deduplication because the filesystem requires real +redundancy. + ++--------------------------------------------------------------------------+ +| **Sidebar**: | ++--------------------------------------------------------------------------+ +| A criticism of adding the secondary index is that it does nothing to | +| improve the robustness of user data storage itself. | +| This is a valid point, but adding a new index for file data block | +| checksums increases write amplification by turning data overwrites into | +| copy-writes, which age the filesystem prematurely. | +| In keeping with thirty years of precedent, users who want file data | +| integrity can supply as powerful a solution as they require. | +| As for metadata, the complexity of adding a new secondary index of space | +| usage is much less than adding volume management and storage device | +| mirroring to XFS itself. | +| Perfection of RAID and volume management are best left to existing | +| layers in the kernel. | ++--------------------------------------------------------------------------+ + +The information captured in a reverse space mapping record is as follows: + +.. 
code-block:: c + + struct xfs_rmap_irec { + xfs_agblock_t rm_startblock; /* extent start block */ + xfs_extlen_t rm_blockcount; /* extent length */ + uint64_t rm_owner; /* extent owner */ + uint64_t rm_offset; /* offset within the owner */ + unsigned int rm_flags; /* state flags */ + }; + +The first two fields capture the location and size of the physical space, +in units of filesystem blocks. +The owner field tells scrub which metadata structure or file inode have been +assigned this space. +For space allocated to files, the offset field tells scrub where the space was +mapped within the file fork. +Finally, the flags field provides extra information about the space usage -- +is this an attribute fork extent? A file mapping btree extent? Or an +unwritten data extent? + +Online filesystem checking judges the consistency of each primary metadata +record by comparing its information against all other space indices. +The reverse mapping index plays a key role in the consistency checking process +because it contains a centralized alternate copy of all space allocation +information. +Program runtime and ease of resource acquisition are the only real limits to +what online checking can consult. +For example, a file data extent mapping can be checked against: + +* The absence of an entry in the free space information. +* The absence of an entry in the inode index. +* The absence of an entry in the reference count data if the file is not + marked as having shared extents. +* The correspondence of an entry in the reverse mapping information. + +There are several observations to make about reverse mapping indices: + +1. Reverse mappings can provide a positive affirmation of correctness if any of + the above primary metadata are in doubt. + The checking code for most primary metadata follows a path similar to the + one outlined above. + +2. Proving the consistency of secondary metadata with the primary metadata is + difficult because that requires a full scan of all primary space metadata, + which is very time intensive. + For example, checking a reverse mapping record for a file extent mapping + btree block requires locking the file and searching the entire btree to + confirm the block. + Instead, scrub relies on rigorous cross-referencing during the primary space + mapping structure checks. + +3. Consistency scans must use non-blocking lock acquisition primitives if the + required locking order is not the same order used by regular filesystem + operations. + For example, if the filesystem normally takes a file ILOCK before taking + the AGF buffer lock but scrub wants to take a file ILOCK while holding + an AGF buffer lock, scrub cannot block on that second acquisition. + This means that forward progress during this part of a scan of the reverse + mapping data cannot be guaranteed if system load is heavy. + +In summary, reverse mappings play a key role in reconstruction of primary +metadata. +The details of how these records are staged, written to disk, and committed +into the filesystem are covered in subsequent sections. + +Checking and Cross-Referencing +------------------------------ + +The first step of checking a metadata structure is to examine every record +contained within the structure and its relationship with the rest of the +system. +XFS contains multiple layers of checking to try to prevent inconsistent +metadata from wreaking havoc on the system. 
+Each of these layers contributes information that helps the kernel to make +three decisions about the health of a metadata structure: + +- Is a part of this structure obviously corrupt (``XFS_SCRUB_OFLAG_CORRUPT``) ? +- Is this structure inconsistent with the rest of the system + (``XFS_SCRUB_OFLAG_XCORRUPT``) ? +- Is there so much damage around the filesystem that cross-referencing is not + possible (``XFS_SCRUB_OFLAG_XFAIL``) ? +- Can the structure be optimized to improve performance or reduce the size of + metadata (``XFS_SCRUB_OFLAG_PREEN``) ? +- Does the structure contain data that is not inconsistent but deserves review + by the system administrator (``XFS_SCRUB_OFLAG_WARNING``) ? + +The following sections describe how the metadata scrubbing process works. + +Metadata Buffer Verification +```````````````````````````` + +The lowest layer of metadata protection in XFS are the metadata verifiers built +into the buffer cache. +These functions perform inexpensive internal consistency checking of the block +itself, and answer these questions: + +- Does the block belong to this filesystem? + +- Does the block belong to the structure that asked for the read? + This assumes that metadata blocks only have one owner, which is always true + in XFS. + +- Is the type of data stored in the block within a reasonable range of what + scrub is expecting? + +- Does the physical location of the block match the location it was read from? + +- Does the block checksum match the data? + +The scope of the protections here are very limited -- verifiers can only +establish that the filesystem code is reasonably free of gross corruption bugs +and that the storage system is reasonably competent at retrieval. +Corruption problems observed at runtime cause the generation of health reports, +failed system calls, and in the extreme case, filesystem shutdowns if the +corrupt metadata force the cancellation of a dirty transaction. + +Every online fsck scrubbing function is expected to read every ondisk metadata +block of a structure in the course of checking the structure. +Corruption problems observed during a check are immediately reported to +userspace as corruption; during a cross-reference, they are reported as a +failure to cross-reference once the full examination is complete. +Reads satisfied by a buffer already in cache (and hence already verified) +bypass these checks. + +Internal Consistency Checks +``````````````````````````` + +After the buffer cache, the next level of metadata protection is the internal +record verification code built into the filesystem. +These checks are split between the buffer verifiers, the in-filesystem users of +the buffer cache, and the scrub code itself, depending on the amount of higher +level context required. +The scope of checking is still internal to the block. +These higher level checking functions answer these questions: + +- Does the type of data stored in the block match what scrub is expecting? + +- Does the block belong to the owning structure that asked for the read? + +- If the block contains records, do the records fit within the block? + +- If the block tracks internal free space information, is it consistent with + the record areas? + +- Are the records contained inside the block free of obvious corruptions? + +Record checks in this category are more rigorous and more time-intensive. +For example, block pointers and inumbers are checked to ensure that they point +within the dynamically allocated parts of an allocation group and within +the filesystem. 
+Names are checked for invalid characters, and flags are checked for invalid
+combinations.
+Other record attributes are checked for sensible values.
+Btree records spanning an interval of the btree keyspace are checked for
+correct order and lack of mergeability (except for file fork mappings).
+For performance reasons, regular code may skip some of these checks unless
+debugging is enabled or a write is about to occur.
+Scrub functions, of course, must check all possible problems.
+
+Validation of Userspace-Controlled Record Attributes
+````````````````````````````````````````````````````
+
+Various pieces of filesystem metadata are directly controlled by userspace.
+Because of this nature, validation work cannot be more precise than checking
+that a value is within the possible range.
+These fields include:
+
+- Superblock fields controlled by mount options
+- Filesystem labels
+- File timestamps
+- File permissions
+- File size
+- File flags
+- Names present in directory entries, extended attribute keys, and filesystem
+  labels
+- Extended attribute key namespaces
+- Extended attribute values
+- File data block contents
+- Quota limits
+- Quota timer expiration (if resource usage exceeds the soft limit)
+
+Cross-Referencing Space Metadata
+````````````````````````````````
+
+After internal block checks, the next higher level of checking is
+cross-referencing records between metadata structures.
+For regular runtime code, the cost of these checks is considered to be
+prohibitively expensive, but as scrub is dedicated to rooting out
+inconsistencies, it must pursue all avenues of inquiry.
+The exact set of cross-referencing is highly dependent on the context of the
+data structure being checked.
+
+The XFS btree code has keyspace scanning functions that online fsck uses to
+cross reference one structure with another.
+Specifically, scrub can scan the key space of an index to determine if that
+keyspace is fully, sparsely, or not at all mapped to records.
+For the reverse mapping btree, it is possible to mask parts of the key for the
+purposes of performing a keyspace scan so that scrub can decide if the rmap
+btree contains records mapping a certain extent of physical space without the
+sparseness of the rest of the rmap keyspace getting in the way.
+
+Btree blocks undergo the following checks before cross-referencing:
+
+- Does the type of data stored in the block match what scrub is expecting?
+
+- Does the block belong to the owning structure that asked for the read?
+
+- Do the records fit within the block?
+
+- Are the records contained inside the block free of obvious corruptions?
+
+- Are the name hashes in the correct order?
+
+- Do node pointers within the btree point to valid block addresses for the type
+  of btree?
+
+- Do child pointers point towards the leaves?
+
+- Do sibling pointers point across the same level?
+
+- For each node block record, does the record key accurately reflect the
+  contents of the child block?
+
+Space allocation records are cross-referenced as follows:
+
+1. Any space mentioned by any metadata structure is cross-referenced as
+   follows:
+
+   - Does the reverse mapping index list only the appropriate owner as the
+     owner of each block?
+
+   - Are none of the blocks claimed as free space?
+
+   - If these aren't file data blocks, are none of the blocks claimed as space
+     shared by different owners?
+
+2. Btree blocks are cross-referenced as follows:
+
+   - Everything in class 1 above.
+ + - If there's a parent node block, do the keys listed for this block match the + keyspace of this block? + + - Do the sibling pointers point to valid blocks? Of the same level? + + - Do the child pointers point to valid blocks? Of the next level down? + +3. Free space btree records are cross-referenced as follows: + + - Everything in class 1 and 2 above. + + - Does the reverse mapping index list no owners of this space? + + - Is this space not claimed by the inode index for inodes? + + - Is it not mentioned by the reference count index? + + - Is there a matching record in the other free space btree? + +4. Inode btree records are cross-referenced as follows: + + - Everything in class 1 and 2 above. + + - Is there a matching record in free inode btree? + + - Do cleared bits in the holemask correspond with inode clusters? + + - Do set bits in the freemask correspond with inode records with zero link + count? + +5. Inode records are cross-referenced as follows: + + - Everything in class 1. + + - Do all the fields that summarize information about the file forks actually + match those forks? + + - Does each inode with zero link count correspond to a record in the free + inode btree? + +6. File fork space mapping records are cross-referenced as follows: + + - Everything in class 1 and 2 above. + + - Is this space not mentioned by the inode btrees? + + - If this is a CoW fork mapping, does it correspond to a CoW entry in the + reference count btree? + +7. Reference count records are cross-referenced as follows: + + - Everything in class 1 and 2 above. + + - Within the space subkeyspace of the rmap btree (that is to say, all + records mapped to a particular space extent and ignoring the owner info), + are there the same number of reverse mapping records for each block as the + reference count record claims? + +Proposed patchsets are the series to find gaps in +`refcount btree +`_, +`inode btree +`_, and +`rmap btree +`_ records; +to find +`mergeable records +`_; +and to +`improve cross referencing with rmap +`_ +before starting a repair. + +Checking Extended Attributes +```````````````````````````` + +Extended attributes implement a key-value store that enable fragments of data +to be attached to any file. +Both the kernel and userspace can access the keys and values, subject to +namespace and privilege restrictions. +Most typically these fragments are metadata about the file -- origins, security +contexts, user-supplied labels, indexing information, etc. + +Names can be as long as 255 bytes and can exist in several different +namespaces. +Values can be as large as 64KB. +A file's extended attributes are stored in blocks mapped by the attr fork. +The mappings point to leaf blocks, remote value blocks, or dabtree blocks. +Block 0 in the attribute fork is always the top of the structure, but otherwise +each of the three types of blocks can be found at any offset in the attr fork. +Leaf blocks contain attribute key records that point to the name and the value. +Names are always stored elsewhere in the same leaf block. +Values that are less than 3/4 the size of a filesystem block are also stored +elsewhere in the same leaf block. +Remote value blocks contain values that are too large to fit inside a leaf. +If the leaf information exceeds a single filesystem block, a dabtree (also +rooted at block 0) is created to map hashes of the attribute names to leaf +blocks in the attr fork. 
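+
+As a rough illustration of the size rule described above, the following hedged
+sketch (not the actual kernel policy, which also accounts for record headers
+and other per-block overhead) decides whether an attribute value would be
+stored locally in the leaf block or in a remote value block:
+
+.. code-block:: c
+
+    #include <stdbool.h>
+    #include <stddef.h>
+
+    /*
+     * Simplified rule of thumb from the text above: values smaller than three
+     * quarters of a filesystem block stay in the leaf block, and larger values
+     * go to remote value blocks.
+     */
+    static bool attr_value_goes_remote(size_t value_len, size_t blocksize)
+    {
+            return value_len >= (blocksize * 3) / 4;
+    }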
+
+Checking an extended attribute structure is not so straightforward due to the
+lack of separation between attr blocks and index blocks.
+Scrub must read each block mapped by the attr fork and ignore the non-leaf
+blocks:
+
+1. Walk the dabtree in the attr fork (if present) to ensure that there are no
+   irregularities in the blocks or dabtree mappings that do not point to
+   attr leaf blocks.
+
+2. Walk the blocks of the attr fork looking for leaf blocks.
+   For each entry inside a leaf:
+
+   a. Validate that the name does not contain invalid characters.
+
+   b. Read the attr value.
+      This performs a named lookup of the attr name to ensure the correctness
+      of the dabtree.
+      If the value is stored in a remote block, this also validates the
+      integrity of the remote value block.
+
+Checking and Cross-Referencing Directories
+``````````````````````````````````````````
+
+The filesystem directory tree is a directed acyclic graph structure, with files
+constituting the nodes, and directory entries (dirents) constituting the edges.
+Directories are a special type of file containing a set of mappings from a
+255-byte sequence (name) to an inumber.
+These are called directory entries, or dirents for short.
+Each directory file must have exactly one directory pointing to the file.
+A root directory points to itself.
+Directory entries point to files of any type.
+Each non-directory file may have multiple directories point to it.
+
+In XFS, directories are implemented as a file containing up to three 32GB
+partitions.
+The first partition contains directory entry data blocks.
+Each data block contains variable-sized records associating a user-provided
+name with an inumber and, optionally, a file type.
+If the directory entry data grows beyond one block, the second partition (which
+exists as post-EOF extents) is populated with a block containing free space
+information and an index that maps hashes of the dirent names to directory data
+blocks in the first partition.
+This makes directory name lookups very fast.
+If this second partition grows beyond one block, the third partition is
+populated with a linear array of free space information for faster
+expansions.
+If the free space has been separated and the second partition grows again
+beyond one block, then a dabtree is used to map hashes of dirent names to
+directory data blocks.
+
+Checking a directory is pretty straightforward:
+
+1. Walk the dabtree in the second partition (if present) to ensure that there
+   are no irregularities in the blocks or dabtree mappings that do not point to
+   dirent blocks.
+
+2. Walk the blocks of the first partition looking for directory entries.
+   Each dirent is checked as follows (see the sketch at the end of this
+   section):
+
+   a. Does the name contain no invalid characters?
+
+   b. Does the inumber correspond to an actual, allocated inode?
+
+   c. Does the child inode have a nonzero link count?
+
+   d. If a file type is included in the dirent, does it match the type of the
+      inode?
+
+   e. If the child is a subdirectory, does the child's dotdot pointer point
+      back to the parent?
+
+   f. If the directory has a second partition, perform a named lookup of the
+      dirent name to ensure the correctness of the dabtree.
+
+3. Walk the free space list in the third partition (if present) to ensure that
+   the free spaces it describes are really unused.
+
+Checking operations involving :ref:`parents ` and
+:ref:`file link counts ` are discussed in more detail in later
+sections.
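+
+To make the per-dirent checks above concrete, here is a hedged sketch that
+restates steps (a) through (e) in code form using simplified stand-in types;
+it is not the actual scrub implementation, and the dabtree lookup of step (f)
+is omitted:
+
+.. code-block:: c
+
+    #include <stdbool.h>
+    #include <stdint.h>
+    #include <string.h>
+
+    struct sketch_dirent {
+            char            name[256];      /* dirent name, NUL terminated */
+            uint64_t        inumber;        /* inode this entry points to */
+            uint8_t         ftype;          /* recorded file type, 0 if none */
+    };
+
+    struct sketch_inode {
+            bool            allocated;      /* is this inumber in use? */
+            bool            is_dir;         /* is this inode a directory? */
+            uint32_t        nlink;          /* link count */
+            uint8_t         ftype;          /* actual file type */
+            uint64_t        dotdot;         /* parent inumber, if a directory */
+    };
+
+    /* Caller-provided inode lookup; returns false if the inumber is unknown. */
+    typedef bool (*inode_lookup_fn)(uint64_t inumber, struct sketch_inode *out);
+
+    /* Return true if one dirent passes checks (a) through (e). */
+    static bool check_dirent(const struct sketch_dirent *de, uint64_t parent_ino,
+                             inode_lookup_fn lookup)
+    {
+            size_t len = strnlen(de->name, sizeof(de->name));
+            struct sketch_inode ino;
+            size_t i;
+
+            /* (a) the name must be nonempty and free of '/' characters */
+            if (len == 0)
+                    return false;
+            for (i = 0; i < len; i++)
+                    if (de->name[i] == '/')
+                            return false;
+
+            /* (b) the inumber must correspond to an allocated inode */
+            if (!lookup(de->inumber, &ino) || !ino.allocated)
+                    return false;
+
+            /* (c) the child inode must have a nonzero link count */
+            if (ino.nlink == 0)
+                    return false;
+
+            /* (d) a recorded file type must match the inode's actual type */
+            if (de->ftype != 0 && de->ftype != ino.ftype)
+                    return false;
+
+            /* (e) a subdirectory's dotdot pointer must point to the parent */
+            if (ino.is_dir && ino.dotdot != parent_ino)
+                    return false;
+
+            return true;
+    }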
+
+Checking Directory/Attribute Btrees
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+As stated in previous sections, the directory/attribute btree (dabtree) index
+maps user-provided names to improve lookup times by avoiding linear scans.
+Internally, it maps a 32-bit hash of the name to a block offset within the
+appropriate file fork.
+
+The internal structure of a dabtree closely resembles the btrees that record
+fixed-size metadata records -- each dabtree block contains a magic number, a
+checksum, sibling pointers, a UUID, a tree level, and a log sequence number.
+The format of leaf and node records is the same -- each entry points to the
+next level down in the hierarchy, with dabtree node records pointing to dabtree
+leaf blocks, and dabtree leaf records pointing to non-dabtree blocks elsewhere
+in the fork.
+
+Checking and cross-referencing the dabtree is very similar to what is done for
+space btrees:
+
+- Does the type of data stored in the block match what scrub is expecting?
+
+- Does the block belong to the owning structure that asked for the read?
+
+- Do the records fit within the block?
+
+- Are the records contained inside the block free of obvious corruptions?
+
+- Are the name hashes in the correct order?
+
+- Do node pointers within the dabtree point to valid fork offsets for dabtree
+  blocks?
+
+- Do leaf pointers within the dabtree point to valid fork offsets for directory
+  or attr leaf blocks?
+
+- Do child pointers point towards the leaves?
+
+- Do sibling pointers point across the same level?
+
+- For each dabtree node record, does the record key accurately reflect the
+  contents of the child dabtree block?
+
+- For each dabtree leaf record, does the record key accurately reflect the
+  contents of the directory or attr block?
+
+Cross-Referencing Summary Counters
+``````````````````````````````````
+
+XFS maintains three classes of summary counters: available resources, quota
+resource usage, and file link counts.
+
+In theory, the amount of available resources (data blocks, inodes, realtime
+extents) can be found by walking the entire filesystem.
+This would make for very slow reporting, so a transactional filesystem can
+maintain summaries of this information in the superblock.
+Cross-referencing these values against the filesystem metadata should be a
+simple matter of walking the free space and inode metadata in each AG and the
+realtime bitmap, but there are complications that will be discussed in
+:ref:`more detail ` later.
+
+:ref:`Quota usage ` and :ref:`file link count `
+checking are sufficiently complicated to warrant separate sections.
+
+Post-Repair Reverification
+``````````````````````````
+
+After performing a repair, the checking code is run a second time to validate
+the new structure, and the results of the health assessment are recorded
+internally and returned to the calling process.
+This step is critical for enabling the system administrator to monitor the
+status of the filesystem and the progress of any repairs.
+For developers, it is a useful means to judge the efficacy of error detection
+and correction in the online and offline checking tools.
+
+Eventual Consistency vs. Online Fsck
+------------------------------------
+
+Complex operations can make modifications to multiple per-AG data structures
+with a chain of transactions.
+These chains, once committed to the log, are restarted during log recovery if
+the system crashes while processing the chain.
+Because the AG header buffers are unlocked between transactions within a chain, +online checking must coordinate with chained operations that are in progress to +avoid incorrectly detecting inconsistencies due to pending chains. +Furthermore, online repair must not run when operations are pending because +the metadata are temporarily inconsistent with each other, and rebuilding is +not possible. + +Only online fsck has this requirement of total consistency of AG metadata, and +should be relatively rare as compared to filesystem change operations. +Online fsck coordinates with transaction chains as follows: + +* For each AG, maintain a count of intent items targeting that AG. + The count should be bumped whenever a new item is added to the chain. + The count should be dropped when the filesystem has locked the AG header + buffers and finished the work. + +* When online fsck wants to examine an AG, it should lock the AG header + buffers to quiesce all transaction chains that want to modify that AG. + If the count is zero, proceed with the checking operation. + If it is nonzero, cycle the buffer locks to allow the chain to make forward + progress. + +This may lead to online fsck taking a long time to complete, but regular +filesystem updates take precedence over background checking activity. +Details about the discovery of this situation are presented in the +:ref:`next section `, and details about the solution +are presented :ref:`after that`. + +.. _chain_coordination: + +Discovery of the Problem +```````````````````````` + +Midway through the development of online scrubbing, the fsstress tests +uncovered a misinteraction between online fsck and compound transaction chains +created by other writer threads that resulted in false reports of metadata +inconsistency. +The root cause of these reports is the eventual consistency model introduced by +the expansion of deferred work items and compound transaction chains when +reverse mapping and reflink were introduced. + +Originally, transaction chains were added to XFS to avoid deadlocks when +unmapping space from files. +Deadlock avoidance rules require that AGs only be locked in increasing order, +which makes it impossible (say) to use a single transaction to free a space +extent in AG 7 and then try to free a now superfluous block mapping btree block +in AG 3. +To avoid these kinds of deadlocks, XFS creates Extent Freeing Intent (EFI) log +items to commit to freeing some space in one transaction while deferring the +actual metadata updates to a fresh transaction. +The transaction sequence looks like this: + +1. The first transaction contains a physical update to the file's block mapping + structures to remove the mapping from the btree blocks. + It then attaches to the in-memory transaction an action item to schedule + deferred freeing of space. + Concretely, each transaction maintains a list of ``struct + xfs_defer_pending`` objects, each of which maintains a list of ``struct + xfs_extent_free_item`` objects. + Returning to the example above, the action item tracks the freeing of both + the unmapped space from AG 7 and the block mapping btree (BMBT) block from + AG 3. + Deferred frees recorded in this manner are committed in the log by creating + an EFI log item from the ``struct xfs_extent_free_item`` object and + attaching the log item to the transaction. + When the log is persisted to disk, the EFI item is written into the ondisk + transaction record. + EFIs can list up to 16 extents to free, all sorted in AG order. + +2. 
The second transaction contains a physical update to the free space btrees
+   of AG 3 to release the former BMBT block and a second physical update to the
+   free space btrees of AG 7 to release the unmapped file space.
+   Observe that the physical updates are resequenced in the correct order
+   when possible.
+   Attached to the transaction is an extent free done (EFD) log item.
+   The EFD contains a pointer to the EFI logged in transaction #1 so that log
+   recovery can tell if the EFI needs to be replayed.
+
+If the system goes down after transaction #1 is written back to the filesystem
+but before #2 is committed, a scan of the filesystem metadata would show
+inconsistent filesystem metadata because there would not appear to be any owner
+of the unmapped space.
+Happily, log recovery corrects this inconsistency for us -- when recovery finds
+an intent log item but does not find a corresponding intent done item, it will
+reconstruct the incore state of the intent item and finish it.
+In the example above, the log must replay both frees described in the recovered
+EFI to complete the recovery phase.
+
+There are subtleties to XFS' transaction chaining strategy to consider:
+
+* Log items must be added to a transaction in the correct order to prevent
+  conflicts with principal objects that are not held by the transaction.
+  In other words, all per-AG metadata updates for an unmapped block must be
+  completed before the last update to free the extent, and extents should not
+  be reallocated until that last update commits to the log.
+
+* AG header buffers are released between each transaction in a chain.
+  This means that other threads can observe an AG in an intermediate state,
+  but as long as the first subtlety is handled, this should not affect the
+  correctness of filesystem operations.
+
+* Unmounting the filesystem flushes all pending work to disk, which means that
+  offline fsck never sees the temporary inconsistencies caused by deferred
+  work item processing.
+
+In this manner, XFS employs a form of eventual consistency to avoid deadlocks
+and increase parallelism.
+
+During the design phase of the reverse mapping and reflink features, it was
+decided that it was impractical to cram all the reverse mapping updates for a
+single filesystem change into a single transaction because a single file
+mapping operation can explode into many small updates:
+
+* The block mapping update itself
+* A reverse mapping update for the block mapping update
+* Fixing the freelist
+* A reverse mapping update for the freelist fix
+
+* A shape change to the block mapping btree
+* A reverse mapping update for the btree update
+* Fixing the freelist (again)
+* A reverse mapping update for the freelist fix
+
+* An update to the reference counting information
+* A reverse mapping update for the refcount update
+* Fixing the freelist (a third time)
+* A reverse mapping update for the freelist fix
+
+* Freeing any space that was unmapped and not owned by any other file
+* Fixing the freelist (a fourth time)
+* A reverse mapping update for the freelist fix
+
+* Freeing the space used by the block mapping btree
+* Fixing the freelist (a fifth time)
+* A reverse mapping update for the freelist fix
+
+Free list fixups are not usually needed more than once per AG per transaction
+chain, but it is theoretically possible if space is very tight.
+For copy-on-write updates this is even worse, because this must be done once to
+remove the space from a staging area and again to map it into the file!
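+
+As a rough sketch of the bookkeeping described above -- a transaction carrying
+a list of pending deferred-work items, each of which carries its own list of
+work units such as extents to free -- consider the following simplified
+stand-in types; the real ``struct xfs_defer_pending`` and ``struct
+xfs_extent_free_item`` carry considerably more state (intent log items,
+operation type tables, and so on):
+
+.. code-block:: c
+
+    #include <stdint.h>
+    #include <stddef.h>
+
+    struct sketch_extent_free_item {
+            struct sketch_extent_free_item  *next;       /* next extent to free */
+            uint64_t                        startblock;
+            uint64_t                        blockcount;
+    };
+
+    struct sketch_defer_pending {
+            struct sketch_defer_pending     *next;       /* next pending item */
+            struct sketch_extent_free_item  *work_list;  /* work units */
+    };
+
+    struct sketch_transaction {
+            struct sketch_defer_pending     *dfops;      /* pending deferred work */
+    };
+
+    /*
+     * Record an extent to be freed by a later transaction in the chain by
+     * attaching a freshly allocated pending item to the transaction.
+     */
+    static void sketch_defer_extent_free(struct sketch_transaction *tp,
+                                         struct sketch_defer_pending *new_dfp,
+                                         struct sketch_extent_free_item *xefi)
+    {
+            xefi->next = NULL;
+            new_dfp->work_list = xefi;
+            new_dfp->next = tp->dfops;
+            tp->dfops = new_dfp;
+    }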
+ +To deal with this explosion in a calm manner, XFS expands its use of deferred +work items to cover most reverse mapping updates and all refcount updates. +This reduces the worst case size of transaction reservations by breaking the +work into a long chain of small updates, which increases the degree of eventual +consistency in the system. +Again, this generally isn't a problem because XFS orders its deferred work +items carefully to avoid resource reuse conflicts between unsuspecting threads. + +However, online fsck changes the rules -- remember that although physical +updates to per-AG structures are coordinated by locking the buffers for AG +headers, buffer locks are dropped between transactions. +Once scrub acquires resources and takes locks for a data structure, it must do +all the validation work without releasing the lock. +If the main lock for a space btree is an AG header buffer lock, scrub may have +interrupted another thread that is midway through finishing a chain. +For example, if a thread performing a copy-on-write has completed a reverse +mapping update but not the corresponding refcount update, the two AG btrees +will appear inconsistent to scrub and an observation of corruption will be +recorded. This observation will not be correct. +If a repair is attempted in this state, the results will be catastrophic! + +Several other solutions to this problem were evaluated upon discovery of this +flaw and rejected: + +1. Add a higher level lock to allocation groups and require writer threads to + acquire the higher level lock in AG order before making any changes. + This would be very difficult to implement in practice because it is + difficult to determine which locks need to be obtained, and in what order, + without simulating the entire operation. + Performing a dry run of a file operation to discover necessary locks would + make the filesystem very slow. + +2. Make the deferred work coordinator code aware of consecutive intent items + targeting the same AG and have it hold the AG header buffers locked across + the transaction roll between updates. + This would introduce a lot of complexity into the coordinator since it is + only loosely coupled with the actual deferred work items. + It would also fail to solve the problem because deferred work items can + generate new deferred subtasks, but all subtasks must be complete before + work can start on a new sibling task. + +3. Teach online fsck to walk all transactions waiting for whichever lock(s) + protect the data structure being scrubbed to look for pending operations. + The checking and repair operations must factor these pending operations into + the evaluations being performed. + This solution is a nonstarter because it is *extremely* invasive to the main + filesystem. + +.. _intent_drains: + +Intent Drains +````````````` + +Online fsck uses an atomic intent item counter and lock cycling to coordinate +with transaction chains. +There are two key properties to the drain mechanism. +First, the counter is incremented when a deferred work item is *queued* to a +transaction, and it is decremented after the associated intent done log item is +*committed* to another transaction. +The second property is that deferred work can be added to a transaction without +holding an AG header lock, but per-AG work items cannot be marked done without +locking that AG header buffer to log the physical updates and the intent done +log item. 
+The first property enables scrub to yield to running transaction chains, which
+is an explicit deprioritization of online fsck to benefit file operations.
+The second property of the drain is key to the correct coordination of scrub,
+since scrub will always be able to decide if a conflict is possible.
+
+For regular filesystem code, the drain works as follows:
+
+1. Call the appropriate subsystem function to add a deferred work item to a
+   transaction.
+
+2. The function calls ``xfs_defer_drain_bump`` to increase the counter.
+
+3. When the deferred item manager wants to finish the deferred work item, it
+   calls ``->finish_item`` to complete it.
+
+4. The ``->finish_item`` implementation logs some changes and calls
+   ``xfs_defer_drain_drop`` to decrease the sloppy counter and wake up any threads
+   waiting on the drain.
+
+5. The subtransaction commits, which unlocks the resource associated with the
+   intent item.
+
+For scrub, the drain works as follows:
+
+1. Lock the resource(s) associated with the metadata being scrubbed.
+   For example, a scan of the refcount btree would lock the AGI and AGF header
+   buffers.
+
+2. If the counter is zero (``xfs_defer_drain_busy`` returns false), there are no
+   chains in progress and the operation may proceed.
+
+3. Otherwise, release the resources grabbed in step 1.
+
+4. Wait for the intent counter to reach zero (``xfs_defer_drain_intents``), then go
+   back to step 1 unless a signal has been caught.
+
+To avoid polling in step 4, the drain provides a waitqueue for scrub threads to
+be woken up whenever the intent count drops to zero.
+
+The proposed patchset is the
+`scrub intent drain series
+`_.
+
+.. _jump_labels:
+
+Static Keys (aka Jump Label Patching)
+`````````````````````````````````````
+
+Online fsck for XFS separates the regular filesystem from the checking and
+repair code as much as possible.
+However, there are a few parts of online fsck (such as the intent drains, and
+later, live update hooks) where it is useful for the online fsck code to know
+what's going on in the rest of the filesystem.
+Since it is not expected that online fsck will be constantly running in the
+background, it is very important to minimize the runtime overhead imposed by
+these hooks when online fsck is compiled into the kernel but not actively
+running on behalf of userspace.
+Taking locks in the hot path of a writer thread to access a data structure only
+to find that no further action is necessary is expensive -- on the author's
+computer, this has an overhead of 40-50ns per access.
+Fortunately, the kernel supports dynamic code patching, which enables XFS to
+replace a static branch to hook code with ``nop`` sleds when online fsck isn't
+running.
+This sled has an overhead of however long it takes the instruction decoder to
+skip past the sled, which seems to be on the order of less than 1ns and
+does not access memory outside of instruction fetching.
+
+When online fsck enables the static key, the sled is replaced with an
+unconditional branch to call the hook code.
+The switchover is quite expensive (~22000ns) but is paid entirely by the
+program that invoked online fsck, and can be amortized if multiple threads
+enter online fsck at the same time, or if multiple filesystems are being
+checked at the same time.
+Changing the branch direction requires taking the CPU hotplug lock, and since +CPU initialization requires memory allocation, online fsck must be careful not +to change a static key while holding any locks or resources that could be +accessed in the memory reclaim paths. +To minimize contention on the CPU hotplug lock, care should be taken not to +enable or disable static keys unnecessarily. + +Because static keys are intended to minimize hook overhead for regular +filesystem operations when xfs_scrub is not running, the intended usage +patterns are as follows: + +- The hooked part of XFS should declare a static-scoped static key that + defaults to false. + The ``DEFINE_STATIC_KEY_FALSE`` macro takes care of this. + The static key itself should be declared as a ``static`` variable. + +- When deciding to invoke code that's only used by scrub, the regular + filesystem should call the ``static_branch_unlikely`` predicate to avoid the + scrub-only hook code if the static key is not enabled. + +- The regular filesystem should export helper functions that call + ``static_branch_inc`` to enable and ``static_branch_dec`` to disable the + static key. + Wrapper functions make it easy to compile out the relevant code if the kernel + distributor turns off online fsck at build time. + +- Scrub functions wanting to turn on scrub-only XFS functionality should call + the ``xchk_fsgates_enable`` from the setup function to enable a specific + hook. + This must be done before obtaining any resources that are used by memory + reclaim. + Callers had better be sure they really need the functionality gated by the + static key; the ``TRY_HARDER`` flag is useful here. + +Online scrub has resource acquisition helpers (e.g. ``xchk_perag_lock``) to +handle locking AGI and AGF buffers for all scrubber functions. +If it detects a conflict between scrub and the running transactions, it will +try to wait for intents to complete. +If the caller of the helper has not enabled the static key, the helper will +return -EDEADLOCK, which should result in the scrub being restarted with the +``TRY_HARDER`` flag set. +The scrub setup function should detect that flag, enable the static key, and +try the scrub again. +Scrub teardown disables all static keys obtained by ``xchk_fsgates_enable``. + +For more information, please see the kernel documentation of +Documentation/staging/static-keys.rst. + +.. _xfile: + +Pageable Kernel Memory +---------------------- + +Some online checking functions work by scanning the filesystem to build a +shadow copy of an ondisk metadata structure in memory and comparing the two +copies. +For online repair to rebuild a metadata structure, it must compute the record +set that will be stored in the new structure before it can persist that new +structure to disk. +Ideally, repairs complete with a single atomic commit that introduces +a new data structure. +To meet these goals, the kernel needs to collect a large amount of information +in a place that doesn't require the correct operation of the filesystem. + +Kernel memory isn't suitable because: + +* Allocating a contiguous region of memory to create a C array is very + difficult, especially on 32-bit systems. + +* Linked lists of records introduce double pointer overhead which is very high + and eliminate the possibility of indexed lookups. + +* Kernel memory is pinned, which can drive the system into OOM conditions. + +* The system might not have sufficient memory to stage all the information. 
+ +At any given time, online fsck does not need to keep the entire record set in +memory, which means that individual records can be paged out if necessary. +Continued development of online fsck demonstrated that the ability to perform +indexed data storage would also be very useful. +Fortunately, the Linux kernel already has a facility for byte-addressable and +pageable storage: tmpfs. +In-kernel graphics drivers (most notably i915) take advantage of tmpfs files +to store intermediate data that doesn't need to be in memory at all times, so +that usage precedent is already established. +Hence, the ``xfile`` was born! + ++--------------------------------------------------------------------------+ +| **Historical Sidebar**: | ++--------------------------------------------------------------------------+ +| The first edition of online repair inserted records into a new btree as | +| it found them, which failed because filesystem could shut down with a | +| built data structure, which would be live after recovery finished. | +| | +| The second edition solved the half-rebuilt structure problem by storing | +| everything in memory, but frequently ran the system out of memory. | +| | +| The third edition solved the OOM problem by using linked lists, but the | +| memory overhead of the list pointers was extreme. | ++--------------------------------------------------------------------------+ + +xfile Access Models +``````````````````` + +A survey of the intended uses of xfiles suggested these use cases: + +1. Arrays of fixed-sized records (space management btrees, directory and + extended attribute entries) + +2. Sparse arrays of fixed-sized records (quotas and link counts) + +3. Large binary objects (BLOBs) of variable sizes (directory and extended + attribute names and values) + +4. Staging btrees in memory (reverse mapping btrees) + +5. Arbitrary contents (realtime space management) + +To support the first four use cases, high level data structures wrap the xfile +to share functionality between online fsck functions. +The rest of this section discusses the interfaces that the xfile presents to +four of those five higher level data structures. +The fifth use case is discussed in the :ref:`realtime summary ` case +study. + +The most general storage interface supported by the xfile enables the reading +and writing of arbitrary quantities of data at arbitrary offsets in the xfile. +This capability is provided by ``xfile_pread`` and ``xfile_pwrite`` functions, +which behave similarly to their userspace counterparts. +XFS is very record-based, which suggests that the ability to load and store +complete records is important. +To support these cases, a pair of ``xfile_obj_load`` and ``xfile_obj_store`` +functions are provided to read and persist objects into an xfile. +They are internally the same as pread and pwrite, except that they treat any +error as an out of memory error. +For online repair, squashing error conditions in this manner is an acceptable +behavior because the only reaction is to abort the operation back to userspace. +All five xfile usecases can be serviced by these four functions. + +However, no discussion of file access idioms is complete without answering the +question, "But what about mmap?" +It is convenient to access storage directly with pointers, just like userspace +code does with regular memory. +Online fsck must not drive the system into OOM conditions, which means that +xfiles must be responsive to memory reclamation. 
+tmpfs can only push a pagecache folio to the swap cache if the folio is neither +pinned nor locked, which means the xfile must not pin too many folios. + +Short term direct access to xfile contents is done by locking the pagecache +folio and mapping it into kernel address space. +Programmatic access (e.g. pread and pwrite) uses this mechanism. +Folio locks are not supposed to be held for long periods of time, so long +term direct access to xfile contents is done by bumping the folio refcount, +mapping it into kernel address space, and dropping the folio lock. +These long term users *must* be responsive to memory reclaim by hooking into +the shrinker infrastructure to know when to release folios. + +The ``xfile_get_page`` and ``xfile_put_page`` functions are provided to +retrieve the (locked) folio that backs part of an xfile and to release it. +The only code to use these folio lease functions are the xfarray +:ref:`sorting` algorithms and the :ref:`in-memory +btrees`. + +xfile Access Coordination +````````````````````````` + +For security reasons, xfiles must be owned privately by the kernel. +They are marked ``S_PRIVATE`` to prevent interference from the security system, +must never be mapped into process file descriptor tables, and their pages must +never be mapped into userspace processes. + +To avoid locking recursion issues with the VFS, all accesses to the shmfs file +are performed by manipulating the page cache directly. +xfile writers call the ``->write_begin`` and ``->write_end`` functions of the +xfile's address space to grab writable pages, copy the caller's buffer into the +page, and release the pages. +xfile readers call ``shmem_read_mapping_page_gfp`` to grab pages directly +before copying the contents into the caller's buffer. +In other words, xfiles ignore the VFS read and write code paths to avoid +having to create a dummy ``struct kiocb`` and to avoid taking inode and +freeze locks. +tmpfs cannot be frozen, and xfiles must not be exposed to userspace. + +If an xfile is shared between threads to stage repairs, the caller must provide +its own locks to coordinate access. +For example, if a scrub function stores scan results in an xfile and needs +other threads to provide updates to the scanned data, the scrub function must +provide a lock for all threads to share. + +.. _xfarray: + +Arrays of Fixed-Sized Records +````````````````````````````` + +In XFS, each type of indexed space metadata (free space, inodes, reference +counts, file fork space, and reverse mappings) consists of a set of fixed-size +records indexed with a classic B+ tree. +Directories have a set of fixed-size dirent records that point to the names, +and extended attributes have a set of fixed-size attribute keys that point to +names and values. +Quota counters and file link counters index records with numbers. +During a repair, scrub needs to stage new records during the gathering step and +retrieve them during the btree building step. + +Although this requirement can be satisfied by calling the read and write +methods of the xfile directly, it is simpler for callers for there to be a +higher level abstraction to take care of computing array offsets, to provide +iterator functions, and to deal with sparse records and sorting. +The ``xfarray`` abstraction presents a linear array for fixed-size records atop +the byte-accessible xfile. + +.. _xfarray_access_patterns: + +Array Access Patterns +^^^^^^^^^^^^^^^^^^^^^ + +Array access patterns in online fsck tend to fall into three categories. 
+Iteration of records is assumed to be necessary for all cases and will be +covered in the next section. + +The first type of caller handles records that are indexed by position. +Gaps may exist between records, and a record may be updated multiple times +during the collection step. +In other words, these callers want a sparse linearly addressed table file. +The typical use case are quota records or file link count records. +Access to array elements is performed programmatically via ``xfarray_load`` and +``xfarray_store`` functions, which wrap the similarly-named xfile functions to +provide loading and storing of array elements at arbitrary array indices. +Gaps are defined to be null records, and null records are defined to be a +sequence of all zero bytes. +Null records are detected by calling ``xfarray_element_is_null``. +They are created either by calling ``xfarray_unset`` to null out an existing +record or by never storing anything to an array index. + +The second type of caller handles records that are not indexed by position +and do not require multiple updates to a record. +The typical use case here is rebuilding space btrees and key/value btrees. +These callers can add records to the array without caring about array indices +via the ``xfarray_append`` function, which stores a record at the end of the +array. +For callers that require records to be presentable in a specific order (e.g. +rebuilding btree data), the ``xfarray_sort`` function can arrange the sorted +records; this function will be covered later. + +The third type of caller is a bag, which is useful for counting records. +The typical use case here is constructing space extent reference counts from +reverse mapping information. +Records can be put in the bag in any order, they can be removed from the bag +at any time, and uniqueness of records is left to callers. +The ``xfarray_store_anywhere`` function is used to insert a record in any +null record slot in the bag; and the ``xfarray_unset`` function removes a +record from the bag. + +The proposed patchset is the +`big in-memory array +`_. + +Iterating Array Elements +^^^^^^^^^^^^^^^^^^^^^^^^ + +Most users of the xfarray require the ability to iterate the records stored in +the array. +Callers can probe every possible array index with the following: + +.. code-block:: c + + xfarray_idx_t i; + foreach_xfarray_idx(array, i) { + xfarray_load(array, i, &rec); + + /* do something with rec */ + } + +All users of this idiom must be prepared to handle null records or must already +know that there aren't any. + +For xfarray users that want to iterate a sparse array, the ``xfarray_iter`` +function ignores indices in the xfarray that have never been written to by +calling ``xfile_seek_data`` (which internally uses ``SEEK_DATA``) to skip areas +of the array that are not populated with memory pages. +Once it finds a page, it will skip the zeroed areas of the page. + +.. code-block:: c + + xfarray_idx_t i = XFARRAY_CURSOR_INIT; + while ((ret = xfarray_iter(array, &i, &rec)) == 1) { + /* do something with rec */ + } + +.. _xfarray_sort: + +Sorting Array Elements +^^^^^^^^^^^^^^^^^^^^^^ + +During the fourth demonstration of online repair, a community reviewer remarked +that for performance reasons, online repair ought to load batches of records +into btree record blocks instead of inserting records into a new btree one at a +time. 
+The btree insertion code in XFS is responsible for maintaining correct ordering +of the records, so naturally the xfarray must also support sorting the record +set prior to bulk loading. + +Case Study: Sorting xfarrays +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The sorting algorithm used in the xfarray is actually a combination of adaptive +quicksort and a heapsort subalgorithm in the spirit of +`Sedgewick `_ and +`pdqsort `_, with customizations for the Linux +kernel. +To sort records in a reasonably short amount of time, ``xfarray`` takes +advantage of the binary subpartitioning offered by quicksort, but it also uses +heapsort to hedge against performance collapse if the chosen quicksort pivots +are poor. +Both algorithms are (in general) O(n * lg(n)), but there is a wide performance +gulf between the two implementations. + +The Linux kernel already contains a reasonably fast implementation of heapsort. +It only operates on regular C arrays, which limits the scope of its usefulness. +There are two key places where the xfarray uses it: + +* Sorting any record subset backed by a single xfile page. + +* Loading a small number of xfarray records from potentially disparate parts + of the xfarray into a memory buffer, and sorting the buffer. + +In other words, ``xfarray`` uses heapsort to constrain the nested recursion of +quicksort, thereby mitigating quicksort's worst runtime behavior. + +Choosing a quicksort pivot is a tricky business. +A good pivot splits the set to sort in half, leading to the divide and conquer +behavior that is crucial to O(n * lg(n)) performance. +A poor pivot barely splits the subset at all, leading to O(n\ :sup:`2`) +runtime. +The xfarray sort routine tries to avoid picking a bad pivot by sampling nine +records into a memory buffer and using the kernel heapsort to identify the +median of the nine. + +Most modern quicksort implementations employ Tukey's "ninther" to select a +pivot from a classic C array. +Typical ninther implementations pick three unique triads of records, sort each +of the triads, and then sort the middle value of each triad to determine the +ninther value. +As stated previously, however, xfile accesses are not entirely cheap. +It turned out to be much more performant to read the nine elements into a +memory buffer, run the kernel's in-memory heapsort on the buffer, and choose +the 4th element of that buffer as the pivot. +Tukey's ninthers are described in J. W. Tukey, `The ninther, a technique for +low-effort robust (resistant) location in large samples`, in *Contributions to +Survey Sampling and Applied Statistics*, edited by H. David, (Academic Press, +1978), pp. 251–257. + +The partitioning of quicksort is fairly textbook -- rearrange the record +subset around the pivot, then set up the current and next stack frames to +sort with the larger and the smaller halves of the pivot, respectively. +This keeps the stack space requirements to log2(record count). + +As a final performance optimization, the hi and lo scanning phase of quicksort +keeps examined xfile pages mapped in the kernel for as long as possible to +reduce map/unmap cycles. +Surprisingly, this reduces overall sort runtime by nearly half again after +accounting for the application of heapsort directly onto xfile pages. + +.. _xfblob: + +Blob Storage +```````````` + +Extended attributes and directories add an additional requirement for staging +records: arbitrary byte sequences of finite length. 
+Each directory entry record needs to store the entry name,
+and each extended attribute needs to store both the attribute name and value.
+The names, keys, and values can consume a large amount of memory, so the
+``xfblob`` abstraction was created to simplify management of these blobs
+atop an xfile.
+
+Blob arrays provide ``xfblob_load`` and ``xfblob_store`` functions to retrieve
+and persist objects.
+The store function returns a magic cookie for every object that it persists.
+Later, callers provide this cookie to ``xfblob_load`` to recall the object.
+The ``xfblob_free`` function frees a specific blob, and the ``xfblob_truncate``
+function frees them all because compaction is not needed.
+
+The details of repairing directories and extended attributes will be discussed
+in a subsequent section about atomic extent swapping.
+However, it should be noted that these repair functions only use blob storage
+to cache a small number of entries before adding them to a temporary ondisk
+file, which is why compaction is not required.
+
+The proposed patchset is at the start of the
+`extended attribute repair
+`_ series.
+
+.. _xfbtree:
+
+In-Memory B+Trees
+`````````````````
+
+The chapter about :ref:`secondary metadata` mentioned that
+checking and repairing of secondary metadata commonly requires coordination
+between a live metadata scan of the filesystem and writer threads that are
+updating that metadata.
+Keeping the scan data up to date requires the ability to propagate
+metadata updates from the filesystem into the data being collected by the scan.
+This *can* be done by appending concurrent updates into a separate log file and
+applying them before writing the new metadata to disk, but this leads to
+unbounded memory consumption if the rest of the system is very busy.
+Another option is to skip the side-log and commit live updates from the
+filesystem directly into the scan data, which trades more overhead for a lower
+maximum memory requirement.
+In both cases, the data structure holding the scan results must support indexed
+access to perform well.
+
+Given that indexed lookups of scan data are required for both strategies, online
+fsck employs the second strategy of committing live updates directly into
+scan data.
+Because xfarrays are not indexed and do not enforce record ordering, they
+are not suitable for this task.
+Conveniently, however, XFS has a library to create and maintain ordered reverse
+mapping records: the existing rmap btree code!
+If only there was a means to create one in memory.
+
+Recall that the :ref:`xfile ` abstraction represents memory pages as a
+regular file, which means that the kernel can create byte or block addressable
+virtual address spaces at will.
+The XFS buffer cache specializes in abstracting IO to block-oriented address
+spaces, which means that adaptation of the buffer cache to interface with
+xfiles enables reuse of the entire btree library.
+Btrees built atop an xfile are collectively known as ``xfbtrees``.
+The next few sections describe how they actually work.
+
+The proposed patchset is the
+`in-memory btree
+`_
+series.
+
+Using xfiles as a Buffer Cache Target
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Two modifications are necessary to support xfiles as a buffer cache target.
+The first is to make it possible for the ``struct xfs_buftarg`` structure to
+host the ``struct xfs_buf`` rhashtable, because normally those are held by a
+per-AG structure.
+The second change is to modify the buffer ``ioapply`` function to "read" cached
+pages from the xfile and "write" cached pages back to the xfile.
+Multiple access to individual buffers is controlled by the ``xfs_buf`` lock,
+since the xfile does not provide any locking on its own.
+With this adaptation in place, users of the xfile-backed buffer cache use
+exactly the same APIs as users of the disk-backed buffer cache.
+The separation between xfile and buffer cache implies higher memory usage since
+they do not share pages, but this property could some day enable transactional
+updates to an in-memory btree.
+Today, however, it simply eliminates the need for new code.
+
+Space Management with an xfbtree
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Space management for an xfile is very simple -- each btree block is one memory
+page in size.
+These blocks use the same header format as an on-disk btree, but the in-memory
+block verifiers ignore the checksums, assuming that xfile memory is no more
+corruption-prone than regular DRAM.
+Reusing existing code here is more important than absolute memory efficiency.
+
+The very first block of an xfile backing an xfbtree contains a header block.
+The header describes the owner, height, and the block number of the root
+xfbtree block.
+
+To allocate a btree block, use ``xfile_seek_data`` to find a gap in the file.
+If there are no gaps, create one by extending the length of the xfile.
+Preallocate space for the block with ``xfile_prealloc``, and hand back the
+location.
+To free an xfbtree block, use ``xfile_discard`` (which internally uses
+``FALLOC_FL_PUNCH_HOLE``) to remove the memory page from the xfile.
+
+Populating an xfbtree
+^^^^^^^^^^^^^^^^^^^^^
+
+An online fsck function that wants to create an xfbtree should proceed as
+follows:
+
+1. Call ``xfile_create`` to create an xfile.
+
+2. Call ``xfs_alloc_memory_buftarg`` to create a buffer cache target structure
+   pointing to the xfile.
+
+3. Pass the buffer cache target, buffer ops, and other information to
+   ``xfbtree_create`` to write an initial tree header and root block to the
+   xfile.
+   Each btree type should define a wrapper that passes necessary arguments to
+   the creation function.
+   For example, rmap btrees define ``xfs_rmapbt_mem_create`` to take care of
+   all the necessary details for callers.
+   A ``struct xfbtree`` object will be returned.
+
+4. Pass the xfbtree object to the btree cursor creation function for the
+   btree type.
+   Following the example above, ``xfs_rmapbt_mem_cursor`` takes care of this
+   for callers.
+
+5. Pass the btree cursor to the regular btree functions to make queries against
+   and to update the in-memory btree.
+   For example, a btree cursor for an rmap xfbtree can be passed to the
+   ``xfs_rmap_*`` functions just like any other btree cursor.
+   See the :ref:`next section` for information on dealing with
+   xfbtree updates that are logged to a transaction.
+
+6. When finished, delete the btree cursor, destroy the xfbtree object, free the
+   buffer target, and then destroy the xfile to release all resources.
+
+.. _xfbtree_commit:
+
+Committing Logged xfbtree Buffers
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Although it is a clever hack to reuse the rmap btree code to handle the staging
+structure, the ephemeral nature of the in-memory btree block storage presents
+some challenges of its own.
+The XFS transaction manager must not commit buffer log items for buffers backed
+by an xfile because the log format does not understand updates for devices
+other than the data device.
+An ephemeral xfbtree probably will not exist by the time the AIL checkpoints +log transactions back into the filesystem, and certainly won't exist during +log recovery. +For these reasons, any code updating an xfbtree in transaction context must +remove the buffer log items from the transaction and write the updates into the +backing xfile before committing or cancelling the transaction. + +The ``xfbtree_trans_commit`` and ``xfbtree_trans_cancel`` functions implement +this functionality as follows: + +1. Find each buffer log item whose buffer targets the xfile. + +2. Record the dirty/ordered status of the log item. + +3. Detach the log item from the buffer. + +4. Queue the buffer to a special delwri list. + +5. Clear the transaction dirty flag if the only dirty log items were the ones + that were detached in step 3. + +6. Submit the delwri list to commit the changes to the xfile, if the updates + are being committed. + +After removing xfile logged buffers from the transaction in this manner, the +transaction can be committed or cancelled. + +Bulk Loading of Ondisk B+Trees +------------------------------ + +As mentioned previously, early iterations of online repair built new btree +structures by creating a new btree and adding observations individually. +Loading a btree one record at a time had a slight advantage of not requiring +the incore records to be sorted prior to commit, but was very slow and leaked +blocks if the system went down during a repair. +Loading records one at a time also meant that repair could not control the +loading factor of the blocks in the new btree. + +Fortunately, the venerable ``xfs_repair`` tool had a more efficient means for +rebuilding a btree index from a collection of records -- bulk btree loading. +This was implemented rather inefficiently code-wise, since ``xfs_repair`` +had separate copy-pasted implementations for each btree type. + +To prepare for online fsck, each of the four bulk loaders were studied, notes +were taken, and the four were refactored into a single generic btree bulk +loading mechanism. +Those notes in turn have been refreshed and are presented below. + +Geometry Computation +```````````````````` + +The zeroth step of bulk loading is to assemble the entire record set that will +be stored in the new btree, and sort the records. +Next, call ``xfs_btree_bload_compute_geometry`` to compute the shape of the +btree from the record set, the type of btree, and any load factor preferences. +This information is required for resource reservation. + +First, the geometry computation computes the minimum and maximum records that +will fit in a leaf block from the size of a btree block and the size of the +block header. +Roughly speaking, the maximum number of records is:: + + maxrecs = (block_size - header_size) / record_size + +The XFS design specifies that btree blocks should be merged when possible, +which means the minimum number of records is half of maxrecs:: + + minrecs = maxrecs / 2 + +The next variable to determine is the desired loading factor. +This must be at least minrecs and no more than maxrecs. +Choosing minrecs is undesirable because it wastes half the block. +Choosing maxrecs is also undesirable because adding a single record to each +newly rebuilt leaf block will cause a tree split, which causes a noticeable +drop in performance immediately afterwards. 
+The default loading factor was chosen to be 75% of maxrecs, which provides a
+reasonably compact structure without any immediate split penalties::
+
+        default_load_factor = (maxrecs + minrecs) / 2
+
+If space is tight, the loading factor will be set to maxrecs to try to avoid
+running out of space::
+
+        leaf_load_factor = enough space ? default_load_factor : maxrecs
+
+Load factor is computed for btree node blocks using the combined size of the
+btree key and pointer as the record size::
+
+        maxrecs = (block_size - header_size) / (key_size + ptr_size)
+        minrecs = maxrecs / 2
+        node_load_factor = enough space ? default_load_factor : maxrecs
+
+Once that's done, the number of leaf blocks required to store the record set
+can be computed as::
+
+        leaf_blocks = ceil(record_count / leaf_load_factor)
+
+The number of node blocks needed to point to the next level down in the tree
+is computed as::
+
+        n_blocks = (n == 0 ? leaf_blocks : node_blocks[n])
+        node_blocks[n + 1] = ceil(n_blocks / node_load_factor)
+
+The entire computation is performed recursively until the current level only
+needs one block.
+The resulting geometry is as follows:
+
+- For AG-rooted btrees, this level is the root level, so the height of the new
+  tree is ``level + 1`` and the space needed is the summation of the number of
+  blocks on each level.
+
+- For inode-rooted btrees where the records in the top level do not fit in the
+  inode fork area, the height is ``level + 2``, the space needed is the
+  summation of the number of blocks on each level, and the inode fork points to
+  the root block.
+
+- For inode-rooted btrees where the records in the top level can be stored in
+  the inode fork area, then the root block can be stored in the inode, the
+  height is ``level + 1``, and the space needed is one less than the summation
+  of the number of blocks on each level.
+  This only becomes relevant when non-bmap btrees gain the ability to root in
+  an inode, which is a future patchset and only included here for completeness.
+
+.. _newbt:
+
+Reserving New B+Tree Blocks
+```````````````````````````
+
+Once repair knows the number of blocks needed for the new btree, it allocates
+those blocks using the free space information.
+Each reserved extent is tracked separately by the btree builder state data.
+To improve crash resilience, the reservation code also logs an Extent Freeing
+Intent (EFI) item in the same transaction as each space allocation and attaches
+its in-memory ``struct xfs_extent_free_item`` object to the space reservation.
+If the system goes down, log recovery will use the unfinished EFIs to free the
+unused space, leaving the filesystem unchanged.
+
+Each time the btree builder claims a block for the btree from a reserved
+extent, it updates the in-memory reservation to reflect the claimed space.
+Block reservation tries to allocate as much contiguous space as possible to
+reduce the number of EFIs in play.
+
+While repair is writing these new btree blocks, the EFIs created for the space
+reservations pin the tail of the ondisk log.
+It's possible that other parts of the system will remain busy and push the head
+of the log towards the pinned tail.
+To avoid livelocking the filesystem, the EFIs must not pin the tail of the log
+for too long.
+To alleviate this problem, the dynamic relogging capability of the deferred ops
+mechanism is reused here to commit a transaction at the log head containing an
+EFD for the old EFI and new EFI at the head.
+This enables the log to release the old EFI to keep the log moving forwards. + +EFIs have a role to play during the commit and reaping phases; please see the +next section and the section about :ref:`reaping` for more details. + +Proposed patchsets are the +`bitmap rework +`_ +and the +`preparation for bulk loading btrees +`_. + + +Writing the New Tree +```````````````````` + +This part is pretty simple -- the btree builder (``xfs_btree_bulkload``) claims +a block from the reserved list, writes the new btree block header, fills the +rest of the block with records, and adds the new leaf block to a list of +written blocks:: + + ┌────┐ + │leaf│ + │RRR │ + └────┘ + +Sibling pointers are set every time a new block is added to the level:: + + ┌────┐ ┌────┐ ┌────┐ ┌────┐ + │leaf│→│leaf│→│leaf│→│leaf│ + │RRR │←│RRR │←│RRR │←│RRR │ + └────┘ └────┘ └────┘ └────┘ + +When it finishes writing the record leaf blocks, it moves on to the node +blocks +To fill a node block, it walks each block in the next level down in the tree +to compute the relevant keys and write them into the parent node:: + + ┌────┐ ┌────┐ + │node│──────→│node│ + │PP │←──────│PP │ + └────┘ └────┘ + ↙ ↘ ↙ ↘ + ┌────┐ ┌────┐ ┌────┐ ┌────┐ + │leaf│→│leaf│→│leaf│→│leaf│ + │RRR │←│RRR │←│RRR │←│RRR │ + └────┘ └────┘ └────┘ └────┘ + +When it reaches the root level, it is ready to commit the new btree!:: + + ┌─────────┐ + │ root │ + │ PP │ + └─────────┘ + ↙ ↘ + ┌────┐ ┌────┐ + │node│──────→│node│ + │PP │←──────│PP │ + └────┘ └────┘ + ↙ ↘ ↙ ↘ + ┌────┐ ┌────┐ ┌────┐ ┌────┐ + │leaf│→│leaf│→│leaf│→│leaf│ + │RRR │←│RRR │←│RRR │←│RRR │ + └────┘ └────┘ └────┘ └────┘ + +The first step to commit the new btree is to persist the btree blocks to disk +synchronously. +This is a little complicated because a new btree block could have been freed +in the recent past, so the builder must use ``xfs_buf_delwri_queue_here`` to +remove the (stale) buffer from the AIL list before it can write the new blocks +to disk. +Blocks are queued for IO using a delwri list and written in one large batch +with ``xfs_buf_delwri_submit``. + +Once the new blocks have been persisted to disk, control returns to the +individual repair function that called the bulk loader. +The repair function must log the location of the new root in a transaction, +clean up the space reservations that were made for the new btree, and reap the +old metadata blocks: + +1. Commit the location of the new btree root. + +2. For each incore reservation: + + a. Log Extent Freeing Done (EFD) items for all the space that was consumed + by the btree builder. The new EFDs must point to the EFIs attached to + the reservation to prevent log recovery from freeing the new blocks. + + b. For unclaimed portions of incore reservations, create a regular deferred + extent free work item to be free the unused space later in the + transaction chain. + + c. The EFDs and EFIs logged in steps 2a and 2b must not overrun the + reservation of the committing transaction. + If the btree loading code suspects this might be about to happen, it must + call ``xrep_defer_finish`` to clear out the deferred work and obtain a + fresh transaction. + +3. Clear out the deferred work a second time to finish the commit and clean + the repair transaction. + +The transaction rolling in steps 2c and 3 represent a weakness in the repair +algorithm, because a log flush and a crash before the end of the reap step can +result in space leaking. 
+Online repair functions minimize the chances of this occurring by using very +large transactions, which each can accommodate many thousands of block freeing +instructions. +Repair moves on to reaping the old blocks, which will be presented in a +subsequent :ref:`section` after a few case studies of bulk loading. + +Case Study: Rebuilding the Inode Index +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The high level process to rebuild the inode index btree is: + +1. Walk the reverse mapping records to generate ``struct xfs_inobt_rec`` + records from the inode chunk information and a bitmap of the old inode btree + blocks. + +2. Append the records to an xfarray in inode order. + +3. Use the ``xfs_btree_bload_compute_geometry`` function to compute the number + of blocks needed for the inode btree. + If the free space inode btree is enabled, call it again to estimate the + geometry of the finobt. + +4. Allocate the number of blocks computed in the previous step. + +5. Use ``xfs_btree_bload`` to write the xfarray records to btree blocks and + generate the internal node blocks. + If the free space inode btree is enabled, call it again to load the finobt. + +6. Commit the location of the new btree root block(s) to the AGI. + +7. Reap the old btree blocks using the bitmap created in step 1. + +Details are as follows. + +The inode btree maps inumbers to the ondisk location of the associated +inode records, which means that the inode btrees can be rebuilt from the +reverse mapping information. +Reverse mapping records with an owner of ``XFS_RMAP_OWN_INOBT`` marks the +location of the old inode btree blocks. +Each reverse mapping record with an owner of ``XFS_RMAP_OWN_INODES`` marks the +location of at least one inode cluster buffer. +A cluster is the smallest number of ondisk inodes that can be allocated or +freed in a single transaction; it is never smaller than 1 fs block or 4 inodes. + +For the space represented by each inode cluster, ensure that there are no +records in the free space btrees nor any records in the reference count btree. +If there are, the space metadata inconsistencies are reason enough to abort the +operation. +Otherwise, read each cluster buffer to check that its contents appear to be +ondisk inodes and to decide if the file is allocated +(``xfs_dinode.i_mode != 0``) or free (``xfs_dinode.i_mode == 0``). +Accumulate the results of successive inode cluster buffer reads until there is +enough information to fill a single inode chunk record, which is 64 consecutive +numbers in the inumber keyspace. +If the chunk is sparse, the chunk record may include holes. + +Once the repair function accumulates one chunk's worth of data, it calls +``xfarray_append`` to add the inode btree record to the xfarray. +This xfarray is walked twice during the btree creation step -- once to populate +the inode btree with all inode chunk records, and a second time to populate the +free inode btree with records for chunks that have free non-sparse inodes. +The number of records for the inode btree is the number of xfarray records, +but the record count for the free inode btree has to be computed as inode chunk +records are stored in the xfarray. + +The proposed patchset is the +`AG btree repair +`_ +series. + +Case Study: Rebuilding the Space Reference Counts +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Reverse mapping records are used to rebuild the reference count information. +Reference counts are required for correct operation of copy on write for shared +file data. 
+Imagine the reverse mapping entries as rectangles representing extents of +physical blocks, and that the rectangles can be laid down to allow them to +overlap each other. +From the diagram below, it is apparent that a reference count record must start +or end wherever the height of the stack changes. +In other words, the record emission stimulus is level-triggered:: + + █ ███ + ██ █████ ████ ███ ██████ + ██ ████ ███████████ ████ █████████ + ████████████████████████████████ ███████████ + ^ ^ ^^ ^^ ^ ^^ ^^^ ^^^^ ^ ^^ ^ ^ ^ + 2 1 23 21 3 43 234 2123 1 01 2 3 0 + +The ondisk reference count btree does not store the refcount == 0 cases because +the free space btree already records which blocks are free. +Extents being used to stage copy-on-write operations should be the only records +with refcount == 1. +Single-owner file blocks aren't recorded in either the free space or the +reference count btrees. + +The high level process to rebuild the reference count btree is: + +1. Walk the reverse mapping records to generate ``struct xfs_refcount_irec`` + records for any space having more than one reverse mapping and add them to + the xfarray. + Any records owned by ``XFS_RMAP_OWN_COW`` are also added to the xfarray + because these are extents allocated to stage a copy on write operation and + are tracked in the refcount btree. + + Use any records owned by ``XFS_RMAP_OWN_REFC`` to create a bitmap of old + refcount btree blocks. + +2. Sort the records in physical extent order, putting the CoW staging extents + at the end of the xfarray. + This matches the sorting order of records in the refcount btree. + +3. Use the ``xfs_btree_bload_compute_geometry`` function to compute the number + of blocks needed for the new tree. + +4. Allocate the number of blocks computed in the previous step. + +5. Use ``xfs_btree_bload`` to write the xfarray records to btree blocks and + generate the internal node blocks. + +6. Commit the location of new btree root block to the AGF. + +7. Reap the old btree blocks using the bitmap created in step 1. + +Details are as follows; the same algorithm is used by ``xfs_repair`` to +generate refcount information from reverse mapping records. + +- Until the reverse mapping btree runs out of records: + + - Retrieve the next record from the btree and put it in a bag. + + - Collect all records with the same starting block from the btree and put + them in the bag. + + - While the bag isn't empty: + + - Among the mappings in the bag, compute the lowest block number where the + reference count changes. + This position will be either the starting block number of the next + unprocessed reverse mapping or the next block after the shortest mapping + in the bag. + + - Remove all mappings from the bag that end at this position. + + - Collect all reverse mappings that start at this position from the btree + and put them in the bag. + + - If the size of the bag changed and is greater than one, create a new + refcount record associating the block number range that we just walked to + the size of the bag. + +The bag-like structure in this case is a type 2 xfarray as discussed in the +:ref:`xfarray access patterns` section. +Reverse mappings are added to the bag using ``xfarray_store_anywhere`` and +removed via ``xfarray_unset``. +Bag members are examined through ``xfarray_iter`` loops. + +The proposed patchset is the +`AG btree repair +`_ +series. 
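+
+The level-triggered emission rule is easy to model outside the kernel.
+The following standalone C program is only a toy model of the idea described
+above -- it is not the kernel implementation -- and sweeps a hardcoded set of
+overlapping mappings, printing one refcount record for every plateau where
+two or more mappings overlap:
+
+.. code-block:: c
+
+        #include <stdio.h>
+        #include <stdlib.h>
+
+        struct edge {
+                unsigned long long      pos;
+                int                     delta;
+        };
+
+        static int cmp_edge(const void *a, const void *b)
+        {
+                const struct edge *x = a, *y = b;
+
+                if (x->pos != y->pos)
+                        return x->pos < y->pos ? -1 : 1;
+                return 0;
+        }
+
+        int main(void)
+        {
+                /* toy reverse mappings: (startblock, blockcount) pairs */
+                unsigned long long maps[][2] = {
+                        { 0, 8 }, { 2, 4 }, { 4, 6 }, { 12, 2 },
+                };
+                enum { NR = sizeof(maps) / sizeof(maps[0]) };
+                struct edge edges[2 * NR];
+                unsigned long long prev = 0;
+                int count = 0;
+
+                /* +1 at each mapping's start, -1 just past its end */
+                for (int i = 0; i < NR; i++) {
+                        unsigned long long s = maps[i][0], l = maps[i][1];
+
+                        edges[2 * i] = (struct edge){ s, +1 };
+                        edges[2 * i + 1] = (struct edge){ s + l, -1 };
+                }
+                qsort(edges, 2 * NR, sizeof(struct edge), cmp_edge);
+
+                /* emit a record at each level change above a height of 1 */
+                for (int i = 0; i < 2 * NR; i++) {
+                        if (count > 1 && edges[i].pos > prev)
+                                printf("refcount: start %llu len %llu count %d\n",
+                                       prev, edges[i].pos - prev, count);
+                        if (edges[i].pos != prev)
+                                prev = edges[i].pos;
+                        count += edges[i].delta;
+                }
+                return 0;
+        }
+
+The in-kernel algorithm uses the xfarray bag instead of a sorted edge list so
+that it can consume reverse mappings in the order that the rmap btree provides
+them, but the records it emits follow the same level-triggered rule.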
+ +Case Study: Rebuilding File Fork Mapping Indices +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The high level process to rebuild a data/attr fork mapping btree is: + +1. Walk the reverse mapping records to generate ``struct xfs_bmbt_rec`` + records from the reverse mapping records for that inode and fork. + Append these records to an xfarray. + Compute the bitmap of the old bmap btree blocks from the ``BMBT_BLOCK`` + records. + +2. Use the ``xfs_btree_bload_compute_geometry`` function to compute the number + of blocks needed for the new tree. + +3. Sort the records in file offset order. + +4. If the extent records would fit in the inode fork immediate area, commit the + records to that immediate area and skip to step 8. + +5. Allocate the number of blocks computed in the previous step. + +6. Use ``xfs_btree_bload`` to write the xfarray records to btree blocks and + generate the internal node blocks. + +7. Commit the new btree root block to the inode fork immediate area. + +8. Reap the old btree blocks using the bitmap created in step 1. + +There are some complications here: +First, it's possible to move the fork offset to adjust the sizes of the +immediate areas if the data and attr forks are not both in BMBT format. +Second, if there are sufficiently few fork mappings, it may be possible to use +EXTENTS format instead of BMBT, which may require a conversion. +Third, the incore extent map must be reloaded carefully to avoid disturbing +any delayed allocation extents. + +The proposed patchset is the +`file mapping repair +`_ +series. + +.. _reaping: + +Reaping Old Metadata Blocks +--------------------------- + +Whenever online fsck builds a new data structure to replace one that is +suspect, there is a question of how to find and dispose of the blocks that +belonged to the old structure. +The laziest method of course is not to deal with them at all, but this slowly +leads to service degradations as space leaks out of the filesystem. +Hopefully, someone will schedule a rebuild of the free space information to +plug all those leaks. +Offline repair rebuilds all space metadata after recording the usage of +the files and directories that it decides not to clear, hence it can build new +structures in the discovered free space and avoid the question of reaping. + +As part of a repair, online fsck relies heavily on the reverse mapping records +to find space that is owned by the corresponding rmap owner yet truly free. +Cross referencing rmap records with other rmap records is necessary because +there may be other data structures that also think they own some of those +blocks (e.g. crosslinked trees). +Permitting the block allocator to hand them out again will not push the system +towards consistency. + +For space metadata, the process of finding extents to dispose of generally +follows this format: + +1. Create a bitmap of space used by data structures that must be preserved. + The space reservations used to create the new metadata can be used here if + the same rmap owner code is used to denote all of the objects being rebuilt. + +2. Survey the reverse mapping data to create a bitmap of space owned by the + same ``XFS_RMAP_OWN_*`` number for the metadata that is being preserved. + +3. Use the bitmap disunion operator to subtract (1) from (2). + The remaining set bits represent candidate extents that could be freed. + The process moves on to step 4 below. 
+ +Repairs for file-based metadata such as extended attributes, directories, +symbolic links, quota files and realtime bitmaps are performed by building a +new structure attached to a temporary file and swapping the forks. +Afterward, the mappings in the old file fork are the candidate blocks for +disposal. + +The process for disposing of old extents is as follows: + +4. For each candidate extent, count the number of reverse mapping records for + the first block in that extent that do not have the same rmap owner for the + data structure being repaired. + + - If zero, the block has a single owner and can be freed. + + - If not, the block is part of a crosslinked structure and must not be + freed. + +5. Starting with the next block in the extent, figure out how many more blocks + have the same zero/nonzero other owner status as that first block. + +6. If the region is crosslinked, delete the reverse mapping entry for the + structure being repaired and move on to the next region. + +7. If the region is to be freed, mark any corresponding buffers in the buffer + cache as stale to prevent log writeback. + +8. Free the region and move on. + +However, there is one complication to this procedure. +Transactions are of finite size, so the reaping process must be careful to roll +the transactions to avoid overruns. +Overruns come from two sources: + +a. EFIs logged on behalf of space that is no longer occupied + +b. Log items for buffer invalidations + +This is also a window in which a crash during the reaping process can leak +blocks. +As stated earlier, online repair functions use very large transactions to +minimize the chances of this occurring. + +The proposed patchset is the +`preparation for bulk loading btrees +`_ +series. + +Case Study: Reaping After a Regular Btree Repair +```````````````````````````````````````````````` + +Old reference count and inode btrees are the easiest to reap because they have +rmap records with special owner codes: ``XFS_RMAP_OWN_REFC`` for the refcount +btree, and ``XFS_RMAP_OWN_INOBT`` for the inode and free inode btrees. +Creating a list of extents to reap the old btree blocks is quite simple, +conceptually: + +1. Lock the relevant AGI/AGF header buffers to prevent allocation and frees. + +2. For each reverse mapping record with an rmap owner corresponding to the + metadata structure being rebuilt, set the corresponding range in a bitmap. + +3. Walk the current data structures that have the same rmap owner. + For each block visited, clear that range in the above bitmap. + +4. Each set bit in the bitmap represents a block that could be a block from the + old data structures and hence is a candidate for reaping. + In other words, ``(rmap_records_owned_by & ~blocks_reachable_by_walk)`` + are the blocks that might be freeable. + +If it is possible to maintain the AGF lock throughout the repair (which is the +common case), then step 2 can be performed at the same time as the reverse +mapping record walk that creates the records for the new btree. + +Case Study: Rebuilding the Free Space Indices +````````````````````````````````````````````` + +The high level process to rebuild the free space indices is: + +1. Walk the reverse mapping records to generate ``struct xfs_alloc_rec_incore`` + records from the gaps in the reverse mapping btree. + +2. Append the records to an xfarray. + +3. Use the ``xfs_btree_bload_compute_geometry`` function to compute the number + of blocks needed for each new tree. + +4. 
Allocate the number of blocks computed in the previous step from the free
+   space information collected.
+
+5. Use ``xfs_btree_bload`` to write the xfarray records to btree blocks and
+   generate the internal node blocks for the free space by length index.
+   Call it again for the free space by block number index.
+
+6. Commit the locations of the new btree root blocks to the AGF.
+
+7. Reap the old btree blocks by looking for space that is not recorded by the
+   reverse mapping btree, the new free space btrees, or the AGFL.
+
+Repairing the free space btrees has three key complications over a regular
+btree repair:
+
+First, free space is not explicitly tracked in the reverse mapping records.
+Hence, the new free space records must be inferred from gaps in the physical
+space component of the keyspace of the reverse mapping btree.
+
+Second, free space repairs cannot use the common btree reservation code because
+new blocks are reserved out of the free space btrees.
+This is impossible when repairing the free space btrees themselves.
+However, repair holds the AGF buffer lock for the duration of the free space
+index reconstruction, so it can use the collected free space information to
+supply the blocks for the new free space btrees.
+It is not necessary to back each reserved extent with an EFI because the new
+free space btrees are constructed in what the ondisk filesystem thinks is
+unowned space.
+However, if reserving blocks for the new btrees from the collected free space
+information changes the number of free space records, repair must re-estimate
+the new free space btree geometry with the new record count until the
+reservation is sufficient.
+As part of committing the new btrees, repair must ensure that reverse mappings
+are created for the reserved blocks and that unused reserved blocks are
+inserted into the free space btrees.
+Deferred rmap and freeing operations are used to ensure that this transition
+is atomic, similar to the other btree repair functions.
+
+Third, finding the blocks to reap after the repair is not overly
+straightforward.
+Blocks for the free space btrees and the reverse mapping btrees are supplied by
+the AGFL.
+Blocks put onto the AGFL have reverse mapping records with the owner
+``XFS_RMAP_OWN_AG``.
+This ownership is retained when blocks move from the AGFL into the free space
+btrees or the reverse mapping btrees.
+When repair walks reverse mapping records to synthesize free space records, it
+creates a bitmap (``ag_owner_bitmap``) of all the space claimed by
+``XFS_RMAP_OWN_AG`` records.
+The repair context maintains a second bitmap corresponding to the rmap btree
+blocks and the AGFL blocks (``rmap_agfl_bitmap``).
+When the walk is complete, the bitmap disunion operation ``(ag_owner_bitmap &
+~rmap_agfl_bitmap)`` computes the extents that are used by the old free space
+btrees.
+These blocks can then be reaped using the methods outlined above.
+
+The proposed patchset is the
+`AG btree repair
+`_
+series.
+
+.. _rmap_reap:
+
+Case Study: Reaping After Repairing Reverse Mapping Btrees
+``````````````````````````````````````````````````````````
+
+Old reverse mapping btrees are less difficult to reap after a repair.
+As mentioned in the previous section, blocks on the AGFL, the two free space
+btree blocks, and the reverse mapping btree blocks all have reverse mapping
+records with ``XFS_RMAP_OWN_AG`` as the owner.
+The full process of gathering reverse mapping records and building a new btree
+is described in the case study of
+:ref:`live rebuilds of rmap data `, but a crucial point from that
+discussion is that the new rmap btree will not contain any records for the old
+rmap btree, nor will the old btree blocks be tracked in the free space btrees.
+The list of candidate reaping blocks is computed by setting the bits
+corresponding to the gaps in the new rmap btree records, and then clearing the
+bits corresponding to extents in the free space btrees and the current AGFL
+blocks.
+The result, ``(new_rmapbt_gaps & ~(agfl | bnobt_records))``, is reaped using the
+methods outlined above.
+
+The rest of the process of rebuilding the reverse mapping btree is discussed
+in a separate :ref:`case study`.
+
+The proposed patchset is the
+`AG btree repair
+`_
+series.
+
+Case Study: Rebuilding the AGFL
+```````````````````````````````
+
+The allocation group free block list (AGFL) is repaired as follows:
+
+1. Create a bitmap for all the space that the reverse mapping data claims is
+   owned by ``XFS_RMAP_OWN_AG``.
+
+2. Subtract the space used by the two free space btrees and the rmap btree.
+
+3. Subtract any space that the reverse mapping data claims is owned by any
+   other owner, to avoid re-adding crosslinked blocks to the AGFL.
+
+4. Once the AGFL is full, reap any blocks leftover.
+
+5. The next operation to fix the freelist will right-size the list.
+
+See `fs/xfs/scrub/agheader_repair.c `_ for more details.
+
+Inode Record Repairs
+--------------------
+
+Inode records must be handled carefully, because they have both ondisk records
+("dinodes") and an in-memory ("cached") representation.
+There is a very high potential for cache coherency issues if online fsck is not
+careful to access the ondisk metadata *only* when the ondisk metadata is so
+badly damaged that the filesystem cannot load the in-memory representation.
+When online fsck wants to open a damaged file for scrubbing, it must use
+specialized resource acquisition functions that return either the in-memory
+representation *or* a lock on whichever object is necessary to prevent any
+update to the ondisk location.
+
+The only repairs that should be made to the ondisk inode buffers are whatever
+is necessary to get the in-core structure loaded.
+This means fixing whatever is caught by the inode cluster buffer and inode fork
+verifiers, and retrying the ``iget`` operation.
+If the second ``iget`` fails, the repair has failed.
+
+Once the in-memory representation is loaded, repair can lock the inode and can
+subject it to comprehensive checks, repairs, and optimizations.
+Most inode attributes are easy to check and constrain, or are user-controlled
+arbitrary bit patterns; these are both easy to fix.
+Dealing with the data and attr fork extent counts and the file block counts is
+more complicated, because computing the correct value requires traversing the
+forks, or if that fails, leaving the fields invalid and waiting for the fork
+fsck functions to run.
+
+The proposed patchset is the
+`inode
+`_
+repair series.
+
+Quota Record Repairs
+--------------------
+
+Similar to inodes, quota records ("dquots") also have both ondisk records and
+an in-memory representation, and hence are subject to the same cache coherency
+issues.
+Somewhat confusingly, both are known as dquots in the XFS codebase.
+
+The only repairs that should be made to the ondisk quota record buffers are
+whatever is necessary to get the in-core structure loaded.
+Once the in-memory representation is loaded, the only attributes needing +checking are obviously bad limits and timer values. + +Quota usage counters are checked, repaired, and discussed separately in the +section about :ref:`live quotacheck `. + +The proposed patchset is the +`quota +`_ +repair series. + +.. _fscounters: + +Freezing to Fix Summary Counters +-------------------------------- + +Filesystem summary counters track availability of filesystem resources such +as free blocks, free inodes, and allocated inodes. +This information could be compiled by walking the free space and inode indexes, +but this is a slow process, so XFS maintains a copy in the ondisk superblock +that should reflect the ondisk metadata, at least when the filesystem has been +unmounted cleanly. +For performance reasons, XFS also maintains incore copies of those counters, +which are key to enabling resource reservations for active transactions. +Writer threads reserve the worst-case quantities of resources from the +incore counter and give back whatever they don't use at commit time. +It is therefore only necessary to serialize on the superblock when the +superblock is being committed to disk. + +The lazy superblock counter feature introduced in XFS v5 took this even further +by training log recovery to recompute the summary counters from the AG headers, +which eliminated the need for most transactions even to touch the superblock. +The only time XFS commits the summary counters is at filesystem unmount. +To reduce contention even further, the incore counter is implemented as a +percpu counter, which means that each CPU is allocated a batch of blocks from a +global incore counter and can satisfy small allocations from the local batch. + +The high-performance nature of the summary counters makes it difficult for +online fsck to check them, since there is no way to quiesce a percpu counter +while the system is running. +Although online fsck can read the filesystem metadata to compute the correct +values of the summary counters, there's no way to hold the value of a percpu +counter stable, so it's quite possible that the counter will be out of date by +the time the walk is complete. +Earlier versions of online scrub would return to userspace with an incomplete +scan flag, but this is not a satisfying outcome for a system administrator. +For repairs, the in-memory counters must be stabilized while walking the +filesystem metadata to get an accurate reading and install it in the percpu +counter. + +To satisfy this requirement, online fsck must prevent other programs in the +system from initiating new writes to the filesystem, it must disable background +garbage collection threads, and it must wait for existing writer programs to +exit the kernel. +Once that has been established, scrub can walk the AG free space indexes, the +inode btrees, and the realtime bitmap to compute the correct value of all +four summary counters. +This is very similar to a filesystem freeze, though not all of the pieces are +necessary: + +- The final freeze state is set one higher than ``SB_FREEZE_COMPLETE`` to + prevent other threads from thawing the filesystem, or other scrub threads + from initiating another fscounters freeze. + +- It does not quiesce the log. + +With this code in place, it is now possible to pause the filesystem for just +long enough to check and correct the summary counters. 
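+
+Conceptually, the check that runs while the filesystem is held in this state
+boils down to summing each percpu counter and comparing it against the value
+computed from the ondisk metadata.
+The helper below is only a sketch of that comparison; the function name is
+made up for this example, and the real code must also handle reporting and
+repair rather than returning a simple boolean:
+
+.. code-block:: c
+
+        /*
+         * Compare the incore summary counters against values computed from
+         * a walk of the AG headers.  This is only meaningful while the
+         * scrub-only freeze described above is held, because otherwise the
+         * percpu counters keep moving underneath the comparison.
+         */
+        static bool
+        xchk_fscounters_match(
+                struct xfs_mount        *mp,
+                uint64_t                icount,
+                uint64_t                ifree,
+                uint64_t                fdblocks)
+        {
+                /* percpu_counter_sum walks every CPU batch for an exact value */
+                return percpu_counter_sum(&mp->m_icount) == icount &&
+                       percpu_counter_sum(&mp->m_ifree) == ifree &&
+                       percpu_counter_sum(&mp->m_fdblocks) == fdblocks;
+        }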
+ ++--------------------------------------------------------------------------+ +| **Historical Sidebar**: | ++--------------------------------------------------------------------------+ +| The initial implementation used the actual VFS filesystem freeze | +| mechanism to quiesce filesystem activity. | +| With the filesystem frozen, it is possible to resolve the counter values | +| with exact precision, but there are many problems with calling the VFS | +| methods directly: | +| | +| - Other programs can unfreeze the filesystem without our knowledge. | +| This leads to incorrect scan results and incorrect repairs. | +| | +| - Adding an extra lock to prevent others from thawing the filesystem | +| required the addition of a ``->freeze_super`` function to wrap | +| ``freeze_fs()``. | +| This in turn caused other subtle problems because it turns out that | +| the VFS ``freeze_super`` and ``thaw_super`` functions can drop the | +| last reference to the VFS superblock, and any subsequent access | +| becomes a UAF bug! | +| This can happen if the filesystem is unmounted while the underlying | +| block device has frozen the filesystem. | +| This problem could be solved by grabbing extra references to the | +| superblock, but it felt suboptimal given the other inadequacies of | +| this approach. | +| | +| - The log need not be quiesced to check the summary counters, but a VFS | +| freeze initiates one anyway. | +| This adds unnecessary runtime to live fscounter fsck operations. | +| | +| - Quiescing the log means that XFS flushes the (possibly incorrect) | +| counters to disk as part of cleaning the log. | +| | +| - A bug in the VFS meant that freeze could complete even when | +| sync_filesystem fails to flush the filesystem and returns an error. | +| This bug was fixed in Linux 5.17. | ++--------------------------------------------------------------------------+ + +The proposed patchset is the +`summary counter cleanup +`_ +series. + +Full Filesystem Scans +--------------------- + +Certain types of metadata can only be checked by walking every file in the +entire filesystem to record observations and comparing the observations against +what's recorded on disk. +Like every other type of online repair, repairs are made by writing those +observations to disk in a replacement structure and committing it atomically. +However, it is not practical to shut down the entire filesystem to examine +hundreds of billions of files because the downtime would be excessive. +Therefore, online fsck must build the infrastructure to manage a live scan of +all the files in the filesystem. +There are two questions that need to be solved to perform a live walk: + +- How does scrub manage the scan while it is collecting data? + +- How does the scan keep abreast of changes being made to the system by other + threads? + +.. _iscan: + +Coordinated Inode Scans +``````````````````````` + +In the original Unix filesystems of the 1970s, each directory entry contained +an index number (*inumber*) which was used as an index into on ondisk array +(*itable*) of fixed-size records (*inodes*) describing a file's attributes and +its data block mapping. +This system is described by J. Lions, `"inode (5659)" +`_ in *Lions' Commentary on +UNIX, 6th Edition*, (Dept. of Computer Science, the University of New South +Wales, November 1977), pp. 18-2; and later by D. Ritchie and K. Thompson, +`"Implementation of the File System" +`_, from *The UNIX +Time-Sharing System*, (The Bell System Technical Journal, July 1978), pp. +1913-4. 
+
+XFS retains most of this design, except now inumbers are search keys over all
+the space in the data section of the filesystem.
+They form a continuous keyspace that can be expressed as a 64-bit integer,
+though the inodes themselves are sparsely distributed within the keyspace.
+Scans proceed in a linear fashion across the inumber keyspace, starting from
+``0x0`` and ending at ``0xFFFFFFFFFFFFFFFF``.
+Naturally, a scan through a keyspace requires a scan cursor object to track the
+scan progress.
+Because this keyspace is sparse, this cursor contains two parts.
+The first part of this scan cursor object tracks the inode that will be
+examined next; call this the examination cursor.
+Somewhat less obviously, the scan cursor object must also track which parts of
+the keyspace have already been visited, which is critical for deciding if a
+concurrent filesystem update needs to be incorporated into the scan data.
+Call this the visited inode cursor.
+
+Advancing the scan cursor is a multi-step process encapsulated in
+``xchk_iscan_iter``:
+
+1. Lock the AGI buffer of the AG containing the inode pointed to by the visited
+   inode cursor.
+   This guarantees that inodes in this AG cannot be allocated or freed while
+   advancing the cursor.
+
+2. Use the per-AG inode btree to look up the next inumber after the one that
+   was just visited, since it may not be keyspace adjacent.
+
+3. If there are no more inodes left in this AG:
+
+   a. Move the examination cursor to the point of the inumber keyspace that
+      corresponds to the start of the next AG.
+
+   b. Adjust the visited inode cursor to indicate that it has "visited" the
+      last possible inode in the current AG's inode keyspace.
+      XFS inumbers are segmented, so the cursor needs to be marked as having
+      visited the entire keyspace up to just before the start of the next AG's
+      inode keyspace.
+
+   c. Unlock the AGI and return to step 1 if there are unexamined AGs in the
+      filesystem.
+
+   d. If there are no more AGs to examine, set both cursors to the end of the
+      inumber keyspace.
+      The scan is now complete.
+
+4. Otherwise, there is at least one more inode to scan in this AG:
+
+   a. Move the examination cursor ahead to the next inode marked as allocated
+      by the inode btree.
+
+   b. Adjust the visited inode cursor to point to the inode just prior to where
+      the examination cursor is now.
+      Because the scanner holds the AGI buffer lock, no inodes could have been
+      created in the part of the inode keyspace that the visited inode cursor
+      just advanced.
+
+5. Get the incore inode for the inumber of the examination cursor.
+   By maintaining the AGI buffer lock until this point, the scanner knows that
+   it was safe to advance the examination cursor across the entire keyspace,
+   and that it has stabilized this next inode so that it cannot disappear from
+   the filesystem until the scan releases the incore inode.
+
+6. Drop the AGI lock and return the incore inode to the caller.
+
+Online fsck functions scan all files in the filesystem as follows:
+
+1. Start a scan by calling ``xchk_iscan_start``.
+
+2. Advance the scan cursor (``xchk_iscan_iter``) to get the next inode.
+   If one is provided:
+
+   a. Lock the inode to prevent updates during the scan.
+
+   b. Scan the inode.
+
+   c. While still holding the inode lock, adjust the visited inode cursor
+      (``xchk_iscan_mark_visited``) to point to this inode.
+
+   d. Unlock and release the inode.
+
+3. Call ``xchk_iscan_teardown`` to complete the scan.
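+
+To make the second list above concrete, the following sketch shows how a
+scrubber might drive these helpers.
+The prototypes and the ``scrubber_*`` callbacks are simplified assumptions for
+illustration; the in-kernel signatures differ, and the caller is assumed to
+have already started the scan with ``xchk_iscan_start``.
+
+.. code-block:: c
+
+    struct xfs_scrub;
+    struct xfs_inode;
+    struct xchk_iscan;
+
+    /* Assumed, simplified prototypes for the scan helpers named above. */
+    int  xchk_iscan_iter(struct xchk_iscan *iscan, struct xfs_inode **ipp);
+    void xchk_iscan_mark_visited(struct xchk_iscan *iscan, struct xfs_inode *ip);
+    void xchk_iscan_teardown(struct xchk_iscan *iscan);
+
+    /* Hypothetical per-inode callbacks supplied by the individual scrubber. */
+    void scrubber_lock(struct xfs_inode *ip);
+    int  scrubber_check(struct xfs_scrub *sc, struct xfs_inode *ip);
+    void scrubber_unlock_and_irele(struct xfs_inode *ip);
+
+    /* Walk every file returned by an already-started coordinated scan. */
+    static int scrubber_walk_all_inodes(struct xfs_scrub *sc,
+                                        struct xchk_iscan *iscan)
+    {
+            struct xfs_inode *ip;
+            int error = 0;
+
+            /* A positive return is assumed to mean "here is the next inode". */
+            while (xchk_iscan_iter(iscan, &ip) == 1) {
+                    scrubber_lock(ip);
+                    error = scrubber_check(sc, ip);
+
+                    /* Record the visit while the inode lock is still held. */
+                    xchk_iscan_mark_visited(iscan, ip);
+                    scrubber_unlock_and_irele(ip);
+                    if (error)
+                            break;
+            }
+
+            xchk_iscan_teardown(iscan);
+            return error;
+    }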
+ +There are subtleties with the inode cache that complicate grabbing the incore +inode for the caller. +Obviously, it is an absolute requirement that the inode metadata be consistent +enough to load it into the inode cache. +Second, if the incore inode is stuck in some intermediate state, the scan +coordinator must release the AGI and push the main filesystem to get the inode +back into a loadable state. + +The proposed patches are the +`inode scanner +`_ +series. +The first user of the new functionality is the +`online quotacheck +`_ +series. + +Inode Management +```````````````` + +In regular filesystem code, references to allocated XFS incore inodes are +always obtained (``xfs_iget``) outside of transaction context because the +creation of the incore context for an existing file does not require metadata +updates. +However, it is important to note that references to incore inodes obtained as +part of file creation must be performed in transaction context because the +filesystem must ensure the atomicity of the ondisk inode btree index updates +and the initialization of the actual ondisk inode. + +References to incore inodes are always released (``xfs_irele``) outside of +transaction context because there are a handful of activities that might +require ondisk updates: + +- The VFS may decide to kick off writeback as part of a ``DONTCACHE`` inode + release. + +- Speculative preallocations need to be unreserved. + +- An unlinked file may have lost its last reference, in which case the entire + file must be inactivated, which involves releasing all of its resources in + the ondisk metadata and freeing the inode. + +These activities are collectively called inode inactivation. +Inactivation has two parts -- the VFS part, which initiates writeback on all +dirty file pages, and the XFS part, which cleans up XFS-specific information +and frees the inode if it was unlinked. +If the inode is unlinked (or unconnected after a file handle operation), the +kernel drops the inode into the inactivation machinery immediately. + +During normal operation, resource acquisition for an update follows this order +to avoid deadlocks: + +1. Inode reference (``iget``). + +2. Filesystem freeze protection, if repairing (``mnt_want_write_file``). + +3. Inode ``IOLOCK`` (VFS ``i_rwsem``) lock to control file IO. + +4. Inode ``MMAPLOCK`` (page cache ``invalidate_lock``) lock for operations that + can update page cache mappings. + +5. Log feature enablement. + +6. Transaction log space grant. + +7. Space on the data and realtime devices for the transaction. + +8. Incore dquot references, if a file is being repaired. + Note that they are not locked, merely acquired. + +9. Inode ``ILOCK`` for file metadata updates. + +10. AG header buffer locks / Realtime metadata inode ILOCK. + +11. Realtime metadata buffer locks, if applicable. + +12. Extent mapping btree blocks, if applicable. + +Resources are often released in the reverse order, though this is not required. +However, online fsck differs from regular XFS operations because it may examine +an object that normally is acquired in a later stage of the locking order, and +then decide to cross-reference the object with an object that is acquired +earlier in the order. +The next few sections detail the specific ways in which online fsck takes care +to avoid deadlocks. 
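+
+The acquisition order above can be read as a setup sequence.
+The sketch below uses invented ``hyp_*`` wrappers rather than the real XFS
+calls, collapses a few of the steps, and omits the unwinding that a real
+repair function performs when a later step fails; each call corresponds to an
+entry in the numbered list.
+
+.. code-block:: c
+
+    struct repair_ctx;
+
+    /* Invented wrappers standing in for the acquisition steps listed above. */
+    int  hyp_iget(struct repair_ctx *rx);            /* 1: inode reference */
+    int  hyp_freeze_protect(struct repair_ctx *rx);  /* 2: freeze protection */
+    void hyp_iolock(struct repair_ctx *rx);          /* 3: IOLOCK */
+    void hyp_mmaplock(struct repair_ctx *rx);        /* 4: MMAPLOCK */
+    int  hyp_log_feature(struct repair_ctx *rx);     /* 5: log feature */
+    int  hyp_trans_alloc(struct repair_ctx *rx);     /* 6-7: log/space grants */
+    int  hyp_dquot_attach(struct repair_ctx *rx);    /* 8: dquot references */
+    void hyp_ilock(struct repair_ctx *rx);           /* 9: ILOCK */
+    int  hyp_lock_ag_headers(struct repair_ctx *rx); /* 10: AG header buffers */
+
+    static int hyp_repair_setup(struct repair_ctx *rx)
+    {
+            int error = hyp_iget(rx);
+
+            if (!error)
+                    error = hyp_freeze_protect(rx);
+            if (error)
+                    return error;
+
+            hyp_iolock(rx);
+            hyp_mmaplock(rx);
+
+            error = hyp_log_feature(rx);
+            if (!error)
+                    error = hyp_trans_alloc(rx);
+            if (!error)
+                    error = hyp_dquot_attach(rx);
+            if (error)
+                    return error;
+
+            hyp_ilock(rx);
+
+            /* 11-12 (rt buffers, bmbt blocks) are acquired as needed, in order. */
+            return hyp_lock_ag_headers(rx);
+    }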
+
+iget and irele During a Scrub
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+An inode scan performed on behalf of a scrub operation runs in transaction
+context, and possibly with resources already locked and bound to it.
+This isn't much of a problem for ``iget`` since it can operate in the context
+of an existing transaction, as long as all of the bound resources are acquired
+before the inode reference in the regular filesystem.
+
+When the VFS ``iput`` function is given a linked inode with no other
+references, it normally puts the inode on an LRU list in the hope that it can
+save time if another process re-opens the file before the system runs out
+of memory and frees it.
+Filesystem callers can short-circuit the LRU process by setting a ``DONTCACHE``
+flag on the inode to cause the kernel to try to drop the inode into the
+inactivation machinery immediately.
+
+In the past, inactivation was always done from the process that dropped the
+inode, which was a problem for scrub because scrub may already hold a
+transaction, and XFS does not support nesting transactions.
+On the other hand, if there is no scrub transaction, it is desirable to drop
+otherwise unused inodes immediately to avoid polluting caches.
+To capture these nuances, the online fsck code has a separate ``xchk_irele``
+function to set or clear the ``DONTCACHE`` flag to get the required release
+behavior.
+
+Proposed patchsets include fixing
+`scrub iget usage
+`_ and
+`dir iget usage
+`_.
+
+.. _ilocking:
+
+Locking Inodes
+^^^^^^^^^^^^^^
+
+In regular filesystem code, the VFS and XFS will acquire multiple IOLOCK locks
+in a well-known order: parent → child when updating the directory tree, and
+in numerical order of the addresses of their ``struct inode`` object otherwise.
+For regular files, the MMAPLOCK can be acquired after the IOLOCK to stop page
+faults.
+If two MMAPLOCKs must be acquired, they are acquired in numerical order of
+the addresses of their ``struct address_space`` objects.
+Due to the structure of existing filesystem code, IOLOCKs and MMAPLOCKs must be
+acquired before transactions are allocated.
+If two ILOCKs must be acquired, they are acquired in inumber order.
+
+Inode lock acquisition must be done carefully during a coordinated inode scan.
+Online fsck cannot abide these conventions, because for a directory tree
+scanner, the scrub process holds the IOLOCK of the file being scanned and it
+needs to take the IOLOCK of the file at the other end of the directory link.
+If the directory tree is corrupt because it contains a cycle, ``xfs_scrub``
+cannot use the regular inode locking functions and avoid becoming trapped in an
+ABBA deadlock.
+
+Solving both of these problems is straightforward -- any time online fsck
+needs to take a second lock of the same class, it uses trylock to avoid an ABBA
+deadlock.
+If the trylock fails, scrub drops all inode locks and uses trylock loops to
+(re)acquire all necessary resources.
+Trylock loops enable scrub to check for pending fatal signals, which is how
+scrub avoids deadlocking the filesystem or becoming an unresponsive process.
+However, trylock loops mean that online fsck must be prepared to measure the
+resource being scrubbed before and after the lock cycle to detect changes and
+react accordingly.
+
+.. _dirparent:
+
+Case Study: Finding a Directory Parent
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Consider the directory parent pointer repair code as an example.
+Online fsck must verify that the dotdot dirent of a directory points up to a +parent directory, and that the parent directory contains exactly one dirent +pointing down to the child directory. +Fully validating this relationship (and repairing it if possible) requires a +walk of every directory on the filesystem while holding the child locked, and +while updates to the directory tree are being made. +The coordinated inode scan provides a way to walk the filesystem without the +possibility of missing an inode. +The child directory is kept locked to prevent updates to the dotdot dirent, but +if the scanner fails to lock a parent, it can drop and relock both the child +and the prospective parent. +If the dotdot entry changes while the directory is unlocked, then a move or +rename operation must have changed the child's parentage, and the scan can +exit early. + +The proposed patchset is the +`directory repair +`_ +series. + +.. _fshooks: + +Filesystem Hooks +````````````````` + +The second piece of support that online fsck functions need during a full +filesystem scan is the ability to stay informed about updates being made by +other threads in the filesystem, since comparisons against the past are useless +in a dynamic environment. +Two pieces of Linux kernel infrastructure enable online fsck to monitor regular +filesystem operations: filesystem hooks and :ref:`static keys`. + +Filesystem hooks convey information about an ongoing filesystem operation to +a downstream consumer. +In this case, the downstream consumer is always an online fsck function. +Because multiple fsck functions can run in parallel, online fsck uses the Linux +notifier call chain facility to dispatch updates to any number of interested +fsck processes. +Call chains are a dynamic list, which means that they can be configured at +run time. +Because these hooks are private to the XFS module, the information passed along +contains exactly what the checking function needs to update its observations. + +The current implementation of XFS hooks uses SRCU notifier chains to reduce the +impact to highly threaded workloads. +Regular blocking notifier chains use a rwsem and seem to have a much lower +overhead for single-threaded applications. +However, it may turn out that the combination of blocking chains and static +keys are a more performant combination; more study is needed here. + +The following pieces are necessary to hook a certain point in the filesystem: + +- A ``struct xfs_hooks`` object must be embedded in a convenient place such as + a well-known incore filesystem object. + +- Each hook must define an action code and a structure containing more context + about the action. + +- Hook providers should provide appropriate wrapper functions and structs + around the ``xfs_hooks`` and ``xfs_hook`` objects to take advantage of type + checking to ensure correct usage. + +- A callsite in the regular filesystem code must be chosen to call + ``xfs_hooks_call`` with the action code and data structure. + This place should be adjacent to (and not earlier than) the place where + the filesystem update is committed to the transaction. + In general, when the filesystem calls a hook chain, it should be able to + handle sleeping and should not be vulnerable to memory reclaim or locking + recursion. + However, the exact requirements are very dependent on the context of the hook + caller and the callee. 
+ +- The online fsck function should define a structure to hold scan data, a lock + to coordinate access to the scan data, and a ``struct xfs_hook`` object. + The scanner function and the regular filesystem code must acquire resources + in the same order; see the next section for details. + +- The online fsck code must contain a C function to catch the hook action code + and data structure. + If the object being updated has already been visited by the scan, then the + hook information must be applied to the scan data. + +- Prior to unlocking inodes to start the scan, online fsck must call + ``xfs_hooks_setup`` to initialize the ``struct xfs_hook``, and + ``xfs_hooks_add`` to enable the hook. + +- Online fsck must call ``xfs_hooks_del`` to disable the hook once the scan is + complete. + +The number of hooks should be kept to a minimum to reduce complexity. +Static keys are used to reduce the overhead of filesystem hooks to nearly +zero when online fsck is not running. + +.. _liveupdate: + +Live Updates During a Scan +`````````````````````````` + +The code paths of the online fsck scanning code and the :ref:`hooked` +filesystem code look like this:: + + other program + ↓ + inode lock ←────────────────────┐ + ↓ │ + AG header lock │ + ↓ │ + filesystem function │ + ↓ │ + notifier call chain │ same + ↓ ├─── inode + scrub hook function │ lock + ↓ │ + scan data mutex ←──┐ same │ + ↓ ├─── scan │ + update scan data │ lock │ + ↑ │ │ + scan data mutex ←──┘ │ + ↑ │ + inode lock ←────────────────────┘ + ↑ + scrub function + ↑ + inode scanner + ↑ + xfs_scrub + +These rules must be followed to ensure correct interactions between the +checking code and the code making an update to the filesystem: + +- Prior to invoking the notifier call chain, the filesystem function being + hooked must acquire the same lock that the scrub scanning function acquires + to scan the inode. + +- The scanning function and the scrub hook function must coordinate access to + the scan data by acquiring a lock on the scan data. + +- Scrub hook function must not add the live update information to the scan + observations unless the inode being updated has already been scanned. + The scan coordinator has a helper predicate (``xchk_iscan_want_live_update``) + for this. + +- Scrub hook functions must not change the caller's state, including the + transaction that it is running. + They must not acquire any resources that might conflict with the filesystem + function being hooked. + +- The hook function can abort the inode scan to avoid breaking the other rules. + +The inode scan APIs are pretty simple: + +- ``xchk_iscan_start`` starts a scan + +- ``xchk_iscan_iter`` grabs a reference to the next inode in the scan or + returns zero if there is nothing left to scan + +- ``xchk_iscan_want_live_update`` to decide if an inode has already been + visited in the scan. + This is critical for hook functions to decide if they need to update the + in-memory scan information. + +- ``xchk_iscan_mark_visited`` to mark an inode as having been visited in the + scan + +- ``xchk_iscan_teardown`` to finish the scan + +This functionality is also a part of the +`inode scanner +`_ +series. + +.. _quotacheck: + +Case Study: Quota Counter Checking +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +It is useful to compare the mount time quotacheck code to the online repair +quotacheck code. +Mount time quotacheck does not have to contend with concurrent operations, so +it does the following: + +1. 
Make sure the ondisk dquots are in good enough shape that all the incore + dquots will actually load, and zero the resource usage counters in the + ondisk buffer. + +2. Walk every inode in the filesystem. + Add each file's resource usage to the incore dquot. + +3. Walk each incore dquot. + If the incore dquot is not being flushed, add the ondisk buffer backing the + incore dquot to a delayed write (delwri) list. + +4. Write the buffer list to disk. + +Like most online fsck functions, online quotacheck can't write to regular +filesystem objects until the newly collected metadata reflect all filesystem +state. +Therefore, online quotacheck records file resource usage to a shadow dquot +index implemented with a sparse ``xfarray``, and only writes to the real dquots +once the scan is complete. +Handling transactional updates is tricky because quota resource usage updates +are handled in phases to minimize contention on dquots: + +1. The inodes involved are joined and locked to a transaction. + +2. For each dquot attached to the file: + + a. The dquot is locked. + + b. A quota reservation is added to the dquot's resource usage. + The reservation is recorded in the transaction. + + c. The dquot is unlocked. + +3. Changes in actual quota usage are tracked in the transaction. + +4. At transaction commit time, each dquot is examined again: + + a. The dquot is locked again. + + b. Quota usage changes are logged and unused reservation is given back to + the dquot. + + c. The dquot is unlocked. + +For online quotacheck, hooks are placed in steps 2 and 4. +The step 2 hook creates a shadow version of the transaction dquot context +(``dqtrx``) that operates in a similar manner to the regular code. +The step 4 hook commits the shadow ``dqtrx`` changes to the shadow dquots. +Notice that both hooks are called with the inode locked, which is how the +live update coordinates with the inode scanner. + +The quotacheck scan looks like this: + +1. Set up a coordinated inode scan. + +2. For each inode returned by the inode scan iterator: + + a. Grab and lock the inode. + + b. Determine that inode's resource usage (data blocks, inode counts, + realtime blocks) and add that to the shadow dquots for the user, group, + and project ids associated with the inode. + + c. Unlock and release the inode. + +3. For each dquot in the system: + + a. Grab and lock the dquot. + + b. Check the dquot against the shadow dquots created by the scan and updated + by the live hooks. + +Live updates are key to being able to walk every quota record without +needing to hold any locks for a long duration. +If repairs are desired, the real and shadow dquots are locked and their +resource counts are set to the values in the shadow dquot. + +The proposed patchset is the +`online quotacheck +`_ +series. + +.. _nlinks: + +Case Study: File Link Count Checking +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +File link count checking also uses live update hooks. +The coordinated inode scanner is used to visit all directories on the +filesystem, and per-file link count records are stored in a sparse ``xfarray`` +indexed by inumber. +During the scanning phase, each entry in a directory generates observation +data as follows: + +1. If the entry is a dotdot (``'..'``) entry of the root directory, the + directory's parent link count is bumped because the root directory's dotdot + entry is self referential. + +2. If the entry is a dotdot entry of a subdirectory, the parent's backref + count is bumped. + +3. 
If the entry is neither a dot nor a dotdot entry, the target file's parent + count is bumped. + +4. If the target is a subdirectory, the parent's child link count is bumped. + +A crucial point to understand about how the link count inode scanner interacts +with the live update hooks is that the scan cursor tracks which *parent* +directories have been scanned. +In other words, the live updates ignore any update about ``A → B`` when A has +not been scanned, even if B has been scanned. +Furthermore, a subdirectory A with a dotdot entry pointing back to B is +accounted as a backref counter in the shadow data for A, since child dotdot +entries affect the parent's link count. +Live update hooks are carefully placed in all parts of the filesystem that +create, change, or remove directory entries, since those operations involve +bumplink and droplink. + +For any file, the correct link count is the number of parents plus the number +of child subdirectories. +Non-directories never have children of any kind. +The backref information is used to detect inconsistencies in the number of +links pointing to child subdirectories and the number of dotdot entries +pointing back. + +After the scan completes, the link count of each file can be checked by locking +both the inode and the shadow data, and comparing the link counts. +A second coordinated inode scan cursor is used for comparisons. +Live updates are key to being able to walk every inode without needing to hold +any locks between inodes. +If repairs are desired, the inode's link count is set to the value in the +shadow information. +If no parents are found, the file must be :ref:`reparented ` to the +orphanage to prevent the file from being lost forever. + +The proposed patchset is the +`file link count repair +`_ +series. + +.. _rmap_repair: + +Case Study: Rebuilding Reverse Mapping Records +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Most repair functions follow the same pattern: lock filesystem resources, +walk the surviving ondisk metadata looking for replacement metadata records, +and use an :ref:`in-memory array ` to store the gathered observations. +The primary advantage of this approach is the simplicity and modularity of the +repair code -- code and data are entirely contained within the scrub module, +do not require hooks in the main filesystem, and are usually the most efficient +in memory use. +A secondary advantage of this repair approach is atomicity -- once the kernel +decides a structure is corrupt, no other threads can access the metadata until +the kernel finishes repairing and revalidating the metadata. + +For repairs going on within a shard of the filesystem, these advantages +outweigh the delays inherent in locking the shard while repairing parts of the +shard. +Unfortunately, repairs to the reverse mapping btree cannot use the "standard" +btree repair strategy because it must scan every space mapping of every fork of +every file in the filesystem, and the filesystem cannot stop. +Therefore, rmap repair foregoes atomicity between scrub and repair. +It combines a :ref:`coordinated inode scanner `, :ref:`live update hooks +`, and an :ref:`in-memory rmap btree ` to complete the +scan for reverse mapping records. + +1. Set up an xfbtree to stage rmap records. + +2. While holding the locks on the AGI and AGF buffers acquired during the + scrub, generate reverse mappings for all AG metadata: inodes, btrees, CoW + staging extents, and the internal log. + +3. Set up an inode scanner. + +4. 
Hook into rmap updates for the AG being repaired so that the live scan data
+   can receive updates to the rmap btree from the rest of the filesystem during
+   the file scan.
+
+5. For each space mapping found in either fork of each file scanned,
+   decide if the mapping matches the AG of interest.
+   If so:
+
+   a. Create a btree cursor for the in-memory btree.
+
+   b. Use the rmap code to add the record to the in-memory btree.
+
+   c. Use the :ref:`special commit function ` to write the
+      xfbtree changes to the xfile.
+
+6. For each live update received via the hook, decide if the owner has already
+   been scanned.
+   If so, apply the live update into the scan data:
+
+   a. Create a btree cursor for the in-memory btree.
+
+   b. Replay the operation into the in-memory btree.
+
+   c. Use the :ref:`special commit function ` to write the
+      xfbtree changes to the xfile.
+      This is performed with an empty transaction to avoid changing the
+      caller's state.
+
+7. When the inode scan finishes, create a new scrub transaction and relock the
+   two AG headers.
+
+8. Compute the new btree geometry using the number of rmap records in the
+   shadow btree, like all other btree rebuilding functions.
+
+9. Allocate the number of blocks computed in the previous step.
+
+10. Perform the usual btree bulk loading and commit to install the new rmap
+    btree.
+
+11. Reap the old rmap btree blocks as discussed in the case study about how
+    to :ref:`reap after rmap btree repair `.
+
+12. Free the xfbtree now that it is not needed.
+
+The proposed patchset is the
+`rmap repair
+`_
+series.
+
+Staging Repairs with Temporary Files on Disk
+--------------------------------------------
+
+XFS stores a substantial amount of metadata in file forks: directories,
+extended attributes, symbolic link targets, free space bitmaps and summary
+information for the realtime volume, and quota records.
+File forks map 64-bit logical file fork space extents to physical storage space
+extents, similar to how a memory management unit maps 64-bit virtual addresses
+to physical memory addresses.
+Therefore, file-based tree structures (such as directories and extended
+attributes) use blocks mapped in the file fork offset address space that point
+to other blocks mapped within that same address space, and file-based linear
+structures (such as bitmaps and quota records) compute array element offsets in
+the file fork offset address space.
+
+Because file forks can consume as much space as the entire filesystem, repairs
+cannot be staged in memory, even when a paging scheme is available.
+Therefore, online repair of file-based metadata creates a temporary file in
+the XFS filesystem, writes a new structure at the correct offsets into the
+temporary file, and atomically swaps the fork mappings (and hence the fork
+contents) to commit the repair.
+Once the repair is complete, the old fork can be reaped as necessary; if the
+system goes down during the reap, the iunlink code will delete the blocks
+during log recovery.
+
+**Note**: All space usage and inode indices in the filesystem *must* be
+consistent to use a temporary file safely!
+This dependency is the reason why online repair can only use pageable kernel
+memory to stage ondisk space usage information.
+
+Swapping metadata extents with a temporary file requires the owner field of the
+block headers to match the file being repaired and not the temporary file. The
+directory, extended attribute, and symbolic link functions were all modified to
+allow callers to specify owner numbers explicitly.
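+
+At a high level, every repair that stages its work in a temporary file follows
+the same four phases.
+The helper names in this sketch are invented placeholders; the real helpers
+(``xrep_tempfile_create``, ``xrep_tempswap_contents``, and friends) are
+introduced in the sections that follow.
+
+.. code-block:: c
+
+    struct xfs_scrub;
+
+    /* Hypothetical helpers, one per phase of the staging strategy. */
+    int stage_create_tempfile(struct xfs_scrub *sc);    /* allocate staging file */
+    int stage_write_new_metadata(struct xfs_scrub *sc); /* rebuild at the right
+                                                           fork offsets */
+    int stage_swap_fork_mappings(struct xfs_scrub *sc); /* atomic extent swap */
+    int stage_reap_old_blocks(struct xfs_scrub *sc);    /* free old structure */
+
+    static int stage_repair_file_metadata(struct xfs_scrub *sc)
+    {
+            int error = stage_create_tempfile(sc);
+
+            if (!error)
+                    error = stage_write_new_metadata(sc);
+            if (!error)
+                    error = stage_swap_fork_mappings(sc);
+            if (!error)
+                    error = stage_reap_old_blocks(sc);
+            return error;
+    }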
+ +There is a downside to the reaping process -- if the system crashes during the +reap phase and the fork extents are crosslinked, the iunlink processing will +fail because freeing space will find the extra reverse mappings and abort. + +Temporary files created for repair are similar to ``O_TMPFILE`` files created +by userspace. +They are not linked into a directory and the entire file will be reaped when +the last reference to the file is lost. +The key differences are that these files must have no access permission outside +the kernel at all, they must be specially marked to prevent them from being +opened by handle, and they must never be linked into the directory tree. + ++--------------------------------------------------------------------------+ +| **Historical Sidebar**: | ++--------------------------------------------------------------------------+ +| In the initial iteration of file metadata repair, the damaged metadata | +| blocks would be scanned for salvageable data; the extents in the file | +| fork would be reaped; and then a new structure would be built in its | +| place. | +| This strategy did not survive the introduction of the atomic repair | +| requirement expressed earlier in this document. | +| | +| The second iteration explored building a second structure at a high | +| offset in the fork from the salvage data, reaping the old extents, and | +| using a ``COLLAPSE_RANGE`` operation to slide the new extents into | +| place. | +| | +| This had many drawbacks: | +| | +| - Array structures are linearly addressed, and the regular filesystem | +| codebase does not have the concept of a linear offset that could be | +| applied to the record offset computation to build an alternate copy. | +| | +| - Extended attributes are allowed to use the entire attr fork offset | +| address space. | +| | +| - Even if repair could build an alternate copy of a data structure in a | +| different part of the fork address space, the atomic repair commit | +| requirement means that online repair would have to be able to perform | +| a log assisted ``COLLAPSE_RANGE`` operation to ensure that the old | +| structure was completely replaced. | +| | +| - A crash after construction of the secondary tree but before the range | +| collapse would leave unreachable blocks in the file fork. | +| This would likely confuse things further. | +| | +| - Reaping blocks after a repair is not a simple operation, and | +| initiating a reap operation from a restarted range collapse operation | +| during log recovery is daunting. | +| | +| - Directory entry blocks and quota records record the file fork offset | +| in the header area of each block. | +| An atomic range collapse operation would have to rewrite this part of | +| each block header. | +| Rewriting a single field in block headers is not a huge problem, but | +| it's something to be aware of. | +| | +| - Each block in a directory or extended attributes btree index contains | +| sibling and child block pointers. | +| Were the atomic commit to use a range collapse operation, each block | +| would have to be rewritten very carefully to preserve the graph | +| structure. | +| Doing this as part of a range collapse means rewriting a large number | +| of blocks repeatedly, which is not conducive to quick repairs. | +| | +| This lead to the introduction of temporary file staging. 
| ++--------------------------------------------------------------------------+ + +Using a Temporary File +`````````````````````` + +Online repair code should use the ``xrep_tempfile_create`` function to create a +temporary file inside the filesystem. +This allocates an inode, marks the in-core inode private, and attaches it to +the scrub context. +These files are hidden from userspace, may not be added to the directory tree, +and must be kept private. + +Temporary files only use two inode locks: the IOLOCK and the ILOCK. +The MMAPLOCK is not needed here, because there must not be page faults from +userspace for data fork blocks. +The usage patterns of these two locks are the same as for any other XFS file -- +access to file data are controlled via the IOLOCK, and access to file metadata +are controlled via the ILOCK. +Locking helpers are provided so that the temporary file and its lock state can +be cleaned up by the scrub context. +To comply with the nested locking strategy laid out in the :ref:`inode +locking` section, it is recommended that scrub functions use the +xrep_tempfile_ilock*_nowait lock helpers. + +Data can be written to a temporary file by two means: + +1. ``xrep_tempfile_copyin`` can be used to set the contents of a regular + temporary file from an xfile. + +2. The regular directory, symbolic link, and extended attribute functions can + be used to write to the temporary file. + +Once a good copy of a data file has been constructed in a temporary file, it +must be conveyed to the file being repaired, which is the topic of the next +section. + +The proposed patches are in the +`repair temporary files +`_ +series. + +Atomic Extent Swapping +---------------------- + +Once repair builds a temporary file with a new data structure written into +it, it must commit the new changes into the existing file. +It is not possible to swap the inumbers of two files, so instead the new +metadata must replace the old. +This suggests the need for the ability to swap extents, but the existing extent +swapping code used by the file defragmenting tool ``xfs_fsr`` is not sufficient +for online repair because: + +a. When the reverse-mapping btree is enabled, the swap code must keep the + reverse mapping information up to date with every exchange of mappings. + Therefore, it can only exchange one mapping per transaction, and each + transaction is independent. + +b. Reverse-mapping is critical for the operation of online fsck, so the old + defragmentation code (which swapped entire extent forks in a single + operation) is not useful here. + +c. Defragmentation is assumed to occur between two files with identical + contents. + For this use case, an incomplete exchange will not result in a user-visible + change in file contents, even if the operation is interrupted. + +d. Online repair needs to swap the contents of two files that are by definition + *not* identical. + For directory and xattr repairs, the user-visible contents might be the + same, but the contents of individual blocks may be very different. + +e. Old blocks in the file may be cross-linked with another structure and must + not reappear if the system goes down mid-repair. + +These problems are overcome by creating a new deferred operation and a new type +of log intent item to track the progress of an operation to exchange two file +ranges. +The new deferred operation type chains together the same transactions used by +the reverse-mapping extent swap code. 
+The new log item records the progress of the exchange to ensure that once an +exchange begins, it will always run to completion, even there are +interruptions. +The new ``XFS_SB_FEAT_INCOMPAT_LOG_ATOMIC_SWAP`` log-incompatible feature flag +in the superblock protects these new log item records from being replayed on +old kernels. + +The proposed patchset is the +`atomic extent swap +`_ +series. + ++--------------------------------------------------------------------------+ +| **Sidebar: Using Log-Incompatible Feature Flags** | ++--------------------------------------------------------------------------+ +| Starting with XFS v5, the superblock contains a | +| ``sb_features_log_incompat`` field to indicate that the log contains | +| records that might not readable by all kernels that could mount this | +| filesystem. | +| In short, log incompat features protect the log contents against kernels | +| that will not understand the contents. | +| Unlike the other superblock feature bits, log incompat bits are | +| ephemeral because an empty (clean) log does not need protection. | +| The log cleans itself after its contents have been committed into the | +| filesystem, either as part of an unmount or because the system is | +| otherwise idle. | +| Because upper level code can be working on a transaction at the same | +| time that the log cleans itself, it is necessary for upper level code to | +| communicate to the log when it is going to use a log incompatible | +| feature. | +| | +| The log coordinates access to incompatible features through the use of | +| one ``struct rw_semaphore`` for each feature. | +| The log cleaning code tries to take this rwsem in exclusive mode to | +| clear the bit; if the lock attempt fails, the feature bit remains set. | +| Filesystem code signals its intention to use a log incompat feature in a | +| transaction by calling ``xlog_use_incompat_feat``, which takes the rwsem | +| in shared mode. | +| The code supporting a log incompat feature should create wrapper | +| functions to obtain the log feature and call | +| ``xfs_add_incompat_log_feature`` to set the feature bits in the primary | +| superblock. | +| The superblock update is performed transactionally, so the wrapper to | +| obtain log assistance must be called just prior to the creation of the | +| transaction that uses the functionality. | +| For a file operation, this step must happen after taking the IOLOCK | +| and the MMAPLOCK, but before allocating the transaction. | +| When the transaction is complete, the ``xlog_drop_incompat_feat`` | +| function is called to release the feature. | +| The feature bit will not be cleared from the superblock until the log | +| becomes clean. | +| | +| Log-assisted extended attribute updates and atomic extent swaps both use | +| log incompat features and provide convenience wrappers around the | +| functionality. | ++--------------------------------------------------------------------------+ + +Mechanics of an Atomic Extent Swap +`````````````````````````````````` + +Swapping entire file forks is a complex task. +The goal is to exchange all file fork mappings between two file fork offset +ranges. +There are likely to be many extent mappings in each fork, and the edges of +the mappings aren't necessarily aligned. +Furthermore, there may be other updates that need to happen after the swap, +such as exchanging file sizes, inode flags, or conversion of fork data to local +format. +This is roughly the format of the new deferred extent swap work item: + +.. 
code-block:: c + + struct xfs_swapext_intent { + /* Inodes participating in the operation. */ + struct xfs_inode *sxi_ip1; + struct xfs_inode *sxi_ip2; + + /* File offset range information. */ + xfs_fileoff_t sxi_startoff1; + xfs_fileoff_t sxi_startoff2; + xfs_filblks_t sxi_blockcount; + + /* Set these file sizes after the operation, unless negative. */ + xfs_fsize_t sxi_isize1; + xfs_fsize_t sxi_isize2; + + /* XFS_SWAP_EXT_* log operation flags */ + uint64_t sxi_flags; + }; + +The new log intent item contains enough information to track two logical fork +offset ranges: ``(inode1, startoff1, blockcount)`` and ``(inode2, startoff2, +blockcount)``. +Each step of a swap operation exchanges the largest file range mapping possible +from one file to the other. +After each step in the swap operation, the two startoff fields are incremented +and the blockcount field is decremented to reflect the progress made. +The flags field captures behavioral parameters such as swapping the attr fork +instead of the data fork and other work to be done after the extent swap. +The two isize fields are used to swap the file size at the end of the operation +if the file data fork is the target of the swap operation. + +When the extent swap is initiated, the sequence of operations is as follows: + +1. Create a deferred work item for the extent swap. + At the start, it should contain the entirety of the file ranges to be + swapped. + +2. Call ``xfs_defer_finish`` to process the exchange. + This is encapsulated in ``xrep_tempswap_contents`` for scrub operations. + This will log an extent swap intent item to the transaction for the deferred + extent swap work item. + +3. Until ``sxi_blockcount`` of the deferred extent swap work item is zero, + + a. Read the block maps of both file ranges starting at ``sxi_startoff1`` and + ``sxi_startoff2``, respectively, and compute the longest extent that can + be swapped in a single step. + This is the minimum of the two ``br_blockcount`` s in the mappings. + Keep advancing through the file forks until at least one of the mappings + contains written blocks. + Mutual holes, unwritten extents, and extent mappings to the same physical + space are not exchanged. + + For the next few steps, this document will refer to the mapping that came + from file 1 as "map1", and the mapping that came from file 2 as "map2". + + b. Create a deferred block mapping update to unmap map1 from file 1. + + c. Create a deferred block mapping update to unmap map2 from file 2. + + d. Create a deferred block mapping update to map map1 into file 2. + + e. Create a deferred block mapping update to map map2 into file 1. + + f. Log the block, quota, and extent count updates for both files. + + g. Extend the ondisk size of either file if necessary. + + h. Log an extent swap done log item for the extent swap intent log item + that was read at the start of step 3. + + i. Compute the amount of file range that has just been covered. + This quantity is ``(map1.br_startoff + map1.br_blockcount - + sxi_startoff1)``, because step 3a could have skipped holes. + + j. Increase the starting offsets of ``sxi_startoff1`` and ``sxi_startoff2`` + by the number of blocks computed in the previous step, and decrease + ``sxi_blockcount`` by the same quantity. + This advances the cursor. + + k. Log a new extent swap intent log item reflecting the advanced state of + the work item. + + l. Return the proper error code (EAGAIN) to the deferred operation manager + to inform it that there is more work to be done. 
+ The operation manager completes the deferred work in steps 3b-3e before + moving back to the start of step 3. + +4. Perform any post-processing. + This will be discussed in more detail in subsequent sections. + +If the filesystem goes down in the middle of an operation, log recovery will +find the most recent unfinished extent swap log intent item and restart from +there. +This is how extent swapping guarantees that an outside observer will either see +the old broken structure or the new one, and never a mismash of both. + +Preparation for Extent Swapping +``````````````````````````````` + +There are a few things that need to be taken care of before initiating an +atomic extent swap operation. +First, regular files require the page cache to be flushed to disk before the +operation begins, and directio writes to be quiesced. +Like any filesystem operation, extent swapping must determine the maximum +amount of disk space and quota that can be consumed on behalf of both files in +the operation, and reserve that quantity of resources to avoid an unrecoverable +out of space failure once it starts dirtying metadata. +The preparation step scans the ranges of both files to estimate: + +- Data device blocks needed to handle the repeated updates to the fork + mappings. +- Change in data and realtime block counts for both files. +- Increase in quota usage for both files, if the two files do not share the + same set of quota ids. +- The number of extent mappings that will be added to each file. +- Whether or not there are partially written realtime extents. + User programs must never be able to access a realtime file extent that maps + to different extents on the realtime volume, which could happen if the + operation fails to run to completion. + +The need for precise estimation increases the run time of the swap operation, +but it is very important to maintain correct accounting. +The filesystem must not run completely out of free space, nor can the extent +swap ever add more extent mappings to a fork than it can support. +Regular users are required to abide the quota limits, though metadata repairs +may exceed quota to resolve inconsistent metadata elsewhere. + +Special Features for Swapping Metadata File Extents +``````````````````````````````````````````````````` + +Extended attributes, symbolic links, and directories can set the fork format to +"local" and treat the fork as a literal area for data storage. +Metadata repairs must take extra steps to support these cases: + +- If both forks are in local format and the fork areas are large enough, the + swap is performed by copying the incore fork contents, logging both forks, + and committing. + The atomic extent swap mechanism is not necessary, since this can be done + with a single transaction. + +- If both forks map blocks, then the regular atomic extent swap is used. + +- Otherwise, only one fork is in local format. + The contents of the local format fork are converted to a block to perform the + swap. + The conversion to block format must be done in the same transaction that + logs the initial extent swap intent log item. + The regular atomic extent swap is used to exchange the mappings. + Special flags are set on the swap operation so that the transaction can be + rolled one more time to convert the second file's fork back to local format + so that the second file will be ready to go as soon as the ILOCK is dropped. 
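+
+The three cases in the list above reduce to a small decision.
+The enum and helper below are invented for illustration and assume exactly the
+three situations described; they are not the real repair code.
+
+.. code-block:: c
+
+    enum fork_fmt { FORK_LOCAL, FORK_BLOCKS };
+
+    enum swap_strategy {
+            SWAP_COPY_LOCAL,          /* copy incore contents, log both forks */
+            SWAP_ATOMIC,              /* regular atomic extent swap */
+            SWAP_CONVERT_THEN_ATOMIC, /* convert the local fork, swap, convert
+                                         the second file's fork back */
+    };
+
+    /* Pick a strategy for the three cases described in the list above. */
+    static enum swap_strategy choose_swap_strategy(enum fork_fmt repair_fmt,
+                                                   enum fork_fmt temp_fmt,
+                                                   int local_areas_big_enough)
+    {
+            if (repair_fmt == FORK_LOCAL && temp_fmt == FORK_LOCAL &&
+                local_areas_big_enough)
+                    return SWAP_COPY_LOCAL;
+
+            if (repair_fmt == FORK_BLOCKS && temp_fmt == FORK_BLOCKS)
+                    return SWAP_ATOMIC;
+
+            /* Exactly one fork is in local format (assumed in this sketch). */
+            return SWAP_CONVERT_THEN_ATOMIC;
+    }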
+ +Extended attributes and directories stamp the owning inode into every block, +but the buffer verifiers do not actually check the inode number! +Although there is no verification, it is still important to maintain +referential integrity, so prior to performing the extent swap, online repair +builds every block in the new data structure with the owner field of the file +being repaired. + +After a successful swap operation, the repair operation must reap the old fork +blocks by processing each fork mapping through the standard :ref:`file extent +reaping ` mechanism that is done post-repair. +If the filesystem should go down during the reap part of the repair, the +iunlink processing at the end of recovery will free both the temporary file and +whatever blocks were not reaped. +However, this iunlink processing omits the cross-link detection of online +repair, and is not completely foolproof. + +Swapping Temporary File Extents +``````````````````````````````` + +To repair a metadata file, online repair proceeds as follows: + +1. Create a temporary repair file. + +2. Use the staging data to write out new contents into the temporary repair + file. + The same fork must be written to as is being repaired. + +3. Commit the scrub transaction, since the swap estimation step must be + completed before transaction reservations are made. + +4. Call ``xrep_tempswap_trans_alloc`` to allocate a new scrub transaction with + the appropriate resource reservations, locks, and fill out a ``struct + xfs_swapext_req`` with the details of the swap operation. + +5. Call ``xrep_tempswap_contents`` to swap the contents. + +6. Commit the transaction to complete the repair. + +.. _rtsummary: + +Case Study: Repairing the Realtime Summary File +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +In the "realtime" section of an XFS filesystem, free space is tracked via a +bitmap, similar to Unix FFS. +Each bit in the bitmap represents one realtime extent, which is a multiple of +the filesystem block size between 4KiB and 1GiB in size. +The realtime summary file indexes the number of free extents of a given size to +the offset of the block within the realtime free space bitmap where those free +extents begin. +In other words, the summary file helps the allocator find free extents by +length, similar to what the free space by count (cntbt) btree does for the data +section. + +The summary file itself is a flat file (with no block headers or checksums!) +partitioned into ``log2(total rt extents)`` sections containing enough 32-bit +counters to match the number of blocks in the rt bitmap. +Each counter records the number of free extents that start in that bitmap block +and can satisfy a power-of-two allocation request. + +To check the summary file against the bitmap: + +1. Take the ILOCK of both the realtime bitmap and summary files. + +2. For each free space extent recorded in the bitmap: + + a. Compute the position in the summary file that contains a counter that + represents this free extent. + + b. Read the counter from the xfile. + + c. Increment it, and write it back to the xfile. + +3. Compare the contents of the xfile against the ondisk file. + +To repair the summary file, write the xfile contents into the temporary file +and use atomic extent swap to commit the new contents. +The temporary file is then reaped. + +The proposed patchset is the +`realtime summary repair +`_ +series. 
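+
+The counter selection in step 2a of the check above can be modeled as follows.
+The geometry fields and helper names are simplified assumptions, not the ondisk
+format; the sketch only shows how a free extent's length (taken as a
+power-of-two level) and starting bitmap block select one 32-bit counter in the
+shadow copy.
+
+.. code-block:: c
+
+    #include <stdint.h>
+
+    struct rtsum_geom {
+            uint64_t rtext_per_rbmblock; /* extents tracked per bitmap block */
+            uint64_t rbmblocks;          /* realtime bitmap size, in blocks */
+    };
+
+    /* log2 of the free extent length, i.e. the summary "level". */
+    static unsigned int rtsum_level(uint64_t rtext_len)
+    {
+            unsigned int level = 0;
+
+            while (rtext_len > 1) {
+                    rtext_len >>= 1;
+                    level++;
+            }
+            return level;
+    }
+
+    /* Index of the 32-bit counter covering a free extent in the shadow copy. */
+    static uint64_t rtsum_index(const struct rtsum_geom *g,
+                                uint64_t start_rtext, uint64_t rtext_len)
+    {
+            uint64_t rbmblock = start_rtext / g->rtext_per_rbmblock;
+
+            return (uint64_t)rtsum_level(rtext_len) * g->rbmblocks + rbmblock;
+    }
+
+    /* Step 2 of the check: bump the counter for each free extent observed. */
+    static void rtsum_observe_free_extent(const struct rtsum_geom *g,
+                                          uint32_t *shadow,
+                                          uint64_t start_rtext,
+                                          uint64_t rtext_len)
+    {
+            shadow[rtsum_index(g, start_rtext, rtext_len)]++;
+    }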
+ +Case Study: Salvaging Extended Attributes +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +In XFS, extended attributes are implemented as a namespaced name-value store. +Values are limited in size to 64KiB, but there is no limit in the number of +names. +The attribute fork is unpartitioned, which means that the root of the attribute +structure is always in logical block zero, but attribute leaf blocks, dabtree +index blocks, and remote value blocks are intermixed. +Attribute leaf blocks contain variable-sized records that associate +user-provided names with the user-provided values. +Values larger than a block are allocated separate extents and written there. +If the leaf information expands beyond a single block, a directory/attribute +btree (``dabtree``) is created to map hashes of attribute names to entries +for fast lookup. + +Salvaging extended attributes is done as follows: + +1. Walk the attr fork mappings of the file being repaired to find the attribute + leaf blocks. + When one is found, + + a. Walk the attr leaf block to find candidate keys. + When one is found, + + 1. Check the name for problems, and ignore the name if there are. + + 2. Retrieve the value. + If that succeeds, add the name and value to the staging xfarray and + xfblob. + +2. If the memory usage of the xfarray and xfblob exceed a certain amount of + memory or there are no more attr fork blocks to examine, unlock the file and + add the staged extended attributes to the temporary file. + +3. Use atomic extent swapping to exchange the new and old extended attribute + structures. + The old attribute blocks are now attached to the temporary file. + +4. Reap the temporary file. + +The proposed patchset is the +`extended attribute repair +`_ +series. + +Fixing Directories +------------------ + +Fixing directories is difficult with currently available filesystem features, +since directory entries are not redundant. +The offline repair tool scans all inodes to find files with nonzero link count, +and then it scans all directories to establish parentage of those linked files. +Damaged files and directories are zapped, and files with no parent are +moved to the ``/lost+found`` directory. +It does not try to salvage anything. + +The best that online repair can do at this time is to read directory data +blocks and salvage any dirents that look plausible, correct link counts, and +move orphans back into the directory tree. +The salvage process is discussed in the case study at the end of this section. +The :ref:`file link count fsck ` code takes care of fixing link counts +and moving orphans to the ``/lost+found`` directory. + +Case Study: Salvaging Directories +````````````````````````````````` + +Unlike extended attributes, directory blocks are all the same size, so +salvaging directories is straightforward: + +1. Find the parent of the directory. + If the dotdot entry is not unreadable, try to confirm that the alleged + parent has a child entry pointing back to the directory being repaired. + Otherwise, walk the filesystem to find it. + +2. Walk the first partition of data fork of the directory to find the directory + entry data blocks. + When one is found, + + a. Walk the directory data block to find candidate entries. + When an entry is found: + + i. Check the name for problems, and ignore the name if there are. + + ii. Retrieve the inumber and grab the inode. + If that succeeds, add the name, inode number, and file type to the + staging xfarray and xblob. + +3. 
If the memory usage of the xfarray and xfblob exceed a certain amount of + memory or there are no more directory data blocks to examine, unlock the + directory and add the staged dirents into the temporary directory. + Truncate the staging files. + +4. Use atomic extent swapping to exchange the new and old directory structures. + The old directory blocks are now attached to the temporary file. + +5. Reap the temporary file. + +**Future Work Question**: Should repair revalidate the dentry cache when +rebuilding a directory? + +*Answer*: Yes, it should. + +In theory it is necessary to scan all dentry cache entries for a directory to +ensure that one of the following apply: + +1. The cached dentry reflects an ondisk dirent in the new directory. + +2. The cached dentry no longer has a corresponding ondisk dirent in the new + directory and the dentry can be purged from the cache. + +3. The cached dentry no longer has an ondisk dirent but the dentry cannot be + purged. + This is the problem case. + +Unfortunately, the current dentry cache design doesn't provide a means to walk +every child dentry of a specific directory, which makes this a hard problem. +There is no known solution. + +The proposed patchset is the +`directory repair +`_ +series. + +Parent Pointers +``````````````` + +A parent pointer is a piece of file metadata that enables a user to locate the +file's parent directory without having to traverse the directory tree from the +root. +Without them, reconstruction of directory trees is hindered in much the same +way that the historic lack of reverse space mapping information once hindered +reconstruction of filesystem space metadata. +The parent pointer feature, however, makes total directory reconstruction +possible. + +XFS parent pointers include the dirent name and location of the entry within +the parent directory. +In other words, child files use extended attributes to store pointers to +parents in the form ``(parent_inum, parent_gen, dirent_pos) → (dirent_name)``. +The directory checking process can be strengthened to ensure that the target of +each dirent also contains a parent pointer pointing back to the dirent. +Likewise, each parent pointer can be checked by ensuring that the target of +each parent pointer is a directory and that it contains a dirent matching +the parent pointer. +Both online and offline repair can use this strategy. + +**Note**: The ondisk format of parent pointers is not yet finalized. + ++--------------------------------------------------------------------------+ +| **Historical Sidebar**: | ++--------------------------------------------------------------------------+ +| Directory parent pointers were first proposed as an XFS feature more | +| than a decade ago by SGI. | +| Each link from a parent directory to a child file is mirrored with an | +| extended attribute in the child that could be used to identify the | +| parent directory. | +| Unfortunately, this early implementation had major shortcomings and was | +| never merged into Linux XFS: | +| | +| 1. The XFS codebase of the late 2000s did not have the infrastructure to | +| enforce strong referential integrity in the directory tree. | +| It did not guarantee that a change in a forward link would always be | +| followed up with the corresponding change to the reverse links. | +| | +| 2. Referential integrity was not integrated into offline repair. | +| Checking and repairs were performed on mounted filesystems without | +| taking any kernel or inode locks to coordinate access. 
| +| It is not clear how this actually worked properly. | +| | +| 3. The extended attribute did not record the name of the directory entry | +| in the parent, so the SGI parent pointer implementation cannot be | +| used to reconnect the directory tree. | +| | +| 4. Extended attribute forks only support 65,536 extents, which means | +| that parent pointer attribute creation is likely to fail at some | +| point before the maximum file link count is achieved. | +| | +| The original parent pointer design was too unstable for something like | +| a file system repair to depend on. | +| Allison Henderson, Chandan Babu, and Catherine Hoang are working on a | +| second implementation that solves all shortcomings of the first. | +| During 2022, Allison introduced log intent items to track physical | +| manipulations of the extended attribute structures. | +| This solves the referential integrity problem by making it possible to | +| commit a dirent update and a parent pointer update in the same | +| transaction. | +| Chandan increased the maximum extent counts of both data and attribute | +| forks, thereby ensuring that the extended attribute structure can grow | +| to handle the maximum hardlink count of any file. | ++--------------------------------------------------------------------------+ + +Case Study: Repairing Directories with Parent Pointers +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Directory rebuilding uses a :ref:`coordinated inode scan ` and +a :ref:`directory entry live update hook ` as follows: + +1. Set up a temporary directory for generating the new directory structure, + an xfblob for storing entry names, and an xfarray for stashing directory + updates. + +2. Set up an inode scanner and hook into the directory entry code to receive + updates on directory operations. + +3. For each parent pointer found in each file scanned, decide if the parent + pointer references the directory of interest. + If so: + + a. Stash an addname entry for this dirent in the xfarray for later. + + b. When finished scanning that file, flush the stashed updates to the + temporary directory. + +4. For each live directory update received via the hook, decide if the child + has already been scanned. + If so: + + a. Stash an addname or removename entry for this dirent update in the + xfarray for later. + We cannot write directly to the temporary directory because hook + functions are not allowed to modify filesystem metadata. + Instead, we stash updates in the xfarray and rely on the scanner thread + to apply the stashed updates to the temporary directory. + +5. When the scan is complete, atomically swap the contents of the temporary + directory and the directory being repaired. + The temporary directory now contains the damaged directory structure. + +6. Reap the temporary directory. + +7. Update the dirent position field of parent pointers as necessary. + This may require the queuing of a substantial number of xattr log intent + items. + +The proposed patchset is the +`parent pointers directory repair +`_ +series. + +**Unresolved Question**: How will repair ensure that the ``dirent_pos`` fields +match in the reconstructed directory? + +*Answer*: There are a few ways to solve this problem: + +1. The field could be designated advisory, since the other three values are + sufficient to find the entry in the parent. + However, this makes indexed key lookup impossible while repairs are ongoing. + +2. 
We could allow creating directory entries at specified offsets, which solves + the referential integrity problem but runs the risk that dirent creation + will fail due to conflicts with the free space in the directory. + + These conflicts could be resolved by appending the directory entry and + amending the xattr code to support updating an xattr key and reindexing the + dabtree, though this would have to be performed with the parent directory + still locked. + +3. Same as above, but remove the old parent pointer entry and add a new one + atomically. + +4. Change the ondisk xattr format to ``(parent_inum, name) → (parent_gen)``, + which would provide the attr name uniqueness that we require, without + forcing repair code to update the dirent position. + Unfortunately, this requires changes to the xattr code to support attr + names as long as 263 bytes. + +5. Change the ondisk xattr format to ``(parent_inum, hash(name)) → + (name, parent_gen)``. + If the hash is sufficiently resistant to collisions (e.g. sha256) then + this should provide the attr name uniqueness that we require. + Names shorter than 247 bytes could be stored directly. + +Discussion is ongoing under the `parent pointers patch deluge +`_. + +Case Study: Repairing Parent Pointers +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Online reconstruction of a file's parent pointer information works similarly to +directory reconstruction: + +1. Set up a temporary file for generating a new extended attribute structure, + an `xfblob` for storing parent pointer names, and an xfarray for + stashing parent pointer updates. + +2. Set up an inode scanner and hook into the directory entry code to receive + updates on directory operations. + +3. For each directory entry found in each directory scanned, decide if the + dirent references the file of interest. + If so: + + a. Stash an addpptr entry for this parent pointer in the xfblob and xfarray + for later. + + b. When finished scanning the directory, flush the stashed updates to the + temporary directory. + +4. For each live directory update received via the hook, decide if the parent + has already been scanned. + If so: + + a. Stash an addpptr or removepptr entry for this dirent update in the + xfarray for later. + We cannot write parent pointers directly to the temporary file because + hook functions are not allowed to modify filesystem metadata. + Instead, we stash updates in the xfarray and rely on the scanner thread + to apply the stashed parent pointer updates to the temporary file. + +5. Copy all non-parent pointer extended attributes to the temporary file. + +6. When the scan is complete, atomically swap the attribute fork of the + temporary file and the file being repaired. + The temporary file now contains the damaged extended attribute structure. + +7. Reap the temporary file. + +The proposed patchset is the +`parent pointers repair +`_ +series. + +Digression: Offline Checking of Parent Pointers +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Examining parent pointers in offline repair works differently because corrupt +files are erased long before directory tree connectivity checks are performed. +Parent pointer checks are therefore a second pass to be added to the existing +connectivity checks: + +1. After the set of surviving files has been established (i.e. phase 6), + walk the surviving directories of each AG in the filesystem. + This is already performed as part of the connectivity checks. + +2. 
For each directory entry found, record the name in an xfblob, and store + ``(child_ag_inum, parent_inum, parent_gen, dirent_pos)`` tuples in a + per-AG in-memory slab. + +3. For each AG in the filesystem, + + a. Sort the per-AG tuples in order of child_ag_inum, parent_inum, and + dirent_pos. + + b. For each inode in the AG, + + 1. Scan the inode for parent pointers. + Record the names in a per-file xfblob, and store ``(parent_inum, + parent_gen, dirent_pos)`` tuples in a per-file slab. + + 2. Sort the per-file tuples in order of parent_inum, and dirent_pos. + + 3. Position one slab cursor at the start of the inode's records in the + per-AG tuple slab. + This should be trivial since the per-AG tuples are in child inumber + order. + + 4. Position a second slab cursor at the start of the per-file tuple slab. + + 5. Iterate the two cursors in lockstep, comparing the parent_ino and + dirent_pos fields of the records under each cursor. + + a. Tuples in the per-AG list but not the per-file list are missing and + need to be written to the inode. + + b. Tuples in the per-file list but not the per-AG list are dangling + and need to be removed from the inode. + + c. For tuples in both lists, update the parent_gen and name components + of the parent pointer if necessary. + +4. Move on to examining link counts, as we do today. + +The proposed patchset is the +`offline parent pointers repair +`_ +series. + +Rebuilding directories from parent pointers in offline repair is very +challenging because it currently uses a single-pass scan of the filesystem +during phase 3 to decide which files are corrupt enough to be zapped. +This scan would have to be converted into a multi-pass scan: + +1. The first pass of the scan zaps corrupt inodes, forks, and attributes + much as it does now. + Corrupt directories are noted but not zapped. + +2. The next pass records parent pointers pointing to the directories noted + as being corrupt in the first pass. + This second pass may have to happen after the phase 4 scan for duplicate + blocks, if phase 4 is also capable of zapping directories. + +3. The third pass resets corrupt directories to an empty shortform directory. + Free space metadata has not been ensured yet, so repair cannot yet use the + directory building code in libxfs. + +4. At the start of phase 6, space metadata have been rebuilt. + Use the parent pointer information recorded during step 2 to reconstruct + the dirents and add them to the now-empty directories. + +This code has not yet been constructed. + +.. _orphanage: + +The Orphanage +------------- + +Filesystems present files as a directed, and hopefully acyclic, graph. +In other words, a tree. +The root of the filesystem is a directory, and each entry in a directory points +downwards either to more subdirectories or to non-directory files. +Unfortunately, a disruption in the directory graph pointers result in a +disconnected graph, which makes files impossible to access via regular path +resolution. + +Without parent pointers, the directory parent pointer online scrub code can +detect a dotdot entry pointing to a parent directory that doesn't have a link +back to the child directory and the file link count checker can detect a file +that isn't pointed to by any directory in the filesystem. +If such a file has a positive link count, the file is an orphan. + +With parent pointers, directories can be rebuilt by scanning parent pointers +and parent pointers can be rebuilt by scanning directories. 
+This should reduce the incidence of files ending up in ``/lost+found``.
+
+When orphans are found, they should be reconnected to the directory tree.
+Offline fsck solves the problem by creating a directory ``/lost+found`` to
+serve as an orphanage, and linking orphan files into the orphanage by using the
+inumber as the name.
+Reparenting a file to the orphanage does not reset any of its permissions or
+ACLs.
+
+This process is more involved in the kernel than it is in userspace.
+The directory and file link count repair setup functions must use the regular
+VFS mechanisms to create the orphanage directory with all the necessary
+security attributes and dentry cache entries, just like a regular directory
+tree modification.
+
+Orphaned files are adopted by the orphanage as follows:
+
+1. Call ``xrep_orphanage_try_create`` at the start of the scrub setup function
+   to try to ensure that the lost and found directory actually exists.
+   This also attaches the orphanage directory to the scrub context.
+
+2. If the decision is made to reconnect a file, take the IOLOCK of both the
+   orphanage and the file being reattached.
+   The ``xrep_orphanage_iolock_two`` function follows the inode locking
+   strategy discussed earlier.
+
+3. Call ``xrep_orphanage_compute_blkres`` and ``xrep_orphanage_compute_name``
+   to compute the new name in the orphanage and the block reservation required.
+
+4. Use ``xrep_orphanage_adoption_prep`` to reserve resources to the repair
+   transaction.
+
+5. Call ``xrep_orphanage_adopt`` to reparent the orphaned file into the lost
+   and found, and update the kernel dentry cache.
+
+The proposed patches are in the
+`orphanage adoption
+`_
+series.
+
+6. Userspace Algorithms and Data Structures
+===========================================
+
+This section discusses the key algorithms and data structures of the userspace
+program, ``xfs_scrub``, that provide the ability to drive metadata checks and
+repairs in the kernel, verify file data, and look for other potential problems.
+
+.. _scrubcheck:
+
+Checking Metadata
+-----------------
+
+Recall the :ref:`phases of fsck work` outlined earlier.
+That structure follows naturally from the data dependencies designed into the
+filesystem from its beginnings in 1993.
+In XFS, there are several groups of metadata dependencies:
+
+a. Filesystem summary counts depend on consistency within the inode indices,
+   the allocation group space btrees, and the realtime volume space
+   information.
+
+b. Quota resource counts depend on consistency within the quota file data
+   forks, inode indices, inode records, and the forks of every file on the
+   system.
+
+c. The naming hierarchy depends on consistency within the directory and
+   extended attribute structures.
+   This includes file link counts.
+
+d. Directories, extended attributes, and file data depend on consistency within
+   the file forks that map directory and extended attribute data to physical
+   storage media.
+
+e. The file forks depend on consistency within inode records and the space
+   metadata indices of the allocation groups and the realtime volume.
+   This includes quota and realtime metadata files.
+
+f. Inode records depend on consistency within the inode metadata indices.
+
+g. Realtime space metadata depend on the inode records and data forks of the
+   realtime metadata inodes.
+
+h. The allocation group metadata indices (free space, inodes, reference count,
+   and reverse mapping btrees) depend on consistency within the AG headers and
+   between all the AG metadata btrees.
+
+i. ``xfs_scrub`` depends on the filesystem being mounted and kernel support
+   for online fsck functionality.
+
+Therefore, a metadata dependency graph is a convenient way to schedule checking
+operations in the ``xfs_scrub`` program:
+
+- Phase 1 checks that the provided path maps to an XFS filesystem and detects
+  the kernel's scrubbing abilities, which validates group (i).
+
+- Phase 2 scrubs groups (g) and (h) in parallel using a threaded workqueue.
+
+- Phase 3 scans inodes in parallel.
+  For each inode, groups (f), (e), and (d) are checked, in that order.
+
+- Phase 4 repairs everything in groups (i) through (d) so that phases 5 and 6
+  may run reliably.
+
+- Phase 5 starts by checking groups (b) and (c) in parallel before moving on
+  to checking names.
+
+- Phase 6 depends on groups (i) through (b) to find file data blocks to verify,
+  to read them, and to report which blocks of which files are affected.
+
+- Phase 7 checks group (a), having validated everything else.
+
+Notice that the data dependencies between groups are enforced by the structure
+of the program flow.
+
+Parallel Inode Scans
+--------------------
+
+An XFS filesystem can easily contain hundreds of millions of inodes.
+Given that XFS targets installations with large high-performance storage,
+it is desirable to scrub inodes in parallel to minimize runtime, particularly
+if the program has been invoked manually from a command line.
+This requires careful scheduling to keep the threads as evenly loaded as
+possible.
+
+Early iterations of the ``xfs_scrub`` inode scanner naïvely created a single
+workqueue and scheduled a single workqueue item per AG.
+Each workqueue item walked the inode btree (with ``XFS_IOC_INUMBERS``) to find
+inode chunks and then called bulkstat (``XFS_IOC_BULKSTAT``) to gather enough
+information to construct file handles.
+The file handle was then passed to a function to generate scrub items for each
+metadata object of each inode.
+This simple algorithm leads to thread balancing problems in phase 3 if the
+filesystem contains one AG with a few large sparse files and the rest of the
+AGs contain many smaller files.
+The inode scan dispatch function was not sufficiently granular; it should have
+been dispatching at the level of individual inodes, or, to constrain memory
+consumption, inode btree records.
+
+Thanks to Dave Chinner, bounded workqueues in userspace enable ``xfs_scrub`` to
+avoid this problem with ease by adding a second workqueue.
+Just like before, the first workqueue is seeded with one workqueue item per AG,
+and it uses INUMBERS to find inode btree chunks.
+The second workqueue, however, is configured with an upper bound on the number
+of items that can be waiting to be run.
+Each inode btree chunk found by the first workqueue's workers is queued to the
+second workqueue, and it is this second workqueue that queries BULKSTAT,
+creates a file handle, and passes it to a function to generate scrub items for
+each metadata object of each inode.
+If the second workqueue is too full, the workqueue add function blocks the
+first workqueue's workers until the backlog eases.
+This doesn't completely solve the balancing problem, but reduces it enough to
+move on to more pressing issues.
+
+The proposed patchsets are the scrub
+`performance tweaks
+`_
+and the
+`inode scan rebalance
+`_
+series.
+
+.. 
_scrubrepair: + +Scheduling Repairs +------------------ + +During phase 2, corruptions and inconsistencies reported in any AGI header or +inode btree are repaired immediately, because phase 3 relies on proper +functioning of the inode indices to find inodes to scan. +Failed repairs are rescheduled to phase 4. +Problems reported in any other space metadata are deferred to phase 4. +Optimization opportunities are always deferred to phase 4, no matter their +origin. + +During phase 3, corruptions and inconsistencies reported in any part of a +file's metadata are repaired immediately if all space metadata were validated +during phase 2. +Repairs that fail or cannot be repaired immediately are scheduled for phase 4. + +In the original design of ``xfs_scrub``, it was thought that repairs would be +so infrequent that the ``struct xfs_scrub_metadata`` objects used to +communicate with the kernel could also be used as the primary object to +schedule repairs. +With recent increases in the number of optimizations possible for a given +filesystem object, it became much more memory-efficient to track all eligible +repairs for a given filesystem object with a single repair item. +Each repair item represents a single lockable object -- AGs, metadata files, +individual inodes, or a class of summary information. + +Phase 4 is responsible for scheduling a lot of repair work in as quick a +manner as is practical. +The :ref:`data dependencies ` outlined earlier still apply, which +means that ``xfs_scrub`` must try to complete the repair work scheduled by +phase 2 before trying repair work scheduled by phase 3. +The repair process is as follows: + +1. Start a round of repair with a workqueue and enough workers to keep the CPUs + as busy as the user desires. + + a. For each repair item queued by phase 2, + + i. Ask the kernel to repair everything listed in the repair item for a + given filesystem object. + + ii. Make a note if the kernel made any progress in reducing the number + of repairs needed for this object. + + iii. If the object no longer requires repairs, revalidate all metadata + associated with this object. + If the revalidation succeeds, drop the repair item. + If not, requeue the item for more repairs. + + b. If any repairs were made, jump back to 1a to retry all the phase 2 items. + + c. For each repair item queued by phase 3, + + i. Ask the kernel to repair everything listed in the repair item for a + given filesystem object. + + ii. Make a note if the kernel made any progress in reducing the number + of repairs needed for this object. + + iii. If the object no longer requires repairs, revalidate all metadata + associated with this object. + If the revalidation succeeds, drop the repair item. + If not, requeue the item for more repairs. + + d. If any repairs were made, jump back to 1c to retry all the phase 3 items. + +2. If step 1 made any repair progress of any kind, jump back to step 1 to start + another round of repair. + +3. If there are items left to repair, run them all serially one more time. + Complain if the repairs were not successful, since this is the last chance + to repair anything. + +Corruptions and inconsistencies encountered during phases 5 and 7 are repaired +immediately. +Corrupt file data blocks reported by phase 6 cannot be recovered by the +filesystem. + +The proposed patchsets are the +`repair warning improvements +`_, +refactoring of the +`repair data dependency +`_ +and +`object tracking +`_, +and the +`repair scheduling +`_ +improvement series. 
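+
+To make the retry policy above concrete, the following sketch shows the shape
+of the retry loop in plain C.
+This is not the actual ``xfs_scrub`` source; the ``repair_item`` structure and
+the helper functions are hypothetical stand-ins for the kernel scrub and
+repair calls::
+
+        /*
+         * Sketch of the phase 4 retry policy.  The helpers below are
+         * placeholders for the real calls into the kernel.
+         */
+        #include <stdbool.h>
+        #include <stddef.h>
+
+        struct repair_item {
+                struct repair_item      *next;
+                bool                    clean;  /* no repairs needed anymore */
+        };
+
+        /* Ask the kernel to repair this object; true if progress was made. */
+        static bool repair_item_attempt(struct repair_item *ri)
+        {
+                (void)ri;
+                return false;
+        }
+
+        /* Re-run the scrubbers for this object; true if it is now clean. */
+        static bool repair_item_revalidate(struct repair_item *ri)
+        {
+                (void)ri;
+                return true;
+        }
+
+        /* One pass over a list of repair items; true if anything improved. */
+        static bool repair_list_pass(struct repair_item *list)
+        {
+                struct repair_item      *ri;
+                bool                    progress = false;
+
+                for (ri = list; ri != NULL; ri = ri->next) {
+                        if (ri->clean)
+                                continue;
+                        if (repair_item_attempt(ri))
+                                progress = true;
+                        if (repair_item_revalidate(ri))
+                                ri->clean = true;       /* drop the item */
+                }
+                return progress;
+        }
+
+        /* Retry phase 2 items, then phase 3 items, while progress is made. */
+        void repair_everything(struct repair_item *phase2_items,
+                               struct repair_item *phase3_items)
+        {
+                bool                    progress;
+
+                do {
+                        progress = false;
+                        while (repair_list_pass(phase2_items))
+                                progress = true;
+                        while (repair_list_pass(phase3_items))
+                                progress = true;
+                } while (progress);
+                /* A final serial pass over the leftovers would follow here. */
+        }
+
+The real program drives these passes from the workqueue described in step 1 so
+that many objects can be repaired in parallel; only the retry structure is
+shown here.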
+
+Checking Names for Confusable Unicode Sequences
+-----------------------------------------------
+
+If ``xfs_scrub`` succeeds in validating the filesystem metadata by the end of
+phase 4, it moves on to phase 5, which checks for suspicious looking names in
+the filesystem.
+These names consist of the filesystem label, names in directory entries, and
+the names of extended attributes.
+Like most Unix filesystems, XFS imposes the sparest of constraints on the
+contents of a name:
+
+- Slashes and null bytes are not allowed in directory entries.
+
+- Null bytes are not allowed in userspace-visible extended attributes.
+
+- Null bytes are not allowed in the filesystem label.
+
+Directory entries and attribute keys store the length of the name explicitly
+ondisk, which means that nulls are not name terminators.
+For this section, the term "naming domain" refers to any place where names are
+presented together -- all the names in a directory, or all the attributes of a
+file.
+
+Although the Unix naming constraints are very permissive, the reality of most
+modern-day Linux systems is that programs work with Unicode character code
+points to support international languages.
+These programs typically encode those code points in UTF-8 when interfacing
+with the C library because the kernel expects null-terminated names.
+In the common case, therefore, names found in an XFS filesystem are actually
+UTF-8 encoded Unicode data.
+
+To maximize its expressiveness, the Unicode standard defines separate code
+points for various characters that render similarly or identically in writing
+systems around the world.
+For example, the character "Cyrillic Small Letter A" U+0430 "а" often renders
+identically to "Latin Small Letter A" U+0061 "a".
+
+The standard also permits characters to be constructed in multiple ways --
+either by using a defined code point, or by combining one code point with
+various combining marks.
+For example, the character "Angstrom Sign" U+212B "Å" can also be expressed
+as "Latin Capital Letter A" U+0041 "A" followed by "Combining Ring Above"
+U+030A "◌̊".
+Both sequences render identically.
+
+Like the standards that preceded it, Unicode also defines various control
+characters to alter the presentation of text.
+For example, the character "Right-to-Left Override" U+202E can trick some
+programs into rendering "moo\\xe2\\x80\\xaegnp.txt" as "mootxt.png".
+A second category of rendering problems involves whitespace characters.
+If the character "Zero Width Space" U+200B is encountered in a file name, the
+name will render identically to a name that does not have the zero width
+space.
+
+If two names within a naming domain have different byte sequences but render
+identically, a user may be confused by it.
+The kernel, in its indifference to upper level encoding schemes, permits this.
+Most filesystem drivers persist the byte sequence names that are given to them
+by the VFS.
+
+Techniques for detecting confusable names are explained in great detail in
+sections 4 and 5 of the
+`Unicode Security Mechanisms `_
+document.
+When ``xfs_scrub`` detects UTF-8 encoding in use on a system, it uses the
+Unicode normalization form NFD in conjunction with the confusable name
+detection component of
+`libicu `_
+to identify names within a directory or within a file's extended attributes
+that could be confused for each other.
+Names are also checked for control characters, non-rendering characters, and
+mixing of bidirectional characters.
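+
+As a rough illustration of the facility involved, the short userspace program
+below asks libicu whether two names are visually confusable.
+This is not the ``xfs_scrub`` implementation, which normalizes names to NFD and
+compares all of the names in a naming domain against each other; the file
+names and the build command in the comment are made up for the example::
+
+        /*
+         * Demonstration of the ICU spoof checker on a single pair of names.
+         * Build (assumed): cc demo.c $(pkg-config --cflags --libs icu-i18n icu-uc)
+         */
+        #include <stdio.h>
+        #include <unicode/uspoof.h>
+
+        int main(void)
+        {
+                const char      *name1 = "paypal.txt";
+                /* Same name, but with U+0430 CYRILLIC SMALL LETTER A as the "a". */
+                const char      *name2 = "p\xd0\xb0ypal.txt";
+                UErrorCode      status = U_ZERO_ERROR;
+                USpoofChecker   *sc = uspoof_open(&status);
+                int32_t         result;
+
+                if (U_FAILURE(status))
+                        return 1;
+
+                /* A nonzero result means the two UTF-8 names are confusable. */
+                result = uspoof_areConfusableUTF8(sc, name1, -1, name2, -1,
+                                                  &status);
+                if (U_SUCCESS(status) && result != 0)
+                        printf("%s and %s look alike\n", name1, name2);
+
+                uspoof_close(sc);
+                return 0;
+        }
+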
+All of these potential issues are reported to the system administrator during
+phase 5.
+
+Media Verification of File Data Extents
+---------------------------------------
+
+The system administrator can elect to initiate a media scan of all file data
+blocks.
+This scan runs as phase 6, after validation of all filesystem metadata (except
+for the summary counters).
+The scan starts by calling ``FS_IOC_GETFSMAP`` to scan the filesystem space map
+to find areas that are allocated to file data fork extents.
+Gaps between data fork extents that are smaller than 64k are treated as if
+they were data fork extents to reduce the command setup overhead.
+When the space map scan accumulates a region larger than 32MB, a media
+verification request is sent to the disk as a directio read of the raw block
+device.
+
+If the verification read fails, ``xfs_scrub`` retries with single-block reads
+to narrow down the failure to the specific region of the media, and records
+the error.
+When it has finished issuing verification requests, it again uses the space
+mapping ioctl to map the recorded media errors back to metadata structures
+and report what has been lost.
+For media errors in blocks owned by files, parent pointers can be used to
+construct file paths from inode numbers for user-friendly reporting.
+
+7. Conclusion and Future Work
+=============================
+
+It is hoped that the reader has followed the designs laid out in this document
+and now has some familiarity with how XFS performs online rebuilding of its
+metadata indices, and how filesystem users can interact with that
+functionality.
+Although the scope of this work is daunting, it is hoped that this guide will
+make it easier for code readers to understand what has been built, for whom it
+has been built, and why.
+Please feel free to contact the XFS mailing list with questions.
+
+FIEXCHANGE_RANGE
+----------------
+
+As discussed earlier, a second frontend to the atomic extent swap mechanism is
+a new ioctl call that userspace programs can use to commit updates to files
+atomically.
+This frontend has been out for review for several years now, though the
+necessary refinements to online repair and lack of customer demand mean that
+the proposal has not been pushed very hard.
+
+Extent Swapping with Regular User Files
+```````````````````````````````````````
+
+As mentioned earlier, XFS has long had the ability to swap extents between
+files, which is used almost exclusively by ``xfs_fsr`` to defragment files.
+The earliest form of this was the fork swap mechanism, where the entire
+contents of data forks could be exchanged between two files by exchanging the
+raw bytes in each inode fork's immediate area.
+When XFS v5 came along with self-describing metadata, this old mechanism grew
+some log support to continue rewriting the owner fields of BMBT blocks during
+log recovery.
+When the reverse mapping btree was later added to XFS, the only way to maintain
+the consistency of the fork mappings with the reverse mapping index was to
+develop an iterative mechanism that used deferred bmap and rmap operations to
+swap mappings one at a time.
+This mechanism is identical to steps 2-3 from the procedure above except for
+the new tracking items, because the atomic extent swap mechanism is an
+iteration of an existing mechanism and not something totally novel.
+For the narrow case of file defragmentation, the file contents must be
+identical, so the recovery guarantees are not much of a gain.
+ +Atomic extent swapping is much more flexible than the existing swapext +implementations because it can guarantee that the caller never sees a mix of +old and new contents even after a crash, and it can operate on two arbitrary +file fork ranges. +The extra flexibility enables several new use cases: + +- **Atomic commit of file writes**: A userspace process opens a file that it + wants to update. + Next, it opens a temporary file and calls the file clone operation to reflink + the first file's contents into the temporary file. + Writes to the original file should instead be written to the temporary file. + Finally, the process calls the atomic extent swap system call + (``FIEXCHANGE_RANGE``) to exchange the file contents, thereby committing all + of the updates to the original file, or none of them. + +.. _swapext_if_unchanged: + +- **Transactional file updates**: The same mechanism as above, but the caller + only wants the commit to occur if the original file's contents have not + changed. + To make this happen, the calling process snapshots the file modification and + change timestamps of the original file before reflinking its data to the + temporary file. + When the program is ready to commit the changes, it passes the timestamps + into the kernel as arguments to the atomic extent swap system call. + The kernel only commits the changes if the provided timestamps match the + original file. + +- **Emulation of atomic block device writes**: Export a block device with a + logical sector size matching the filesystem block size to force all writes + to be aligned to the filesystem block size. + Stage all writes to a temporary file, and when that is complete, call the + atomic extent swap system call with a flag to indicate that holes in the + temporary file should be ignored. + This emulates an atomic device write in software, and can support arbitrary + scattered writes. + +Vectorized Scrub +---------------- + +As it turns out, the :ref:`refactoring ` of repair items mentioned +earlier was a catalyst for enabling a vectorized scrub system call. +Since 2018, the cost of making a kernel call has increased considerably on some +systems to mitigate the effects of speculative execution attacks. +This incentivizes program authors to make as few system calls as possible to +reduce the number of times an execution path crosses a security boundary. + +With vectorized scrub, userspace pushes to the kernel the identity of a +filesystem object, a list of scrub types to run against that object, and a +simple representation of the data dependencies between the selected scrub +types. +The kernel executes as much of the caller's plan as it can until it hits a +dependency that cannot be satisfied due to a corruption, and tells userspace +how much was accomplished. +It is hoped that ``io_uring`` will pick up enough of this functionality that +online fsck can use that instead of adding a separate vectored scrub system +call to XFS. + +The relevant patchsets are the +`kernel vectorized scrub +`_ +and +`userspace vectorized scrub +`_ +series. + +Quality of Service Targets for Scrub +------------------------------------ + +One serious shortcoming of the online fsck code is that the amount of time that +it can spend in the kernel holding resource locks is basically unbounded. +Userspace is allowed to send a fatal signal to the process which will cause +``xfs_scrub`` to exit when it reaches a good stopping point, but there's no way +for userspace to provide a time budget to the kernel. 
+Given that the scrub codebase has helpers to detect fatal signals, it shouldn't
+be too much work to allow userspace to specify a timeout for a scrub/repair
+operation and abort the operation if it exceeds budget.
+However, most repair functions have the property that once they begin to touch
+ondisk metadata, the operation cannot be cancelled cleanly, after which a QoS
+timeout is no longer useful.
+
+Defragmenting Free Space
+------------------------
+
+Over the years, many XFS users have requested the creation of a program to
+clear a portion of the physical storage underlying a filesystem so that it
+becomes a contiguous chunk of free space.
+Call this free space defragmenter ``clearspace`` for short.
+
+The first piece the ``clearspace`` program needs is the ability to read the
+reverse mapping index from userspace.
+This already exists in the form of the ``FS_IOC_GETFSMAP`` ioctl.
+The second piece it needs is a new fallocate mode
+(``FALLOC_FL_MAP_FREE_SPACE``) that allocates the free space in a region and
+maps it to a file.
+Call this file the "space collector" file.
+The third piece is the ability to force an online repair.
+
+To clear all the metadata out of a portion of physical storage, clearspace
+uses the new fallocate map-freespace call to map any free space in that region
+to the space collector file.
+Next, clearspace finds all metadata blocks in that region by way of
+``GETFSMAP`` and issues forced repair requests on the data structure.
+This often results in the metadata being rebuilt somewhere that is not being
+cleared.
+After each relocation, clearspace calls the "map free space" function again to
+collect any newly freed space in the region being cleared.
+
+To clear all the file data out of a portion of the physical storage, clearspace
+uses the FSMAP information to find relevant file data blocks.
+Having identified a good target, it uses the ``FICLONERANGE`` call on that part
+of the file to try to share the physical space with a dummy file.
+Cloning the extent means that the original owners cannot overwrite the
+contents; any changes will be written somewhere else via copy-on-write.
+Clearspace makes its own copy of the frozen extent in an area that is not being
+cleared, and uses ``FIDEDUPERANGE`` (or the :ref:`atomic extent swap
+` feature) to change the target file's data extent
+mapping away from the area being cleared.
+When all other mappings have been moved, clearspace reflinks the space into the
+space collector file so that it becomes unavailable.
+
+There are further optimizations that could apply to the above algorithm.
+To clear a piece of physical storage that has a high sharing factor, it is
+strongly desirable to retain this sharing factor.
+In fact, these extents should be moved first to maximize sharing factor after
+the operation completes.
+To make this work smoothly, clearspace needs a new ioctl
+(``FS_IOC_GETREFCOUNTS``) to report reference count information to userspace.
+With the refcount information exposed, clearspace can quickly find the longest,
+most shared data extents in the filesystem, and target them first.
+
+**Future Work Question**: How might the filesystem move inode chunks?
+
+*Answer*: To move inode chunks, Dave Chinner constructed a prototype program
+that creates a new file with the old contents and then locklessly runs around
+the filesystem updating directory entries.
+The operation cannot complete if the filesystem goes down.
+That problem isn't totally insurmountable: create an inode remapping table
+hidden behind a jump label, and a log item that tracks the kernel walking the
+filesystem to update directory entries.
+The trouble is, the kernel can't do anything about open files, since it cannot
+revoke them.
+
+**Future Work Question**: Can static keys be used to minimize the cost of
+supporting ``revoke()`` on XFS files?
+
+*Answer*: Yes.
+Until the first revocation, the bailout code need not be in the call path at
+all.
+
+The relevant patchsets are the
+`kernel freespace defrag
+`_
+and
+`userspace freespace defrag
+`_
+series.
+
+Shrinking Filesystems
+---------------------
+
+Removing the end of the filesystem ought to be a simple matter of evacuating
+the data and metadata at the end of the filesystem, and handing the freed space
+to the shrink code.
+That requires an evacuation of the space at the end of the filesystem, which
+is a use of free space defragmentation!
diff --git a/Documentation/filesystems/xfs/xfs-self-describing-metadata.rst b/Documentation/filesystems/xfs/xfs-self-describing-metadata.rst
new file mode 100644
index 000000000000..a10c4ae6955e
--- /dev/null
+++ b/Documentation/filesystems/xfs/xfs-self-describing-metadata.rst
@@ -0,0 +1,353 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. _xfs_self_describing_metadata:
+
+============================
+XFS Self Describing Metadata
+============================
+
+Introduction
+============
+
+The largest scalability problem facing XFS is not one of algorithmic
+scalability, but of verification of the filesystem structure. Scalability of
+the structures and indexes on disk and the algorithms for iterating them are
+adequate for supporting PB scale filesystems with billions of inodes; however,
+it is this very scalability that causes the verification problem.
+
+Almost all metadata on XFS is dynamically allocated. The only fixed location
+metadata is the allocation group headers (SB, AGF, AGFL and AGI), while all
+other metadata structures need to be discovered by walking the filesystem
+structure in different ways. While this is already done by userspace tools for
+validating and repairing the structure, there are limits to what they can
+verify, and this in turn limits the supportable size of an XFS filesystem.
+
+For example, it is entirely possible to manually use xfs_db and a bit of
+scripting to analyse the structure of a 100TB filesystem when trying to
+determine the root cause of a corruption problem, but it is still mainly a
+manual task of verifying that things like single bit errors or misplaced writes
+weren't the ultimate cause of a corruption event. It may take a few hours to a
+few days to perform such forensic analysis, so at this scale root cause
+analysis is entirely possible.
+
+However, if we scale the filesystem up to 1PB, we now have 10x as much metadata
+to analyse and so that analysis blows out towards weeks/months of forensic work.
+Most of the analysis work is slow and tedious, so as the amount of analysis
+goes up, the more likely it is that the cause will be lost in the noise. Hence
+the primary concern for supporting PB scale filesystems is minimising the time
+and effort required for basic forensic analysis of the filesystem structure.
+
+
+Self Describing Metadata
+========================
+
+One of the problems with the current metadata format is that apart from the
+magic number in the metadata block, we have no other way of identifying what it
+is supposed to be. We can't even identify if it is in the right place. Put
+simply, you can't look at a single metadata block in isolation and say "yes, it
+is supposed to be there and the contents are valid".
+
+Hence most of the time spent on forensic analysis is spent doing basic
+verification of metadata values, looking for values that are in range (and hence
+not detected by automated verification checks) but are not correct. Finding and
+understanding how things like cross-linked block lists (e.g. sibling pointers
+in a btree that end up with loops in them) come about is the key to
+understanding what went wrong, but it is impossible to tell what order the
+blocks were linked into each other or written to disk after the fact.
+
+Hence we need to record more information into the metadata to allow us to
+quickly determine if the metadata is intact and can be ignored for the purpose
+of analysis. We can't protect against every possible type of error, but we can
+ensure that common types of errors are easily detectable. Hence the concept of
+self describing metadata.
+
+The first, fundamental requirement of self describing metadata is that the
+metadata object contains some form of unique identifier in a well known
+location. This allows us to identify the expected contents of the block and
+hence parse and verify the metadata object. If we can't independently identify
+the type of metadata in the object, then the metadata doesn't describe itself
+very well at all!
+
+Luckily, almost all XFS metadata has magic numbers embedded already - only the
+AGFL, remote symlinks and remote attribute blocks do not contain identifying
+magic numbers. Hence we can change the on-disk format of all these objects to
+add more identifying information and detect this simply by changing the magic
+numbers in the metadata objects. That is, if it has the current magic number,
+the metadata isn't self identifying. If it contains a new magic number, it is
+self identifying and we can do much more expansive automated verification of the
+metadata object at runtime, during forensic analysis or repair.
+
+As a primary concern, self describing metadata needs some form of overall
+integrity checking. We cannot trust the metadata if we cannot verify that it has
+not been changed as a result of external influences. Hence we need some form of
+integrity check, and this is done by adding CRC32c validation to the metadata
+block. If we can verify the block contains the metadata it was intended to
+contain, a large amount of the manual verification work can be skipped.
+
+CRC32c was selected as metadata cannot be more than 64k in length in XFS and
+hence a 32 bit CRC is more than sufficient to detect multi-bit errors in
+metadata blocks. CRC32c is also now hardware accelerated on common CPUs so it is
+fast. So while CRC32c is not the strongest of possible integrity checks that
+could be used, it is more than sufficient for our needs and has relatively
+little overhead. Adding support for larger integrity fields and/or algorithms
+does not really provide any extra value over CRC32c, but it does add a lot of
+complexity and so there is no provision for changing the integrity checking
+mechanism.
+
+Self describing metadata needs to contain enough information so that the
+metadata block can be verified as being in the correct place without needing to
+look at any other metadata. This means it needs to contain location information. 
+Just adding a block number to the metadata is not sufficient to protect against
+mis-directed writes - a write might be misdirected to the wrong LUN and so be
+written to the "correct block" of the wrong filesystem. Hence location
+information must contain a filesystem identifier as well as a block number.
+
+Another key information point in forensic analysis is knowing who the metadata
+block belongs to. We already know the type, the location, that it is valid
+and/or corrupted, and how long ago it was last modified. Knowing the owner
+of the block is important as it allows us to find other related metadata to
+determine the scope of the corruption. For example, if we have an extent btree
+object, we don't know what inode it belongs to and hence have to walk the entire
+filesystem to find the owner of the block. Worse, the corruption could mean that
+no owner can be found (i.e. it's an orphan block), and so without an owner field
+in the metadata we have no idea of the scope of the corruption. If we have an
+owner field in the metadata object, we can immediately do top down validation to
+determine the scope of the problem.
+
+Different types of metadata have different owner identifiers. For example,
+directory, attribute and extent tree blocks are all owned by an inode, while
+freespace btree blocks are owned by an allocation group. Hence the size and
+contents of the owner field are determined by the type of metadata object we are
+looking at. The owner information can also identify misplaced writes (e.g.
+freespace btree block written to the wrong AG).
+
+Self describing metadata also needs to contain some indication of when it was
+written to the filesystem. One of the key information points when doing forensic
+analysis is how recently the block was modified. Correlation of a set of
+corrupted metadata blocks based on modification times is important as it can
+indicate whether the corruptions are related, whether there have been multiple
+corruption events that led to the eventual failure, and even whether there are
+corruptions present that the run-time verification is not detecting.
+
+For example, we can determine whether a metadata object is supposed to be free
+space or still allocated if it is still referenced by its owner by looking at
+when the free space btree block that contains the block was last written
+compared to when the metadata object itself was last written. If the free space
+block is more recent than the object and the object's owner, then there is a
+very good chance that the block should have been removed from the owner.
+
+To provide this "written timestamp", each metadata block gets the Log Sequence
+Number (LSN) of the most recent transaction it was modified on written into it.
+This number will always increase over the life of the filesystem, and the only
+thing that resets it is running xfs_repair on the filesystem. Further, by use of
+the LSN we can tell if the corrupted metadata all belonged to the same log
+checkpoint and hence have some idea of how much modification occurred between
+the first and last instance of corrupt metadata on disk and, further, how much
+modification occurred between the corruption being written and when it was
+detected.
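+
+The "is this block free" heuristic above amounts to a simple comparison of the
+ondisk LSN stamps. The helper below is not taken from any of the XFS tools -
+the field layout is hypothetical and only the big-endian LSN values are
+considered::
+
+        /* Sketch of the forensic LSN comparison; not from the XFS codebase. */
+        #include <endian.h>
+        #include <stdbool.h>
+        #include <stdint.h>
+
+        /*
+         * Ondisk LSNs are stored big-endian and only ever increase (unless
+         * xfs_repair resets them), so a larger value means a later write.
+         */
+        bool object_probably_freed(uint64_t obj_lsn_be, uint64_t owner_lsn_be,
+                                   uint64_t freesp_lsn_be)
+        {
+                uint64_t        freesp = be64toh(freesp_lsn_be);
+
+                return freesp > be64toh(obj_lsn_be) &&
+                       freesp > be64toh(owner_lsn_be);
+        }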
+
+Runtime Validation
+==================
+
+Validation of self-describing metadata takes place at runtime in two places:
+
+ - immediately after a successful read from disk
+ - immediately prior to write IO submission
+
+The verification is completely stateless - it is done independently of the
+modification process, and seeks only to check that the metadata is what it says
+it is and that the metadata fields are within bounds and internally consistent.
+As such, we cannot catch all types of corruption that can occur within a block
+as there may be certain limitations that operational state enforces on the
+metadata, or there may be corruption of interblock relationships (e.g. corrupted
+sibling pointer lists). Hence we still need stateful checking in the main code
+body, but in general most of the per-field validation is handled by the
+verifiers.
+
+For read verification, the caller needs to specify the expected type of metadata
+that it should see, and the IO completion process verifies that the metadata
+object matches what was expected. If the verification process fails, then it
+marks the object being read as EFSCORRUPTED. The caller needs to catch this
+error (same as for IO errors), and if it needs to take special action due to a
+verification error it can do so by catching the EFSCORRUPTED error value. If we
+need more discrimination of error type at higher levels, we can define new
+error numbers for different errors as necessary.
+
+The first step in read verification is checking the magic number and determining
+whether CRC validation is necessary. If it is, the CRC32c is calculated and
+compared against the value stored in the object itself. Once this is validated,
+further checks are made against the location information, followed by extensive
+object specific metadata validation. If any of these checks fail, then the
+buffer is considered corrupt and the EFSCORRUPTED error is set appropriately.
+
+Write verification is the opposite of the read verification - first the object
+is extensively verified and if it is OK we then update the LSN from the last
+modification made to the object. After this, we calculate the CRC and insert it
+into the object. Once this is done the write IO is allowed to continue. If any
+error occurs during this process, the buffer is again marked with an
+EFSCORRUPTED error for the higher layers to catch.
+
+Structures
+==========
+
+A typical on-disk structure needs to contain the following information::
+
+        struct xfs_ondisk_hdr {
+                __be32  magic;          /* magic number */
+                __be32  crc;            /* CRC, not logged */
+                uuid_t  uuid;           /* filesystem identifier */
+                __be64  owner;          /* parent object */
+                __be64  blkno;          /* location on disk */
+                __be64  lsn;            /* last modification in log, not logged */
+        };
+
+Depending on the metadata, this information may be part of a header structure
+separate to the metadata contents, or may be distributed through an existing
+structure. The latter occurs with metadata that already contains some of this
+information, such as the superblock and AG headers.
+
+Other metadata may have different formats for the information, but the same
+level of information is generally provided. For example:
+
+ - short btree blocks have a 32 bit owner (ag number) and a 32 bit block
+   number for location. The two of these combined provide the same
+   information as @owner and @blkno in the above structure, but using 8
+   bytes less space on disk.
+
+ - directory/attribute node blocks have a 16 bit magic number, and the
+   header that contains the magic number has other information in it as
+   well. Hence the additional metadata headers change the overall format
+   of the metadata.
+
+A typical buffer read verifier is structured as follows::
+
+        #define XFS_FOO_CRC_OFF         offsetof(struct xfs_ondisk_hdr, crc)
+
+        static void
+        xfs_foo_read_verify(
+                struct xfs_buf  *bp)
+        {
+                struct xfs_mount *mp = bp->b_mount;
+
+                if ((xfs_sb_version_hascrc(&mp->m_sb) &&
+                     !xfs_verify_cksum(bp->b_addr, BBTOB(bp->b_length),
+                                       XFS_FOO_CRC_OFF)) ||
+                    !xfs_foo_verify(bp)) {
+                        XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr);
+                        xfs_buf_ioerror(bp, EFSCORRUPTED);
+                }
+        }
+
+The code ensures that the CRC is only checked if the filesystem has CRCs enabled
+by checking the superblock for the feature bit, and then if the CRC verifies OK
+(or is not needed) it verifies the actual contents of the block.
+
+The verifier function will take a couple of different forms, depending on
+whether the magic number can be used to determine the format of the block. In
+the case it can't, the code is structured as follows::
+
+        static bool
+        xfs_foo_verify(
+                struct xfs_buf  *bp)
+        {
+                struct xfs_mount        *mp = bp->b_mount;
+                struct xfs_ondisk_hdr   *hdr = bp->b_addr;
+
+                if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC))
+                        return false;
+
+                if (xfs_sb_version_hascrc(&mp->m_sb)) {
+                        if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid))
+                                return false;
+                        if (bp->b_bn != be64_to_cpu(hdr->blkno))
+                                return false;
+                        if (hdr->owner == 0)
+                                return false;
+                }
+
+                /* object specific verification checks here */
+
+                return true;
+        }
+
+If there are different magic numbers for the different formats, the verifier
+will look like::
+
+        static bool
+        xfs_foo_verify(
+                struct xfs_buf  *bp)
+        {
+                struct xfs_mount        *mp = bp->b_mount;
+                struct xfs_ondisk_hdr   *hdr = bp->b_addr;
+
+                if (hdr->magic == cpu_to_be32(XFS_FOO_CRC_MAGIC)) {
+                        if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid))
+                                return false;
+                        if (bp->b_bn != be64_to_cpu(hdr->blkno))
+                                return false;
+                        if (hdr->owner == 0)
+                                return false;
+                } else if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC))
+                        return false;
+
+                /* object specific verification checks here */
+
+                return true;
+        }
+
+Write verifiers are very similar to the read verifiers; they just do things in
+the opposite order to the read verifiers. A typical write verifier::
+
+        static void
+        xfs_foo_write_verify(
+                struct xfs_buf  *bp)
+        {
+                struct xfs_mount        *mp = bp->b_mount;
+                struct xfs_buf_log_item *bip = bp->b_fspriv;
+
+                if (!xfs_foo_verify(bp)) {
+                        XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr);
+                        xfs_buf_ioerror(bp, EFSCORRUPTED);
+                        return;
+                }
+
+                if (!xfs_sb_version_hascrc(&mp->m_sb))
+                        return;
+
+
+                if (bip) {
+                        struct xfs_ondisk_hdr   *hdr = bp->b_addr;
+                        hdr->lsn = cpu_to_be64(bip->bli_item.li_lsn);
+                }
+                xfs_update_cksum(bp->b_addr, BBTOB(bp->b_length), XFS_FOO_CRC_OFF);
+        }
+
+This will verify the internal structure of the metadata before we go any
+further, detecting corruptions that have occurred as the metadata has been
+modified in memory. If the metadata verifies OK, and CRCs are enabled, we then
+update the LSN field (when it was last modified) and calculate the CRC on the
+metadata. Once this is done, we can issue the IO.
+
+Inodes and Dquots
+=================
+
+Inodes and dquots are special snowflakes. They have per-object CRC and
+self-identifiers, but they are packed so that there are multiple objects per
+buffer. 
Hence we do not use per-buffer verifiers to do the work of per-object +verification and CRC calculations. The per-buffer verifiers simply perform basic +identification of the buffer - that they contain inodes or dquots, and that +there are magic numbers in all the expected spots. All further CRC and +verification checks are done when each inode is read from or written back to the +buffer. + +The structure of the verifiers and the identifiers checks is very similar to the +buffer code described above. The only difference is where they are called. For +example, inode read verification is done in xfs_inode_from_disk() when the inode +is first read out of the buffer and the struct xfs_inode is instantiated. The +inode is already extensively verified during writeback in xfs_iflush_int, so the +only addition here is to add the LSN and CRC to the inode as it is copied back +into the buffer. + +XXX: inode unlinked list modification doesn't recalculate the inode CRC! None of +the unlinked list modifications check or update CRCs, neither during unlink nor +log recovery. So, it's gone unnoticed until now. This won't matter immediately - +repair will probably complain about it - but it needs to be fixed. diff --git a/Documentation/maintainer/maintainer-entry-profile.rst b/Documentation/maintainer/maintainer-entry-profile.rst index 7ad4bfc2cc03..18cee1edaecb 100644 --- a/Documentation/maintainer/maintainer-entry-profile.rst +++ b/Documentation/maintainer/maintainer-entry-profile.rst @@ -105,4 +105,4 @@ to do something different in the near future. ../driver-api/media/maintainer-entry-profile ../driver-api/vfio-pci-device-specific-driver-acceptance ../nvme/feature-and-quirk-policy - ../filesystems/xfs-maintainer-entry-profile + ../filesystems/xfs/xfs-maintainer-entry-profile diff --git a/MAINTAINERS b/MAINTAINERS index 788be9ab5b73..76896b511c6a 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -23894,10 +23894,10 @@ S: Supported W: http://xfs.org/ C: irc://irc.oftc.net/xfs T: git git://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git -P: Documentation/filesystems/xfs-maintainer-entry-profile.rst +P: Documentation/filesystems/xfs/xfs-maintainer-entry-profile.rst F: Documentation/ABI/testing/sysfs-fs-xfs F: Documentation/admin-guide/xfs.rst -F: Documentation/filesystems/xfs-* +F: Documentation/filesystems/xfs/* F: fs/xfs/ F: include/uapi/linux/dqblk_xfs.h F: include/uapi/linux/fsmap.h -- cgit