Diffstat (limited to 'Documentation/filesystems')
-rw-r--r--  Documentation/filesystems/9p.rst | 60
-rw-r--r--  Documentation/filesystems/api-summary.rst | 3
-rw-r--r--  Documentation/filesystems/autofs.rst | 6
-rw-r--r--  Documentation/filesystems/bcachefs/CodingStyle.rst | 186
-rw-r--r--  Documentation/filesystems/bcachefs/SubmittingPatches.rst | 98
-rw-r--r--  Documentation/filesystems/bcachefs/errorcodes.rst | 30
-rw-r--r--  Documentation/filesystems/bcachefs/index.rst | 13
-rw-r--r--  Documentation/filesystems/buffer.rst | 12
-rw-r--r--  Documentation/filesystems/caching/cachefiles.rst | 2
-rw-r--r--  Documentation/filesystems/caching/fscache.rst | 8
-rw-r--r--  Documentation/filesystems/ceph.rst | 15
-rw-r--r--  Documentation/filesystems/debugfs.rst | 12
-rw-r--r--  Documentation/filesystems/directory-locking.rst | 4
-rw-r--r--  Documentation/filesystems/dlmfs.rst | 2
-rw-r--r--  Documentation/filesystems/efivarfs.rst | 2
-rw-r--r--  Documentation/filesystems/erofs.rst | 2
-rw-r--r--  Documentation/filesystems/f2fs.rst | 127
-rw-r--r--  Documentation/filesystems/fiemap.rst | 49
-rw-r--r--  Documentation/filesystems/files.rst | 2
-rw-r--r--  Documentation/filesystems/fscrypt.rst | 27
-rw-r--r--  Documentation/filesystems/fsverity.rst | 29
-rw-r--r--  Documentation/filesystems/fuse-io-uring.rst | 99
-rw-r--r--  Documentation/filesystems/gfs2-glocks.rst | 55
-rw-r--r--  Documentation/filesystems/idmappings.rst | 12
-rw-r--r--  Documentation/filesystems/index.rst | 6
-rw-r--r--  Documentation/filesystems/iomap/design.rst | 441
-rw-r--r--  Documentation/filesystems/iomap/index.rst | 13
-rw-r--r--  Documentation/filesystems/iomap/operations.rst | 728
-rw-r--r--  Documentation/filesystems/iomap/porting.rst | 120
-rw-r--r--  Documentation/filesystems/journalling.rst | 6
-rw-r--r--  Documentation/filesystems/locking.rst | 15
-rw-r--r--  Documentation/filesystems/mount_api.rst | 12
-rw-r--r--  Documentation/filesystems/multigrain-ts.rst | 125
-rw-r--r--  Documentation/filesystems/netfs_library.rst | 3
-rw-r--r--  Documentation/filesystems/nfs/exporting.rst | 7
-rw-r--r--  Documentation/filesystems/nfs/index.rst | 1
-rw-r--r--  Documentation/filesystems/nfs/localio.rst | 357
-rw-r--r--  Documentation/filesystems/ntfs.rst | 466
-rw-r--r--  Documentation/filesystems/overlayfs.rst | 32
-rw-r--r--  Documentation/filesystems/path-lookup.rst | 2
-rw-r--r--  Documentation/filesystems/path-lookup.txt | 2
-rw-r--r--  Documentation/filesystems/porting.rst | 31
-rw-r--r--  Documentation/filesystems/proc.rst | 138
-rw-r--r--  Documentation/filesystems/ramfs-rootfs-initramfs.rst | 2
-rw-r--r--  Documentation/filesystems/smb/ksmbd.rst | 26
-rw-r--r--  Documentation/filesystems/squashfs.rst | 14
-rw-r--r--  Documentation/filesystems/tmpfs.rst | 24
-rw-r--r--  Documentation/filesystems/vfs.rst | 55
-rw-r--r--  Documentation/filesystems/xfs/xfs-online-fsck-design.rst | 666
49 files changed, 3211 insertions, 936 deletions
diff --git a/Documentation/filesystems/9p.rst b/Documentation/filesystems/9p.rst
index 1e0e0bb6fdf9..2bbf68b56b0d 100644
--- a/Documentation/filesystems/9p.rst
+++ b/Documentation/filesystems/9p.rst
@@ -31,7 +31,7 @@ Other applications are described in the following papers:
* PROSE I/O: Using 9p to enable Application Partitions
http://plan9.escet.urjc.es/iwp9/cready/PROSE_iwp9_2006.pdf
* VirtFS: A Virtualization Aware File System pass-through
- http://goo.gl/3WPDg
+ https://kernel.org/doc/ols/2010/ols2010-pages-109-120.pdf
Usage
=====
@@ -48,11 +48,66 @@ For server running on QEMU host with virtio transport::
mount -t 9p -o trans=virtio <mount_tag> /mnt/9
-where mount_tag is the tag associated by the server to each of the exported
+where mount_tag is the tag generated by the server for each of the exported
mount points. Each 9P export is seen by the client as a virtio device with an
associated "mount_tag" property. Available mount tags can be
seen by reading /sys/bus/virtio/drivers/9pnet_virtio/virtio<n>/mount_tag files.
+USBG Usage
+==========
+
+To mount a 9p FS on a USB Host accessible via the gadget at runtime::
+
+ mount -t 9p -o trans=usbg,aname=/path/to/fs <device> /mnt/9
+
+To mount a 9p FS on a USB Host accessible via the gadget as root filesystem::
+
+ root=<device> rootfstype=9p rootflags=trans=usbg,cache=loose,uname=root,access=0,dfltuid=0,dfltgid=0,aname=/path/to/rootfs
+
+where <device> is the tag associated with the usb gadget transport.
+It is defined by the configfs instance name.
+
+USBG Example
+============
+
+The USB host exports a filesystem, while the gadget on the USB device
+side makes it mountable.
+
+Diod (9pfs server) and the forwarder are on the development host, where
+the root filesystem is actually stored. The gadget is initialized during
+boot (or later) on the embedded board. Then the forwarder will find it
+on the USB bus and start forwarding requests.
+
+In this case the 9p requests come from the device and are handled by the
+host. The reason is that USB device ports are normally not available on
+PCs, so a connection in the other direction would not work.
+
+When using the usbg transport, there is currently no native usb host
+service capable of handling the requests from the gadget driver. For
+this we have to use the extra python tool p9_fwd.py from tools/usb.
+
+Just start a 9pfs-capable network server such as diod or nfs-ganesha, e.g.::
+
+ $ diod -f -n -d 0 -S -l 0.0.0.0:9999 -e $PWD
+
+Optionally, scan your bus to find the gadgets' paths if there is more than one usbg gadget::
+
+ $ python $kernel_dir/tools/usb/p9_fwd.py list
+
+ Bus | Addr | Manufacturer | Product | ID | Path
+ --- | ---- | ---------------- | ---------------- | --------- | ----
+ 2 | 67 | unknown | unknown | 1d6b:0109 | 2-1.1.2
+ 2 | 68 | unknown | unknown | 1d6b:0109 | 2-1.1.3
+
+Then start the python transport::
+
+ $ python $kernel_dir/tools/usb/p9_fwd.py --path 2-1.1.2 connect -p 9999
+
+After that the gadget driver can be used as described above.
+
+One use-case is to use it as an alternative to NFS root booting during
+the development of embedded Linux devices.
+
Options
=======
@@ -68,6 +123,7 @@ Options
virtio connect to the next virtio channel available
(from QEMU with trans_virtio module)
rdma connect to a specified RDMA channel
+ usbg connect to a specified usb gadget channel
======== ============================================
uname=name user name to attempt mount as on the remote server. The
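For reference, the usbg mount described above can also be issued from C via
mount(2). A minimal sketch, where the tag "usb9p0", the mount point, and the
option string are placeholders rather than values taken from the patch::

    #include <stdio.h>
    #include <sys/mount.h>

    int main(void)
    {
            const char *opts = "trans=usbg,aname=/path/to/fs,uname=root";

            /* source is the usb gadget tag, i.e. the configfs instance name */
            if (mount("usb9p0", "/mnt/9", "9p", 0, opts) != 0) {
                    perror("mount 9p");
                    return 1;
            }
            return 0;
    }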
diff --git a/Documentation/filesystems/api-summary.rst b/Documentation/filesystems/api-summary.rst
index 98db2ea5fa12..cc5cc7f3fbd8 100644
--- a/Documentation/filesystems/api-summary.rst
+++ b/Documentation/filesystems/api-summary.rst
@@ -56,9 +56,6 @@ Other Functions
.. kernel-doc:: fs/namei.c
:export:
-.. kernel-doc:: fs/buffer.c
- :export:
-
.. kernel-doc:: block/bio.c
:export:
diff --git a/Documentation/filesystems/autofs.rst b/Documentation/filesystems/autofs.rst
index 3b6e38e646cd..5eb02394fcc3 100644
--- a/Documentation/filesystems/autofs.rst
+++ b/Documentation/filesystems/autofs.rst
@@ -18,7 +18,7 @@ key advantages:
2. The names and locations of filesystems can be stored in
a remote database and can change at any time. The content
- in that data base at the time of access will be used to provide
+ in that database at the time of access will be used to provide
a target for the access. The interpretation of names in the
filesystem can even be programmatic rather than database-backed,
allowing wildcards for example, and can vary based on the user who
@@ -423,7 +423,7 @@ The available ioctl commands are:
and objects are expired if the are not in use.
**AUTOFS_EXP_FORCED** causes the in use status to be ignored
- and objects are expired ieven if they are in use. This assumes
+ and objects are expired even if they are in use. This assumes
that the daemon has requested this because it is capable of
performing the umount.
@@ -442,7 +442,7 @@ which can be used to communicate directly with the autofs filesystem.
It requires CAP_SYS_ADMIN for access.
The 'ioctl's that can be used on this device are described in a separate
-document `autofs-mount-control.txt`, and are summarised briefly here.
+document `autofs-mount-control.rst`, and are summarised briefly here.
Each ioctl is passed a pointer to an `autofs_dev_ioctl` structure::
struct autofs_dev_ioctl {
diff --git a/Documentation/filesystems/bcachefs/CodingStyle.rst b/Documentation/filesystems/bcachefs/CodingStyle.rst
new file mode 100644
index 000000000000..b29562a6bf55
--- /dev/null
+++ b/Documentation/filesystems/bcachefs/CodingStyle.rst
@@ -0,0 +1,186 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+bcachefs coding style
+=====================
+
+Good development is like gardening, and codebases are our gardens. Tend to them
+every day; look for little things that are out of place or in need of tidying.
+A little weeding here and there goes a long way; don't wait until things have
+spiraled out of control.
+
+Things don't always have to be perfect - nitpicking often does more harm than
+good. But appreciate beauty when you see it - and let people know.
+
+The code that you are afraid to touch is the code most in need of refactoring.
+
+A little organizing here and there goes a long way.
+
+Put real thought into how you organize things.
+
+Good code is readable code, where the structure is simple and leaves nowhere
+for bugs to hide.
+
+Assertions are one of our most important tools for writing reliable code. If in
+the course of writing a patchset you encounter a condition that shouldn't
+happen (and will have unpredictable or undefined behaviour if it does), or
+you're not sure if it can happen and not sure how to handle it yet - make it a
+BUG_ON(). Don't leave undefined or unspecified behavior lurking in the codebase.
+
+By the time you finish the patchset, you should understand better which
+assertions need to be handled and turned into checks with error paths, and
+which should be logically impossible. Leave the BUG_ON()s in for the ones which
+are logically impossible. (Or, make them debug mode assertions if they're
+expensive - but don't turn everything into a debug mode assertion, so that
+we're not stuck debugging undefined behaviour should it turn out that you were
+wrong).
+
+Assertions are documentation that can't go out of date. Good assertions are
+wonderful.
+
+Good assertions drastically and dramatically reduce the amount of testing
+required to shake out bugs.
+
+Good assertions are based on state, not logic. To write good assertions, you
+have to think about what the invariants on your state are.
+
+Good invariants and assertions will hold everywhere in your codebase. This
+means that you can run them in only a few places in the checked in version, but
+should you need to debug something that caused the assertion to fail, you can
+quickly shotgun them everywhere to find the codepath that broke the invariant.
+
+A good assertion checks something that the compiler could check for us, and
+elide - if we were working in a language with embedded correctness proofs that
+the compiler could check. This is something that exists today, but it'll likely
+still be a few decades before it comes to systems programming languages. But we
+can still incorporate that kind of thinking into our code and document the
+invariants with runtime checks - much like the way people working in
+dynamically typed languages may add type annotations, gradually making their
+code statically typed.
+
+Looking for ways to make your assertions simpler - and higher level - will
+often nudge you towards making the entire system simpler and more robust.
+
+Good code is code where you can poke around and see what it's doing -
+introspection. We can't debug anything if we can't see what's going on.
+
+Whenever we're debugging, and the solution isn't immediately obvious, if the
+issue is that we don't know where the issue is because we can't see what's
+going on - fix that first.
+
+We have the tools to make anything visible at runtime, efficiently - RCU and
+percpu data structures among them. Don't let things stay hidden.
+
+The most important tool for introspection is the humble pretty printer - in
+bcachefs, this means `*_to_text()` functions, which output to printbufs.
+
+Pretty printers are wonderful, because they compose and you can use them
+everywhere. Having functions to print whatever object you're working with will
+make your error messages much easier to write (therefore they will actually
+exist) and much more informative. And they can be used from sysfs/debugfs, as
+well as tracepoints.
+
+Runtime info and debugging tools should come with clear descriptions and
+labels, and good structure - we don't want files with a list of bare integers,
+like in procfs. Part of the job of the debugging tools is to educate users and
+new developers as to how the system works.
+
+Error messages should, whenever possible, tell you everything you need to debug
+the issue. It's worth putting effort into them.
+
+Tracepoints shouldn't be the first thing you reach for. They're an important
+tool, but always look for more immediate ways to make things visible. When we
+have to rely on tracing, we have to know which tracepoints we're looking for,
+and then we have to run the troublesome workload, and then we have to sift
+through logs. This is a lot of steps to go through when a user is hitting
+something, and if it's intermittent it may not even be possible.
+
+The humble counter is an incredibly useful tool. They're cheap and simple to
+use, and many complicated internal operations with lots of things that can
+behave weirdly (anything involving memory reclaim, for example) become
+shockingly easy to debug once you have counters on every distinct codepath.
+
+Persistent counters are even better.
+
+When debugging, try to get the most out of every bug you come across; don't
+rush to fix the initial issue. Look for things that will make related bugs
+easier the next time around - introspection, new assertions, better error
+messages, new debug tools, and do those first. Look for ways to make the system
+better behaved; often one bug will uncover several other bugs through
+downstream effects.
+
+Fix all that first, and then the original bug last - even if that means keeping
+a user waiting. They'll thank you in the long run, and when they understand
+what you're doing you'll be amazed at how patient they're happy to be. Users
+like to help - otherwise they wouldn't be reporting the bug in the first place.
+
+Talk to your users. Don't isolate yourself.
+
+Users notice all sorts of interesting things, and by just talking to them and
+interacting with them you can benefit from their experience.
+
+Spend time doing support and helpdesk stuff. Don't just write code - code isn't
+finished until it's being used trouble free.
+
+This will also motivate you to make your debugging tools as good as possible,
+and perhaps even your documentation, too. Like anything else in life, the more
+time you spend at it the better you'll get, and you the developer are the
+person most able to improve the tools to make debugging quick and easy.
+
+Be wary of how you take on and commit to big projects. Don't let development
+become product-manager focused. Oftentimes an idea is a good one but needs to
+wait for its proper time - but you won't know if it's the proper time for an
+idea until you start writing code.
+
+Expect to throw a lot of things away, or leave them half finished for later.
+Nobody writes all perfect code that all gets shipped, and you'll be much more
+productive in the long run if you notice this early and shift to something
+else. The experience gained and lessons learned will be valuable for all the
+other work you do.
+
+But don't be afraid to tackle projects that require significant rework of
+existing code. Sometimes these can be the best projects, because they can lead
+us to make existing code more general, more flexible, more multipurpose and
+perhaps more robust. Just don't hesitate to abandon the idea if it looks like
+it's going to make a mess of things.
+
+Complicated features can often be done as a series of refactorings, with the
+final change that actually implements the feature as a quite small patch at the
+end. It's wonderful when this happens, especially when those refactorings are
+things that improve the codebase in their own right. When that happens there's
+much less risk of wasted effort if the feature you were going for doesn't work
+out.
+
+Always strive to work incrementally. Always strive to turn the big projects
+into little bite sized projects that can prove their own merits.
+
+Instead of always tackling those big projects, look for little things that
+will be useful, and make the big projects easier.
+
+The question of what's likely to be useful is where junior developers most
+often go astray - doing something because it seems like it'll be useful often
+leads to overengineering. Knowing what's useful comes from many years of
+experience, or talking with people who have that experience - or from simply
+reading lots of code and looking for common patterns and issues. Don't be
+afraid to throw things away and do something simpler.
+
+Talk about your ideas with your fellow developers; often times the best things
+come from relaxed conversations where people aren't afraid to say "what if?".
+
+Don't neglect your tools.
+
+The most important tools (besides the compiler and our text editor) are the
+tools we use for testing. The shortest possible edit/test/debug cycle is
+essential for working productively. We learn, gain experience, and discover the
+errors in our thinking by running our code and seeing what happens. If your
+time is being wasted because your tools are bad or too slow - don't accept it,
+fix it.
+
+Put effort into your documentation, commit messages, and code comments - but
+don't go overboard. A good commit message is wonderful - but if the information
+was important enough to go in a commit message, ask yourself if it would be
+even better as a code comment.
+
+A good code comment is wonderful, but even better is the comment that didn't
+need to exist because the code was so straightforward as to be obvious;
+organized into small clean and tidy modules, with clear and descriptive names
+for functions and variables, where every line of code has a clear purpose.
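To make the advice above on state-based assertions concrete, here is a hedged
C sketch; the structure and field names are invented for illustration, and
only BUG_ON()/WARN_ON_ONCE() and the list helpers are real kernel APIs::

    #include <linux/bug.h>
    #include <linux/list.h>

    struct demo_freelist {
            unsigned long    nr_free;     /* entries currently on the list */
            unsigned long    capacity;    /* entries ever added */
            struct list_head head;
    };

    /* An assertion on state rather than logic: it holds anywhere in the
     * code, so while debugging it can be shotgunned into every helper
     * that touches a demo_freelist to find the codepath breaking it.
     */
    static inline void demo_freelist_check(struct demo_freelist *fl)
    {
            BUG_ON(fl->nr_free > fl->capacity);
            WARN_ON_ONCE(fl->nr_free == 0 && !list_empty(&fl->head));
    }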
diff --git a/Documentation/filesystems/bcachefs/SubmittingPatches.rst b/Documentation/filesystems/bcachefs/SubmittingPatches.rst
new file mode 100644
index 000000000000..026b12ae0d6a
--- /dev/null
+++ b/Documentation/filesystems/bcachefs/SubmittingPatches.rst
@@ -0,0 +1,98 @@
+Submitting patches to bcachefs:
+===============================
+
+Patches must be tested before being submitted, either with the xfstests suite
+[0], or the full bcachefs test suite in ktest [1], depending on what's being
+touched. Note that ktest wraps xfstests and will be an easier method to running
+it for most users; it includes single-command wrappers for all the mainstream
+in-kernel local filesystems.
+
+Patches will undergo more testing after being merged (including
+lockdep/kasan/preempt/etc. variants); these are not generally required to be
+run by the submitter - but do put some thought into what you're changing and
+which tests might be relevant. For example, tricky memory layout work calls
+for kasan, locking work calls for lockdep - and ktest includes single-command
+variants for the debug build types you'll most likely need.
+
+The exception to this rule is incomplete WIP/RFC patches: if you're working on
+something nontrivial, it's encouraged to send out a WIP patch to let people
+know what you're doing and make sure you're on the right track. Just make sure
+it includes a brief note as to what's done and what's incomplete, to avoid
+confusion.
+
+Rigorous checkpatch.pl adherence is not required (many of its warnings are
+considered out of date), but try not to deviate too much without reason.
+
+Focus on writing code that reads well and is organized well; code should be
+aesthetically pleasing.
+
+CI:
+===
+
+Instead of running your tests locally, when running the full test suite it's
+preferable to let a server farm do it in parallel, and then have the results
+in a nice test dashboard (which can tell you which failures are new, and
+presents results in a git log view, avoiding the need for most bisecting).
+
+That exists [2], and community members may request an account. If you work for
+a big tech company, you'll need to help out with server costs to get access -
+but the CI is not restricted to running bcachefs tests: it runs any ktest test
+(which generally makes it easy to wrap other tests that can run in qemu).
+
+Other things to think about:
+============================
+
+- How will we debug this code? Is there sufficient introspection to diagnose
+ when something starts acting wonky on a user machine?
+
+ We don't necessarily need every single field of every data structure visible
+ with introspection, but having the important fields of all the core data
+ types wired up makes debugging drastically easier - a bit of thoughtful
+ foresight greatly reduces the need to have people build custom kernels with
+ debug patches.
+
+ More broadly, think about all the debug tooling that might be needed.
+
+- Does it make the codebase more or less of a mess? Can we also try to do some
+ organizing, too?
+
+- Do new tests need to be written? New assertions? How do we know and verify
+ that the code is correct, and what happens if something goes wrong?
+
+ We don't yet have automated code coverage analysis or easy fault injection -
+ but for now, pretend we did and ask what they might tell us.
+
+ Assertions are hugely important, given that we don't yet have a systems
+ language that can do ergonomic embedded correctness proofs. Hitting an assert
+ in testing is much better than wandering off into undefined behaviour la-la
+ land - use them. Use them judiciously, and not as a replacement for proper
+ error handling, but use them.
+
+- Does it need to be performance tested? Should we add new performance counters?
+
+ bcachefs has a set of persistent runtime counters which can be viewed with
+ the 'bcachefs fs top' command; this should give users a basic idea of what
+ their filesystem is currently doing. If you're doing a new feature or looking
+ at old code, think if anything should be added.
+
+- If it's a new on disk format feature - have upgrades and downgrades been
+ tested? (Automated tests exist but aren't in the CI, due to the hassle of
+ disk image management; coordinate to have them run.)
+
+Mailing list, IRC:
+==================
+
+Patches should hit the list [3], but much discussion and code review happens on
+IRC as well [4]; many people appreciate the more conversational approach and
+quicker feedback.
+
+Additionally, we have a lively user community doing excellent QA work, which
+exists primarily on IRC. Please make use of that resource; user feedback is
+important for any nontrivial feature, and documenting it in commit messages
+would be a good idea.
+
+[0]: git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git
+[1]: https://evilpiepirate.org/git/ktest.git/
+[2]: https://evilpiepirate.org/~testdashboard/ci/
+[3]: linux-bcachefs@vger.kernel.org
+[4]: irc.oftc.net#bcache, #bcachefs-dev
diff --git a/Documentation/filesystems/bcachefs/errorcodes.rst b/Documentation/filesystems/bcachefs/errorcodes.rst
new file mode 100644
index 000000000000..2cccaa0ba7cd
--- /dev/null
+++ b/Documentation/filesystems/bcachefs/errorcodes.rst
@@ -0,0 +1,30 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+bcachefs private error codes
+----------------------------
+
+In bcachefs, as a hard rule we do not throw or directly use standard error
+codes (-EINVAL, -EBUSY, etc.). Instead, we define private error codes as needed
+in fs/bcachefs/errcode.h.
+
+This gives us much better error messages and makes debugging much easier. Any
+direct uses of standard error codes you see in the source code are simply old
+code that has yet to be converted - feel free to clean it up!
+
+Private error codes may subtype another error code; this allows for grouping of
+related errors that should be handled similarly (e.g. transaction restart
+errors), as well as specifying which standard error code should be returned at
+the bcachefs module boundary.
+
+At the module boundary, we use bch2_err_class() to convert to a standard error
+code; this also emits a trace event so that the original error code can be
+recovered even if it wasn't logged.
+
+Do not reuse error codes! Generally speaking, a private error code should only
+be thrown in one place. That means that when we see it in a log message we can
+see, unambiguously, exactly which file and line number it was returned from.
+
+Try to give error codes names that are as reasonably descriptive of the error
+as possible. Frequently, the error will be logged at a place far removed from
+where the error was generated; good names for error codes mean much more
+descriptive and useful error messages.
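A minimal, self-contained C sketch of the subtyping idea described above; the
names are invented and are not the actual fs/bcachefs/errcode.h definitions::

    #include <errno.h>

    enum demo_errcode {
            DEMO_ERR_START = 2048,          /* outside the standard errno range */
            DEMO_ERR_btree_node_read,       /* subtypes EIO */
            DEMO_ERR_transaction_restart,   /* subtypes EAGAIN */
    };

    /* At the module boundary, convert a private code back to a standard
     * errno, analogous to what the text describes bch2_err_class() doing.
     */
    static inline int demo_err_class(int err)
    {
            switch (-err) {
            case DEMO_ERR_btree_node_read:          return -EIO;
            case DEMO_ERR_transaction_restart:      return -EAGAIN;
            default:                                return err;
            }
    }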
diff --git a/Documentation/filesystems/bcachefs/index.rst b/Documentation/filesystems/bcachefs/index.rst
new file mode 100644
index 000000000000..7db4d7ceab58
--- /dev/null
+++ b/Documentation/filesystems/bcachefs/index.rst
@@ -0,0 +1,13 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======================
+bcachefs Documentation
+======================
+
+.. toctree::
+ :maxdepth: 2
+ :numbered:
+
+ CodingStyle
+ SubmittingPatches
+ errorcodes
diff --git a/Documentation/filesystems/buffer.rst b/Documentation/filesystems/buffer.rst
new file mode 100644
index 000000000000..ae24faf68efa
--- /dev/null
+++ b/Documentation/filesystems/buffer.rst
@@ -0,0 +1,12 @@
+Buffer Heads
+============
+
+Linux uses buffer heads to maintain state about individual filesystem blocks.
+Buffer heads are deprecated and new filesystems should use iomap instead.
+
+Functions
+---------
+
+.. kernel-doc:: include/linux/buffer_head.h
+.. kernel-doc:: fs/buffer.c
+ :export:
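A minimal sketch of the buffer head API that the kernel-doc above pulls in;
sb_bread() and brelse() are the real helpers, while the calling context (the
superblock and block number) is assumed::

    #include <linux/buffer_head.h>
    #include <linux/errno.h>

    /* Read one block through the (deprecated) buffer head interface. */
    static int demo_read_block(struct super_block *sb, sector_t blocknr)
    {
            struct buffer_head *bh = sb_bread(sb, blocknr);

            if (!bh)
                    return -EIO;

            /* bh->b_data now points at sb->s_blocksize bytes of data */
            brelse(bh);
            return 0;
    }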
diff --git a/Documentation/filesystems/caching/cachefiles.rst b/Documentation/filesystems/caching/cachefiles.rst
index e04a27bdbe19..b3ccc782cb3b 100644
--- a/Documentation/filesystems/caching/cachefiles.rst
+++ b/Documentation/filesystems/caching/cachefiles.rst
@@ -115,7 +115,7 @@ set up cache ready for use. The following script commands are available:
This mask can also be set through sysfs, eg::
- echo 5 >/sys/modules/cachefiles/parameters/debug
+ echo 5 > /sys/module/cachefiles/parameters/debug
Starting the Cache
diff --git a/Documentation/filesystems/caching/fscache.rst b/Documentation/filesystems/caching/fscache.rst
index a74d7b052dc1..de1f32526cc1 100644
--- a/Documentation/filesystems/caching/fscache.rst
+++ b/Documentation/filesystems/caching/fscache.rst
@@ -318,10 +318,10 @@ where the columns are:
Debugging
=========
-If CONFIG_FSCACHE_DEBUG is enabled, the FS-Cache facility can have runtime
-debugging enabled by adjusting the value in::
+If CONFIG_NETFS_DEBUG is enabled, the FS-Cache facility and NETFS support can
+have runtime debugging enabled by adjusting the value in::
- /sys/module/fscache/parameters/debug
+ /sys/module/netfs/parameters/debug
This is a bitmask of debugging streams to enable:
@@ -343,6 +343,6 @@ This is a bitmask of debugging streams to enable:
The appropriate set of values should be OR'd together and the result written to
the control file. For example::
- echo $((1|8|512)) >/sys/module/fscache/parameters/debug
+ echo $((1|8|512)) >/sys/module/netfs/parameters/debug
will turn on all function entry debugging.
diff --git a/Documentation/filesystems/ceph.rst b/Documentation/filesystems/ceph.rst
index 085f309ece60..6d2276a87a5a 100644
--- a/Documentation/filesystems/ceph.rst
+++ b/Documentation/filesystems/ceph.rst
@@ -67,12 +67,15 @@ Snapshot names have two limitations:
more than 255 characters, and `<node-id>` takes 13 characters, the long
snapshot names can take as much as 255 - 1 - 1 - 13 = 240.
-Ceph also provides some recursive accounting on directories for nested
-files and bytes. That is, a 'getfattr -d foo' on any directory in the
-system will reveal the total number of nested regular files and
-subdirectories, and a summation of all nested file sizes. This makes
-the identification of large disk space consumers relatively quick, as
-no 'du' or similar recursive scan of the file system is required.
+Ceph also provides some recursive accounting on directories for nested files
+and bytes. You can run the commands::
+
+ getfattr -n ceph.dir.rfiles /some/dir
+ getfattr -n ceph.dir.rbytes /some/dir
+
+to get the total number of nested files and their combined size, respectively.
+This makes the identification of large disk space consumers relatively quick,
+as no 'du' or similar recursive scan of the file system is required.
Finally, Ceph also allows quotas to be set on any directory in the system.
The quota can restrict the number of bytes or the number of files stored
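The same recursive statistics shown above with getfattr(1) can be read from C
via getxattr(2); a sketch, with a placeholder path::

    #include <stdio.h>
    #include <sys/xattr.h>

    int main(void)
    {
            char buf[64];
            ssize_t len = getxattr("/some/dir", "ceph.dir.rbytes",
                                   buf, sizeof(buf) - 1);

            if (len < 0) {
                    perror("getxattr");
                    return 1;
            }
            buf[len] = '\0';
            printf("recursive bytes: %s\n", buf);
            return 0;
    }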
diff --git a/Documentation/filesystems/debugfs.rst b/Documentation/filesystems/debugfs.rst
index dc35da8b8792..f7f977ffbf8d 100644
--- a/Documentation/filesystems/debugfs.rst
+++ b/Documentation/filesystems/debugfs.rst
@@ -211,18 +211,16 @@ seq_file content.
There are a couple of other directory-oriented helper functions::
- struct dentry *debugfs_rename(struct dentry *old_dir,
- struct dentry *old_dentry,
- struct dentry *new_dir,
- const char *new_name);
+ int debugfs_change_name(struct dentry *dentry,
+ const char *fmt, ...);
struct dentry *debugfs_create_symlink(const char *name,
struct dentry *parent,
const char *target);
-A call to debugfs_rename() will give a new name to an existing debugfs
-file, possibly in a different directory. The new_name must not exist prior
-to the call; the return value is old_dentry with updated information.
+A call to debugfs_change_name() will give a new name to an existing debugfs
+file, always in the same directory. The new_name must not exist prior
+to the call; the return value is 0 on success and -E... on failure.
Symbolic links can be created with debugfs_create_symlink().
There is one important thing that all debugfs users must take into account:
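A hedged sketch of how these helpers fit together; the "demo" names are
placeholders, while debugfs_create_dir(), debugfs_create_u32() and the
debugfs_change_name() call described in the hunk above are the real APIs::

    #include <linux/debugfs.h>

    static struct dentry *demo_dir;
    static u32 demo_counter;

    static int __init demo_debugfs_init(void)
    {
            demo_dir = debugfs_create_dir("demo", NULL);
            debugfs_create_u32("counter", 0444, demo_dir, &demo_counter);

            /* Rename the directory in place; returns 0 on success. */
            return debugfs_change_name(demo_dir, "demo-%d", 1);
    }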
diff --git a/Documentation/filesystems/directory-locking.rst b/Documentation/filesystems/directory-locking.rst
index 05ea387bc9fb..6fdf0b02df43 100644
--- a/Documentation/filesystems/directory-locking.rst
+++ b/Documentation/filesystems/directory-locking.rst
@@ -44,7 +44,7 @@ For our purposes all operations fall in 6 classes:
* decide which of the source and target need to be locked.
The source needs to be locked if it's a non-directory, target - if it's
a non-directory or about to be removed.
- * take the locks that need to be taken (exlusive), in inode pointer order
+ * take the locks that need to be taken (exclusive), in inode pointer order
if need to take both (that can happen only when both source and target
are non-directories - the source because it wouldn't need to be locked
otherwise and the target because mixing directory and non-directory is
@@ -234,7 +234,7 @@ among the children, in some order. But that is also impossible, since
neither of the children is a descendent of another.
That concludes the proof, since the set of operations with the
-properties requiered for a minimal deadlock can not exist.
+properties required for a minimal deadlock can not exist.
Note that the check for having a common ancestor in cross-directory
rename is crucial - without it a deadlock would be possible. Indeed,
diff --git a/Documentation/filesystems/dlmfs.rst b/Documentation/filesystems/dlmfs.rst
index 7e2b1fd471d7..70d4e48242c3 100644
--- a/Documentation/filesystems/dlmfs.rst
+++ b/Documentation/filesystems/dlmfs.rst
@@ -36,7 +36,7 @@ None
Usage
=====
-If you're just interested in OCFS2, then please see ocfs2.txt. The
+If you're just interested in OCFS2, then please see ocfs2.rst. The
rest of this document will be geared towards those who want to use
dlmfs for easy to setup and easy to use clustered locking in
userspace.
diff --git a/Documentation/filesystems/efivarfs.rst b/Documentation/filesystems/efivarfs.rst
index 0551985821b8..f646c3f0980f 100644
--- a/Documentation/filesystems/efivarfs.rst
+++ b/Documentation/filesystems/efivarfs.rst
@@ -40,4 +40,4 @@ accidentally.
*See also:*
- Documentation/admin-guide/acpi/ssdt-overlays.rst
-- Documentation/ABI/stable/sysfs-firmware-efi-vars
+- Documentation/ABI/removed/sysfs-firmware-efi-vars
diff --git a/Documentation/filesystems/erofs.rst b/Documentation/filesystems/erofs.rst
index cc4626d6ee4f..c293f8e37468 100644
--- a/Documentation/filesystems/erofs.rst
+++ b/Documentation/filesystems/erofs.rst
@@ -75,7 +75,7 @@ Here are the main features of EROFS:
- Support merging tail-end data into a special inode as fragments.
- - Support large folios for uncompressed files.
+ - Support large folios to make use of THPs (Transparent Hugepages);
- Support direct I/O on uncompressed files to avoid double caching for loop
devices;
diff --git a/Documentation/filesystems/f2fs.rst b/Documentation/filesystems/f2fs.rst
index d32c6209685d..fb7d2ee022bc 100644
--- a/Documentation/filesystems/f2fs.rst
+++ b/Documentation/filesystems/f2fs.rst
@@ -126,9 +126,7 @@ norecovery Disable the roll-forward recovery routine, mounted read-
discard/nodiscard Enable/disable real-time discard in f2fs, if discard is
enabled, f2fs will issue discard/TRIM commands when a
segment is cleaned.
-no_heap Disable heap-style segment allocation which finds free
- segments for data from the beginning of main area, while
- for node from the end of main area.
+heap/no_heap Deprecated.
nouser_xattr Disable Extended User Attributes. Note: xattr is enabled
by default if CONFIG_F2FS_FS_XATTR is selected.
noacl Disable POSIX Access Control List. Note: acl is enabled
@@ -184,29 +182,31 @@ fault_type=%d Support configuring fault injection type, should be
enabled with fault_injection option, fault type value
is shown below, it supports single or combined type.
- =================== ===========
- Type_Name Type_Value
- =================== ===========
- FAULT_KMALLOC 0x000000001
- FAULT_KVMALLOC 0x000000002
- FAULT_PAGE_ALLOC 0x000000004
- FAULT_PAGE_GET 0x000000008
- FAULT_ALLOC_BIO 0x000000010 (obsolete)
- FAULT_ALLOC_NID 0x000000020
- FAULT_ORPHAN 0x000000040
- FAULT_BLOCK 0x000000080
- FAULT_DIR_DEPTH 0x000000100
- FAULT_EVICT_INODE 0x000000200
- FAULT_TRUNCATE 0x000000400
- FAULT_READ_IO 0x000000800
- FAULT_CHECKPOINT 0x000001000
- FAULT_DISCARD 0x000002000
- FAULT_WRITE_IO 0x000004000
- FAULT_SLAB_ALLOC 0x000008000
- FAULT_DQUOT_INIT 0x000010000
- FAULT_LOCK_OP 0x000020000
- FAULT_BLKADDR 0x000040000
- =================== ===========
+ =========================== ===========
+ Type_Name Type_Value
+ =========================== ===========
+ FAULT_KMALLOC 0x000000001
+ FAULT_KVMALLOC 0x000000002
+ FAULT_PAGE_ALLOC 0x000000004
+ FAULT_PAGE_GET 0x000000008
+ FAULT_ALLOC_BIO 0x000000010 (obsolete)
+ FAULT_ALLOC_NID 0x000000020
+ FAULT_ORPHAN 0x000000040
+ FAULT_BLOCK 0x000000080
+ FAULT_DIR_DEPTH 0x000000100
+ FAULT_EVICT_INODE 0x000000200
+ FAULT_TRUNCATE 0x000000400
+ FAULT_READ_IO 0x000000800
+ FAULT_CHECKPOINT 0x000001000
+ FAULT_DISCARD 0x000002000
+ FAULT_WRITE_IO 0x000004000
+ FAULT_SLAB_ALLOC 0x000008000
+ FAULT_DQUOT_INIT 0x000010000
+ FAULT_LOCK_OP 0x000020000
+ FAULT_BLKADDR_VALIDITY 0x000040000
+ FAULT_BLKADDR_CONSISTENCE 0x000080000
+ FAULT_NO_SEGMENT 0x000100000
+ =========================== ===========
mode=%s Control block allocation mode which supports "adaptive"
and "lfs". In "lfs" mode, there should be no random
writes towards main area.
@@ -228,8 +228,6 @@ mode=%s Control block allocation mode which supports "adaptive"
option for more randomness.
Please, use these options for your experiments and we strongly
recommend to re-format the filesystem after using these options.
-io_bits=%u Set the bit size of write IO requests. It should be set
- with "mode=lfs".
usrquota Enable plain user disk quota accounting.
grpquota Enable plain group disk quota accounting.
prjquota Enable plain project quota accounting.
@@ -776,6 +774,35 @@ In order to identify whether the data in the victim segment are valid or not,
F2FS manages a bitmap. Each bit represents the validity of a block, and the
bitmap is composed of a bit stream covering whole blocks in main area.
+Write-hint Policy
+-----------------
+
+F2FS always sets the write hint (whint) according to the policy below.
+
+===================== ======================== ===================
+User F2FS Block
+===================== ======================== ===================
+N/A META WRITE_LIFE_NONE|REQ_META
+N/A HOT_NODE WRITE_LIFE_NONE
+N/A WARM_NODE WRITE_LIFE_MEDIUM
+N/A COLD_NODE WRITE_LIFE_LONG
+ioctl(COLD) COLD_DATA WRITE_LIFE_EXTREME
+extension list " "
+
+-- buffered io
+N/A COLD_DATA WRITE_LIFE_EXTREME
+N/A HOT_DATA WRITE_LIFE_SHORT
+N/A WARM_DATA WRITE_LIFE_NOT_SET
+
+-- direct io
+WRITE_LIFE_EXTREME COLD_DATA WRITE_LIFE_EXTREME
+WRITE_LIFE_SHORT HOT_DATA WRITE_LIFE_SHORT
+WRITE_LIFE_NOT_SET WARM_DATA WRITE_LIFE_NOT_SET
+WRITE_LIFE_NONE " WRITE_LIFE_NONE
+WRITE_LIFE_MEDIUM " WRITE_LIFE_MEDIUM
+WRITE_LIFE_LONG " WRITE_LIFE_LONG
+===================== ======================== ===================
+
Fallocate(2) Policy
-------------------
@@ -916,3 +943,47 @@ NVMe Zoned Namespace devices
can start before the zone-capacity and span across zone-capacity boundary.
Such spanning segments are also considered as usable segments. All blocks
past the zone-capacity are considered unusable in these segments.
+
+Device aliasing feature
+-----------------------
+
+f2fs can utilize a special file called a "device aliasing file." This file allows
+the entire storage device to be mapped with a single, large extent, not using
+the usual f2fs node structures. This mapped area is pinned and primarily intended
+for holding the space.
+
+Essentially, this mechanism allows a portion of the f2fs area to be temporarily
+reserved and used by another filesystem or for different purposes. Once that
+external usage is complete, the device aliasing file can be deleted, releasing
+the reserved space back to F2FS for its own use.
+
+<use-case>
+
+# ls /dev/vd*
+/dev/vdb (32GB) /dev/vdc (32GB)
+# mkfs.ext4 /dev/vdc
+# mkfs.f2fs -c /dev/vdc@vdc.file /dev/vdb
+# mount /dev/vdb /mnt/f2fs
+# ls -l /mnt/f2fs
+vdc.file
+# df -h
+/dev/vdb 64G 33G 32G 52% /mnt/f2fs
+
+# mount -o loop /dev/vdc /mnt/ext4
+# df -h
+/dev/vdb 64G 33G 32G 52% /mnt/f2fs
+/dev/loop7 32G 24K 30G 1% /mnt/ext4
+# umount /mnt/ext4
+
+# f2fs_io getflags /mnt/f2fs/vdc.file
+get a flag on /mnt/f2fs/vdc.file ret=0, flags=nocow(pinned),immutable
+# f2fs_io setflags noimmutable /mnt/f2fs/vdc.file
+get a flag on noimmutable ret=0, flags=800010
+set a flag on /mnt/f2fs/vdc.file ret=0, flags=noimmutable
+# rm /mnt/f2fs/vdc.file
+# df -h
+/dev/vdb 64G 753M 64G 2% /mnt/f2fs
+
+So, the key idea is that the user can do any file operations on /dev/vdc and
+reclaim the space after use, while the space is counted as part of /data.
+This doesn't require modifying the partition size or the filesystem format.
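The direct I/O rows of the write-hint table in the f2fs hunk above correspond
to hints an application sets itself with fcntl(F_SET_RW_HINT). A minimal
sketch with a placeholder path; the F_SET_RW_HINT and RWH_* constants come
from <linux/fcntl.h> and are exposed by glibc with _GNU_SOURCE::

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
            uint64_t hint = RWH_WRITE_LIFE_EXTREME;   /* -> COLD_DATA per the table */
            int fd = open("/mnt/f2fs/archive.bin",
                          O_WRONLY | O_CREAT | O_DIRECT, 0644);

            if (fd < 0 || fcntl(fd, F_SET_RW_HINT, &hint) < 0) {
                    perror("F_SET_RW_HINT");
                    return 1;
            }
            return 0;
    }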
diff --git a/Documentation/filesystems/fiemap.rst b/Documentation/filesystems/fiemap.rst
index 93fc96f760aa..23b3ed229e49 100644
--- a/Documentation/filesystems/fiemap.rst
+++ b/Documentation/filesystems/fiemap.rst
@@ -12,21 +12,10 @@ returns a list of extents.
Request Basics
--------------
-A fiemap request is encoded within struct fiemap::
-
- struct fiemap {
- __u64 fm_start; /* logical offset (inclusive) at
- * which to start mapping (in) */
- __u64 fm_length; /* logical length of mapping which
- * userspace cares about (in) */
- __u32 fm_flags; /* FIEMAP_FLAG_* flags for request (in/out) */
- __u32 fm_mapped_extents; /* number of extents that were
- * mapped (out) */
- __u32 fm_extent_count; /* size of fm_extents array (in) */
- __u32 fm_reserved;
- struct fiemap_extent fm_extents[0]; /* array of mapped extents (out) */
- };
+A fiemap request is encoded within struct fiemap:
+
+.. kernel-doc:: include/uapi/linux/fiemap.h
+ :identifiers: fiemap
fm_start, and fm_length specify the logical range within the file
which the process would like mappings for. Extents returned mirror
@@ -60,6 +49,8 @@ FIEMAP_FLAG_XATTR
If this flag is set, the extents returned will describe the inodes
extended attribute lookup tree, instead of its data tree.
+FIEMAP_FLAG_CACHE
+ This flag requests caching of the extents.
Extent Mapping
--------------
@@ -77,18 +68,10 @@ complete the requested range and will not have the FIEMAP_EXTENT_LAST
flag set (see the next section on extent flags).
Each extent is described by a single fiemap_extent structure as
-returned in fm_extents::
-
- struct fiemap_extent {
- __u64 fe_logical; /* logical offset in bytes for the start of
- * the extent */
- __u64 fe_physical; /* physical offset in bytes for the start
- * of the extent */
- __u64 fe_length; /* length in bytes for the extent */
- __u64 fe_reserved64[2];
- __u32 fe_flags; /* FIEMAP_EXTENT_* flags for this extent */
- __u32 fe_reserved[3];
- };
+returned in fm_extents:
+
+.. kernel-doc:: include/uapi/linux/fiemap.h
+ :identifiers: fiemap_extent
All offsets and lengths are in bytes and mirror those on disk. It is valid
for an extents logical offset to start before the request or its logical
@@ -175,6 +158,8 @@ FIEMAP_EXTENT_MERGED
userspace would be highly inefficient, the kernel will try to merge most
adjacent blocks into 'extents'.
+FIEMAP_EXTENT_SHARED
+ This flag is set to request that space be shared with other files.
VFS -> File System Implementation
---------------------------------
@@ -191,14 +176,10 @@ each discovered extent::
u64 len);
->fiemap is passed struct fiemap_extent_info which describes the
-fiemap request::
-
- struct fiemap_extent_info {
- unsigned int fi_flags; /* Flags as passed from user */
- unsigned int fi_extents_mapped; /* Number of mapped extents */
- unsigned int fi_extents_max; /* Size of fiemap_extent array */
- struct fiemap_extent *fi_extents_start; /* Start of fiemap_extent array */
- };
+fiemap request:
+
+.. kernel-doc:: include/linux/fiemap.h
+ :identifiers: fiemap_extent_info
It is intended that the file system should not need to access any of this
structure directly. Filesystem handlers should be tolerant to signals and return
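For reference, a minimal userspace FIEMAP caller matching the request and
extent structures documented above; the file path is a placeholder and up to
32 extents are requested for the whole file::

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>
    #include <linux/fiemap.h>

    int main(void)
    {
            int fd = open("/some/file", O_RDONLY);
            struct fiemap *fm;
            unsigned int i;

            fm = calloc(1, sizeof(*fm) + 32 * sizeof(struct fiemap_extent));
            if (fd < 0 || !fm)
                    return 1;

            fm->fm_start = 0;
            fm->fm_length = FIEMAP_MAX_OFFSET;   /* map the whole file */
            fm->fm_flags = FIEMAP_FLAG_SYNC;
            fm->fm_extent_count = 32;

            if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
                    perror("fiemap");
                    return 1;
            }
            for (i = 0; i < fm->fm_mapped_extents; i++)
                    printf("logical %llu physical %llu length %llu flags %#x\n",
                           (unsigned long long)fm->fm_extents[i].fe_logical,
                           (unsigned long long)fm->fm_extents[i].fe_physical,
                           (unsigned long long)fm->fm_extents[i].fe_length,
                           fm->fm_extents[i].fe_flags);
            return 0;
    }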
diff --git a/Documentation/filesystems/files.rst b/Documentation/filesystems/files.rst
index 9e38e4c221ca..eb770f891b27 100644
--- a/Documentation/filesystems/files.rst
+++ b/Documentation/filesystems/files.rst
@@ -116,7 +116,7 @@ before and after the reference count increment. This pattern can be seen
in get_file_rcu() and __files_get_rcu().
In addition, it isn't possible to access or check fields in struct file
-without first aqcuiring a reference on it under rcu lookup. Not doing
+without first acquiring a reference on it under rcu lookup. Not doing
that was always very dodgy and it was only usable for non-pointer data
in struct file. With SLAB_TYPESAFE_BY_RCU it is necessary that callers
either first acquire a reference or they must hold the files_lock of the
diff --git a/Documentation/filesystems/fscrypt.rst b/Documentation/filesystems/fscrypt.rst
index e86b886b64d0..04eaab01314b 100644
--- a/Documentation/filesystems/fscrypt.rst
+++ b/Documentation/filesystems/fscrypt.rst
@@ -338,11 +338,14 @@ Supported modes
Currently, the following pairs of encryption modes are supported:
-- AES-256-XTS for contents and AES-256-CTS-CBC for filenames
+- AES-256-XTS for contents and AES-256-CBC-CTS for filenames
- AES-256-XTS for contents and AES-256-HCTR2 for filenames
- Adiantum for both contents and filenames
-- AES-128-CBC-ESSIV for contents and AES-128-CTS-CBC for filenames
-- SM4-XTS for contents and SM4-CTS-CBC for filenames
+- AES-128-CBC-ESSIV for contents and AES-128-CBC-CTS for filenames
+- SM4-XTS for contents and SM4-CBC-CTS for filenames
+
+Note: in the API, "CBC" means CBC-ESSIV, and "CTS" means CBC-CTS.
+So, for example, FSCRYPT_MODE_AES_256_CTS means AES-256-CBC-CTS.
Authenticated encryption modes are not currently supported because of
the difficulty of dealing with ciphertext expansion. Therefore,
@@ -351,11 +354,11 @@ contents encryption uses a block cipher in `XTS mode
`CBC-ESSIV mode
<https://en.wikipedia.org/wiki/Disk_encryption_theory#Encrypted_salt-sector_initialization_vector_(ESSIV)>`_,
or a wide-block cipher. Filenames encryption uses a
-block cipher in `CTS-CBC mode
+block cipher in `CBC-CTS mode
<https://en.wikipedia.org/wiki/Ciphertext_stealing>`_ or a wide-block
cipher.
-The (AES-256-XTS, AES-256-CTS-CBC) pair is the recommended default.
+The (AES-256-XTS, AES-256-CBC-CTS) pair is the recommended default.
It is also the only option that is *guaranteed* to always be supported
if the kernel supports fscrypt at all; see `Kernel config options`_.
@@ -364,7 +367,7 @@ upgrades the filenames encryption to use a wide-block cipher. (A
*wide-block cipher*, also called a tweakable super-pseudorandom
permutation, has the property that changing one bit scrambles the
entire result.) As described in `Filenames encryption`_, a wide-block
-cipher is the ideal mode for the problem domain, though CTS-CBC is the
+cipher is the ideal mode for the problem domain, though CBC-CTS is the
"least bad" choice among the alternatives. For more information about
HCTR2, see `the HCTR2 paper <https://eprint.iacr.org/2021/1441.pdf>`_.
@@ -375,13 +378,13 @@ the work is done by XChaCha12, which is much faster than AES when AES
acceleration is unavailable. For more information about Adiantum, see
`the Adiantum paper <https://eprint.iacr.org/2018/720.pdf>`_.
-The (AES-128-CBC-ESSIV, AES-128-CTS-CBC) pair exists only to support
+The (AES-128-CBC-ESSIV, AES-128-CBC-CTS) pair exists only to support
systems whose only form of AES acceleration is an off-CPU crypto
accelerator such as CAAM or CESA that does not support XTS.
The remaining mode pairs are the "national pride ciphers":
-- (SM4-XTS, SM4-CTS-CBC)
+- (SM4-XTS, SM4-CBC-CTS)
Generally speaking, these ciphers aren't "bad" per se, but they
receive limited security review compared to the usual choices such as
@@ -393,7 +396,7 @@ Kernel config options
Enabling fscrypt support (CONFIG_FS_ENCRYPTION) automatically pulls in
only the basic support from the crypto API needed to use AES-256-XTS
-and AES-256-CTS-CBC encryption. For optimal performance, it is
+and AES-256-CBC-CTS encryption. For optimal performance, it is
strongly recommended to also enable any available platform-specific
kconfig options that provide acceleration for the algorithm(s) you
wish to use. Support for any "non-default" encryption modes typically
@@ -407,7 +410,7 @@ kernel crypto API (see `Inline encryption support`_); in that case,
the file contents mode doesn't need to supported in the kernel crypto
API, but the filenames mode still does.
-- AES-256-XTS and AES-256-CTS-CBC
+- AES-256-XTS and AES-256-CBC-CTS
- Recommended:
- arm64: CONFIG_CRYPTO_AES_ARM64_CE_BLK
- x86: CONFIG_CRYPTO_AES_NI_INTEL
@@ -433,7 +436,7 @@ API, but the filenames mode still does.
- x86: CONFIG_CRYPTO_NHPOLY1305_SSE2
- x86: CONFIG_CRYPTO_NHPOLY1305_AVX2
-- AES-128-CBC-ESSIV and AES-128-CTS-CBC:
+- AES-128-CBC-ESSIV and AES-128-CBC-CTS:
- Mandatory:
- CONFIG_CRYPTO_ESSIV
- CONFIG_CRYPTO_SHA256 or another SHA-256 implementation
@@ -521,7 +524,7 @@ alternatively has the file's nonce (for `DIRECT_KEY policies`_) or
inode number (for `IV_INO_LBLK_64 policies`_) included in the IVs.
Thus, IV reuse is limited to within a single directory.
-With CTS-CBC, the IV reuse means that when the plaintext filenames share a
+With CBC-CTS, the IV reuse means that when the plaintext filenames share a
common prefix at least as long as the cipher block size (16 bytes for AES), the
corresponding encrypted filenames will also share a common prefix. This is
undesirable. Adiantum and HCTR2 do not have this weakness, as they are
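A sketch of applying the recommended (AES-256-XTS, AES-256-CBC-CTS) pair to an
empty directory; note that the API name FSCRYPT_MODE_AES_256_CTS corresponds
to AES-256-CBC-CTS, as explained in the hunk above. The directory path is a
placeholder and the key identifier is left unset in this sketch::

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/fscrypt.h>

    int main(void)
    {
            struct fscrypt_policy_v2 pol;
            int fd = open("/mnt/dir", O_RDONLY);

            memset(&pol, 0, sizeof(pol));
            pol.version = FSCRYPT_POLICY_V2;
            pol.contents_encryption_mode = FSCRYPT_MODE_AES_256_XTS;
            pol.filenames_encryption_mode = FSCRYPT_MODE_AES_256_CTS;
            /* pol.master_key_identifier must be filled in with the value
             * returned by FS_IOC_ADD_ENCRYPTION_KEY; omitted here. */

            if (fd < 0 || ioctl(fd, FS_IOC_SET_ENCRYPTION_POLICY, &pol) < 0) {
                    perror("set policy");
                    return 1;
            }
            return 0;
    }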
diff --git a/Documentation/filesystems/fsverity.rst b/Documentation/filesystems/fsverity.rst
index 13e4b18e5dbb..76e538217868 100644
--- a/Documentation/filesystems/fsverity.rst
+++ b/Documentation/filesystems/fsverity.rst
@@ -16,7 +16,7 @@ btrfs filesystems. Like fscrypt, not too much filesystem-specific
code is needed to support fs-verity.
fs-verity is similar to `dm-verity
-<https://www.kernel.org/doc/Documentation/device-mapper/verity.txt>`_
+<https://www.kernel.org/doc/Documentation/admin-guide/device-mapper/verity.rst>`_
but works on files rather than block devices. On regular files on
filesystems supporting fs-verity, userspace can execute an ioctl that
causes the filesystem to build a Merkle tree for the file and persist
@@ -86,6 +86,16 @@ authenticating fs-verity file hashes include:
signature in their "security.ima" extended attribute, as controlled
by the IMA policy. For more information, see the IMA documentation.
+- Integrity Policy Enforcement (IPE). IPE supports enforcing access
+ control decisions based on immutable security properties of files,
+ including those protected by fs-verity's built-in signatures.
+ "IPE policy" specifically allows for the authorization of fs-verity
+ files using properties ``fsverity_digest`` for identifying
+ files by their verity digest, and ``fsverity_signature`` to authorize
+ files with a verified fs-verity built-in signature. For
+ details on configuring IPE policies and understanding its operational
+ modes, please refer to :doc:`IPE admin guide </admin-guide/LSM/ipe>`.
+
- Trusted userspace code in combination with `Built-in signature
verification`_. This approach should be used only with great care.
@@ -457,7 +467,11 @@ Enabling this option adds the following:
On success, the ioctl persists the signature alongside the Merkle
tree. Then, any time the file is opened, the kernel verifies the
file's actual digest against this signature, using the certificates
- in the ".fs-verity" keyring.
+ in the ".fs-verity" keyring. This verification happens as long as the
+ file's signature exists, regardless of the state of the sysctl variable
+ "fs.verity.require_signatures" described in the next item. The IPE LSM
+ relies on this behavior to recognize and label fsverity files
+ that contain a verified built-in fsverity signature.
3. A new sysctl "fs.verity.require_signatures" is made available.
When set to 1, the kernel requires that all verity files have a
@@ -481,7 +495,7 @@ be carefully considered before using them:
- Builtin signature verification does *not* make the kernel enforce
that any files actually have fs-verity enabled. Thus, it is not a
- complete authentication policy. Currently, if it is used, the only
+ complete authentication policy. Currently, if it is used, one
way to complete the authentication policy is for trusted userspace
code to explicitly check whether files have fs-verity enabled with a
signature before they are accessed. (With
@@ -490,6 +504,15 @@ be carefully considered before using them:
could just store the signature alongside the file and verify it
itself using a cryptographic library, instead of using this feature.
+- Another approach is to utilize fs-verity builtin signature
+ verification in conjunction with the IPE LSM, which supports defining
+ a kernel-enforced, system-wide authentication policy that allows only
+ files with a verified fs-verity builtin signature to perform certain
+ operations, such as execution. Note that IPE doesn't require
+ fs.verity.require_signatures=1.
+ Please refer to :doc:`IPE admin guide </admin-guide/LSM/ipe>` for
+ more details.
+
- A file's builtin signature can only be set at the same time that
fs-verity is being enabled on the file. Changing or deleting the
builtin signature later requires re-creating the file.
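A sketch of enabling fs-verity on a file with the FS_IOC_ENABLE_VERITY ioctl,
without attaching a builtin signature; the path is a placeholder::

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/fsverity.h>

    int main(void)
    {
            struct fsverity_enable_arg arg;
            int fd = open("/some/file", O_RDONLY);

            memset(&arg, 0, sizeof(arg));
            arg.version = 1;
            arg.hash_algorithm = FS_VERITY_HASH_ALG_SHA256;
            arg.block_size = 4096;
            /* sig_ptr/sig_size would carry a PKCS#7 builtin signature */

            if (fd < 0 || ioctl(fd, FS_IOC_ENABLE_VERITY, &arg) < 0) {
                    perror("enable verity");
                    return 1;
            }
            return 0;
    }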
diff --git a/Documentation/filesystems/fuse-io-uring.rst b/Documentation/filesystems/fuse-io-uring.rst
new file mode 100644
index 000000000000..d73dd0dbd238
--- /dev/null
+++ b/Documentation/filesystems/fuse-io-uring.rst
@@ -0,0 +1,99 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=======================================
+FUSE-over-io-uring design documentation
+=======================================
+
+This documentation covers the basic details of how the fuse
+kernel/userspace communication through io-uring is configured
+and works. For generic details about FUSE see fuse.rst.
+
+This document also covers the current interface, which is
+still in development and might change.
+
+Limitations
+===========
+As of now not all request types are supported through io-uring; userspace
+is required to also handle requests through /dev/fuse after io-uring setup
+is complete - specifically notifications (initiated from the daemon side)
+and interrupts.
+
+Fuse io-uring configuration
+===========================
+
+Fuse kernel requests are queued through the classical /dev/fuse
+read/write interface - until io-uring setup is complete.
+
+In order to set up fuse-over-io-uring, the fuse server (user-space)
+needs to submit SQEs (opcode = IORING_OP_URING_CMD) to the /dev/fuse
+connection file descriptor. The initial submit uses the sub command
+FUSE_URING_REQ_REGISTER, which will just register entries to be
+available in the kernel.
+
+Once at least one entry per queue is submitted, the kernel starts
+to enqueue requests to the ring queues.
+Note that every CPU core has its own fuse-io-uring queue.
+Userspace handles the CQE/fuse-request and submits the result as
+subcommand FUSE_URING_REQ_COMMIT_AND_FETCH - the kernel completes
+the request and also marks the entry available again. If there are
+pending requests waiting, a request will be immediately submitted
+to the daemon again.
+
+Initial SQE
+-----------::
+
+ | | FUSE filesystem daemon
+ | |
+ | | >io_uring_submit()
+ | | IORING_OP_URING_CMD /
+ | | FUSE_URING_CMD_REGISTER
+ | | [wait cqe]
+ | | >io_uring_wait_cqe() or
+ | | >io_uring_submit_and_wait()
+ | |
+ | >fuse_uring_cmd() |
+ | >fuse_uring_register() |
+
+
+Sending requests with CQEs
+--------------------------::
+
+ | | FUSE filesystem daemon
+ | | [waiting for CQEs]
+ | "rm /mnt/fuse/file" |
+ | |
+ | >sys_unlink() |
+ | >fuse_unlink() |
+ | [allocate request] |
+ | >fuse_send_one() |
+ | ... |
+ | >fuse_uring_queue_fuse_req |
+ | [queue request on fg queue] |
+ | >fuse_uring_add_req_to_ring_ent() |
+ | ... |
+ | >fuse_uring_copy_to_ring() |
+ | >io_uring_cmd_done() |
+ | >request_wait_answer() |
+ | [sleep on req->waitq] |
+ | | [receives and handles CQE]
+ | | [submit result and fetch next]
+ | | >io_uring_submit()
+ | | IORING_OP_URING_CMD/
+ | | FUSE_URING_CMD_COMMIT_AND_FETCH
+ | >fuse_uring_cmd() |
+ | >fuse_uring_commit_fetch() |
+ | >fuse_uring_commit() |
+ | >fuse_uring_copy_from_ring() |
+ | [ copy the result to the fuse req] |
+ | >fuse_uring_req_end() |
+ | >fuse_request_end() |
+ | [wake up req->waitq] |
+ | >fuse_uring_next_fuse_req |
+ | [wait or handle next req] |
+ | |
+ | [req->waitq woken up] |
+ | <fuse_unlink() |
+ | <sys_unlink() |
+
+
+
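A very rough liburing sketch of the initial registration step shown in the
diagrams above. The subcommand constant is passed in as a parameter because
its exact name and header are not shown here; a real fuse server also needs
extended SQE/CQE sizes and a command payload, which this sketch omits::

    #include <string.h>
    #include <liburing.h>

    static int demo_register_one_entry(struct io_uring *ring, int fuse_dev_fd,
                                       unsigned int register_cmd_op)
    {
            struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

            if (!sqe)
                    return -1;

            memset(sqe, 0, sizeof(*sqe));
            sqe->opcode = IORING_OP_URING_CMD;   /* as in the diagram above */
            sqe->fd = fuse_dev_fd;               /* the /dev/fuse connection */
            sqe->cmd_op = register_cmd_op;       /* e.g. FUSE_URING_CMD_REGISTER */

            /* submit, then wait for CQEs with io_uring_wait_cqe() */
            return io_uring_submit(ring);
    }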
diff --git a/Documentation/filesystems/gfs2-glocks.rst b/Documentation/filesystems/gfs2-glocks.rst
index 8a5842929b60..adc0d4c4d979 100644
--- a/Documentation/filesystems/gfs2-glocks.rst
+++ b/Documentation/filesystems/gfs2-glocks.rst
@@ -40,14 +40,14 @@ shared lock mode, SH. In GFS2 the DF mode is used exclusively for direct I/O
operations. The glocks are basically a lock plus some routines which deal
with cache management. The following rules apply for the cache:
-========== ========== ============== ========== ==============
-Glock mode Cache data Cache Metadata Dirty Data Dirty Metadata
-========== ========== ============== ========== ==============
- UN No No No No
- SH Yes Yes No No
- DF No Yes No No
- EX Yes Yes Yes Yes
-========== ========== ============== ========== ==============
+========== ============== ========== ========== ==============
+Glock mode Cache Metadata Cache data Dirty Data Dirty Metadata
+========== ============== ========== ========== ==============
+ UN No No No No
+ DF Yes No No No
+ SH Yes Yes No No
+ EX Yes Yes Yes Yes
+========== ============== ========== ========== ==============
These rules are implemented using the various glock operations which
are defined for each type of glock. Not all types of glocks use
@@ -55,23 +55,22 @@ all the modes. Only inode glocks use the DF mode for example.
Table of glock operations and per type constants:
-============= =============================================================
+============== =============================================================
Field Purpose
-============= =============================================================
-go_xmote_th Called before remote state change (e.g. to sync dirty data)
+============== =============================================================
+go_sync Called before remote state change (e.g. to sync dirty data)
go_xmote_bh Called after remote state change (e.g. to refill cache)
go_inval Called if remote state change requires invalidating the cache
-go_demote_ok Returns boolean value of whether its ok to demote a glock
- (e.g. checks timeout, and that there is no cached data)
-go_lock Called for the first local holder of a lock
-go_unlock Called on the final local unlock of a lock
+go_instantiate Called when a glock has been acquired
+go_held Called every time a glock holder is acquired
go_dump Called to print content of object for debugfs file, or on
error to dump glock to the log.
-go_type The type of the glock, ``LM_TYPE_*``
go_callback Called if the DLM sends a callback to drop this lock
+go_unlocked Called when a glock is unlocked (dlm_unlock())
+go_type The type of the glock, ``LM_TYPE_*``
go_flags GLOF_ASPACE is set, if the glock has an address space
associated with it
-============= =============================================================
+============== =============================================================
The minimum hold time for each lock is the time after a remote lock
grant for which we ignore remote demote requests. This is in order to
@@ -82,26 +81,24 @@ to by multiple nodes. By delaying the demotion in response to a
remote callback, that gives the userspace program time to make
some progress before the pages are unmapped.
-There is a plan to try and remove the go_lock and go_unlock callbacks
-if possible, in order to try and speed up the fast path though the locking.
-Also, eventually we hope to make the glock "EX" mode locally shared
-such that any local locking will be done with the i_mutex as required
-rather than via the glock.
+Eventually, we hope to make the glock "EX" mode locally shared such that any
+local locking will be done with the i_mutex as required rather than via the
+glock.
Locking rules for glock operations:
-============= ====================== =============================
+============== ====================== =============================
Operation GLF_LOCK bit lock held gl_lockref.lock spinlock held
-============= ====================== =============================
-go_xmote_th Yes No
+============== ====================== =============================
+go_sync Yes No
go_xmote_bh Yes No
go_inval Yes No
-go_demote_ok Sometimes Yes
-go_lock Yes No
-go_unlock Yes No
+go_instantiate No No
+go_held No No
go_dump Sometimes Yes
go_callback Sometimes (N/A) Yes
-============= ====================== =============================
+go_unlocked Yes No
+============== ====================== =============================
.. Note::
diff --git a/Documentation/filesystems/idmappings.rst b/Documentation/filesystems/idmappings.rst
index ac0af679e61e..2a206129f828 100644
--- a/Documentation/filesystems/idmappings.rst
+++ b/Documentation/filesystems/idmappings.rst
@@ -63,8 +63,8 @@ what id ``k11000`` corresponds to in the second or third idmapping. The
straightforward algorithm to use is to apply the inverse of the first idmapping,
mapping ``k11000`` up to ``u1000``. Afterwards, we can map ``u1000`` down using
either the second idmapping mapping or third idmapping mapping. The second
-idmapping would map ``u1000`` down to ``21000``. The third idmapping would map
-``u1000`` down to ``u31000``.
+idmapping would map ``u1000`` down to ``k21000``. The third idmapping would map
+``u1000`` down to ``k31000``.
If we were given the same task for the following three idmappings::
@@ -821,7 +821,7 @@ the same idmapping to the mount. We now perform three steps:
/* Map the userspace id down into a kernel id in the filesystem's idmapping. */
make_kuid(u0:k20000:r10000, u1000) = k21000
-2. Verify that the caller's kernel ids can be mapped to userspace ids in the
+3. Verify that the caller's kernel ids can be mapped to userspace ids in the
filesystem's idmapping::
from_kuid(u0:k20000:r10000, k21000) = u1000
@@ -854,10 +854,10 @@ The same translation algorithm works with the third example.
/* Map the userspace id down into a kernel id in the filesystem's idmapping. */
make_kuid(u0:k0:r4294967295, u1000) = k1000
-2. Verify that the caller's kernel ids can be mapped to userspace ids in the
+3. Verify that the caller's kernel ids can be mapped to userspace ids in the
filesystem's idmapping::
- from_kuid(u0:k0:r4294967295, k21000) = u1000
+ from_kuid(u0:k0:r4294967295, k1000) = u1000
So the ownership that lands on disk will be ``u1000``.
@@ -994,7 +994,7 @@ from above:::
/* Map the userspace id down into a kernel id in the filesystem's idmapping. */
make_kuid(u0:k0:r4294967295, u1000) = k1000
-2. Verify that the caller's filesystem ids can be mapped to userspace ids in the
+3. Verify that the caller's filesystem ids can be mapped to userspace ids in the
filesystem's idmapping::
from_kuid(u0:k0:r4294967295, k1000) = u1000
diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst
index e18bc5ae3b35..2636f2a41bd3 100644
--- a/Documentation/filesystems/index.rst
+++ b/Documentation/filesystems/index.rst
@@ -29,11 +29,13 @@ algorithms work.
fiemap
files
locks
+ multigrain-ts
mount_api
quota
seq_file
sharedsubtree
idmappings
+ iomap/index
automount-support
@@ -50,6 +52,7 @@ filesystem implementations.
.. toctree::
:maxdepth: 2
+ buffer
journalling
fscrypt
fsverity
@@ -69,6 +72,7 @@ Documentation for filesystem implementations.
afs
autofs
autofs-mount-control
+ bcachefs/index
befs
bfs
btrfs
@@ -94,11 +98,11 @@ Documentation for filesystem implementations.
hpfs
fuse
fuse-io
+ fuse-io-uring
inotify
isofs
nilfs2
nfs/index
- ntfs
ntfs3
ocfs2
ocfs2-online-filecheck
diff --git a/Documentation/filesystems/iomap/design.rst b/Documentation/filesystems/iomap/design.rst
new file mode 100644
index 000000000000..b0d0188a095e
--- /dev/null
+++ b/Documentation/filesystems/iomap/design.rst
@@ -0,0 +1,441 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. _iomap_design:
+
+..
+ Dumb style notes to maintain the author's sanity:
+ Please try to start sentences on separate lines so that
+ sentence changes don't bleed colors in diff.
+ Heading decorations are documented in sphinx.rst.
+
+==============
+Library Design
+==============
+
+.. contents:: Table of Contents
+ :local:
+
+Introduction
+============
+
+iomap is a filesystem library for handling common file operations.
+The library has two layers:
+
+ 1. A lower layer that provides an iterator over ranges of file offsets.
+     This layer tries to obtain mappings of each file range to storage
+ from the filesystem, but the storage information is not necessarily
+ required.
+
+ 2. An upper layer that acts upon the space mappings provided by the
+ lower layer iterator.
+
+The iteration can involve mappings of a file's logical offset ranges to
+physical extents, but the storage layer information is not necessarily
+required, e.g. for walking cached file information.
+The library exports various APIs for implementing file operations such
+as:
+
+ * Pagecache reads and writes
+ * Folio write faults to the pagecache
+ * Writeback of dirty folios
+ * Direct I/O reads and writes
+ * fsdax I/O reads, writes, loads, and stores
+ * FIEMAP
+ * lseek ``SEEK_DATA`` and ``SEEK_HOLE``
+ * swapfile activation
+
+This library originated in the file I/O path that XFS once used; it
+has now been extended to cover several other operations.
+
+Who Should Read This?
+=====================
+
+The target audience for this document is filesystem, storage, and
+pagecache programmers and code reviewers.
+
+If you are working on PCI, machine architectures, or device drivers, you
+are most likely in the wrong place.
+
+How Is This Better?
+===================
+
+Unlike the classic Linux I/O model which breaks file I/O into small
+units (generally memory pages or blocks) and looks up space mappings on
+the basis of that unit, the iomap model asks the filesystem for the
+largest space mappings that it can create for a given file operation and
+initiates operations on that basis.
+This strategy improves the filesystem's visibility into the size of the
+operation being performed, which enables it to combat fragmentation with
+larger space allocations when possible.
+Larger space mappings improve runtime performance by amortizing the cost
+of mapping function calls into the filesystem across a larger amount of
+data.
+
+At a high level, an iomap operation `looks like this
+<https://lore.kernel.org/all/ZGbVaewzcCysclPt@dread.disaster.area/>`_:
+
+1. For each byte in the operation range...
+
+ 1. Obtain a space mapping via ``->iomap_begin``
+
+ 2. For each sub-unit of work...
+
+ 1. Revalidate the mapping and go back to (1) above, if necessary.
+ So far only the pagecache operations need to do this.
+
+ 2. Do the work
+
+ 3. Increment operation cursor
+
+ 4. Release the mapping via ``->iomap_end``, if necessary
+
+Each iomap operation will be covered in more detail below.
+This library was covered previously by an `LWN article
+<https://lwn.net/Articles/935934/>`_ and a `KernelNewbies page
+<https://kernelnewbies.org/KernelProjects/iomap>`_.
+
+The goal of this document is to provide a brief discussion of the
+design and capabilities of iomap, followed by a more detailed catalog
+of the interfaces presented by iomap.
+If you change iomap, please update this design document.
+
+File Range Iterator
+===================
+
+Definitions
+-----------
+
+ * **buffer head**: Shattered remnants of the old buffer cache.
+
+ * ``fsblock``: The block size of a file, also known as ``i_blocksize``.
+
+ * ``i_rwsem``: The VFS ``struct inode`` rwsemaphore.
+ Processes hold this in shared mode to read file state and contents.
+ Some filesystems may allow shared mode for writes.
+ Processes often hold this in exclusive mode to change file state and
+ contents.
+
+ * ``invalidate_lock``: The pagecache ``struct address_space``
+ rwsemaphore that protects against folio insertion and removal for
+ filesystems that support punching out folios below EOF.
+ Processes wishing to insert folios must hold this lock in shared
+ mode to prevent removal, though concurrent insertion is allowed.
+ Processes wishing to remove folios must hold this lock in exclusive
+ mode to prevent insertions.
+ Concurrent removals are not allowed.
+
+ * ``dax_read_lock``: The RCU read lock that dax takes to prevent a
+ device pre-shutdown hook from returning before other threads have
+ released resources.
+
+ * **filesystem mapping lock**: This synchronization primitive is
+ internal to the filesystem and must protect the file mapping data
+ from updates while a mapping is being sampled.
+ The filesystem author must determine how this coordination should
+ happen; it does not need to be an actual lock.
+
+ * **iomap internal operation lock**: This is a general term for
+ synchronization primitives that iomap functions take while holding a
+ mapping.
+ A specific example would be taking the folio lock while reading or
+ writing the pagecache.
+
+ * **pure overwrite**: A write operation that does not require any
+ metadata or zeroing operations to perform during either submission
+ or completion.
+ This implies that the filesystem must have already allocated space
+ on disk as ``IOMAP_MAPPED`` and the filesystem must not place any
+ constraints on IO alignment or size.
+ The only constraints on I/O alignment are device level (minimum I/O
+ size and alignment, typically sector size).
+
+``struct iomap``
+----------------
+
+The filesystem communicates to the iomap iterator the mapping of
+byte ranges of a file to byte ranges of a storage device with the
+structure below:
+
+.. code-block:: c
+
+ struct iomap {
+ u64 addr;
+ loff_t offset;
+ u64 length;
+ u16 type;
+ u16 flags;
+ struct block_device *bdev;
+ struct dax_device *dax_dev;
+ void *inline_data;
+ void *private;
+ const struct iomap_folio_ops *folio_ops;
+ u64 validity_cookie;
+ };
+
+The fields are as follows:
+
+ * ``offset`` and ``length`` describe the range of file offsets, in
+ bytes, covered by this mapping.
+ These fields must always be set by the filesystem.
+
+ * ``type`` describes the type of the space mapping:
+
+ * **IOMAP_HOLE**: No storage has been allocated.
+ This type must never be returned in response to an ``IOMAP_WRITE``
+ operation because writes must allocate and map space, and return
+ the mapping.
+ The ``addr`` field must be set to ``IOMAP_NULL_ADDR``.
+ iomap does not support writing (whether via pagecache or direct
+ I/O) to a hole.
+
+ * **IOMAP_DELALLOC**: A promise to allocate space at a later time
+ ("delayed allocation").
+     If the filesystem returns ``IOMAP_F_NEW`` here and the write fails, the
+ ``->iomap_end`` function must delete the reservation.
+ The ``addr`` field must be set to ``IOMAP_NULL_ADDR``.
+
+ * **IOMAP_MAPPED**: The file range maps to specific space on the
+ storage device.
+ The device is returned in ``bdev`` or ``dax_dev``.
+ The device address, in bytes, is returned via ``addr``.
+
+ * **IOMAP_UNWRITTEN**: The file range maps to specific space on the
+ storage device, but the space has not yet been initialized.
+ The device is returned in ``bdev`` or ``dax_dev``.
+ The device address, in bytes, is returned via ``addr``.
+ Reads from this type of mapping will return zeroes to the caller.
+ For a write or writeback operation, the ioend should update the
+ mapping to MAPPED.
+ Refer to the sections about ioends for more details.
+
+ * **IOMAP_INLINE**: The file range maps to the memory buffer
+ specified by ``inline_data``.
+     For a write operation, the ``->iomap_end`` function presumably
+ handles persisting the data.
+ The ``addr`` field must be set to ``IOMAP_NULL_ADDR``.
+
+ * ``flags`` describe the status of the space mapping.
+ These flags should be set by the filesystem in ``->iomap_begin``:
+
+ * **IOMAP_F_NEW**: The space under the mapping is newly allocated.
+ Areas that will not be written to must be zeroed.
+ If a write fails and the mapping is a space reservation, the
+ reservation must be deleted.
+
+ * **IOMAP_F_DIRTY**: The inode will have uncommitted metadata needed
+ to access any data written.
+ fdatasync is required to commit these changes to persistent
+ storage.
+ This needs to take into account metadata changes that *may* be made
+ at I/O completion, such as file size updates from direct I/O.
+
+ * **IOMAP_F_SHARED**: The space under the mapping is shared.
+ Copy on write is necessary to avoid corrupting other file data.
+
+ * **IOMAP_F_BUFFER_HEAD**: This mapping requires the use of buffer
+ heads for pagecache operations.
+ Do not add more uses of this.
+
+ * **IOMAP_F_MERGED**: Multiple contiguous block mappings were
+ coalesced into this single mapping.
+ This is only useful for FIEMAP.
+
+ * **IOMAP_F_XATTR**: The mapping is for extended attribute data, not
+ regular file data.
+ This is only useful for FIEMAP.
+
+ * **IOMAP_F_PRIVATE**: Starting with this value, the upper bits can
+ be set by the filesystem for its own purposes.
+
+ These flags can be set by iomap itself during file operations.
+ The filesystem should supply an ``->iomap_end`` function if it needs
+ to observe these flags:
+
+ * **IOMAP_F_SIZE_CHANGED**: The file size has changed as a result of
+ using this mapping.
+
+ * **IOMAP_F_STALE**: The mapping was found to be stale.
+ iomap will call ``->iomap_end`` on this mapping and then
+ ``->iomap_begin`` to obtain a new mapping.
+
+ Currently, these flags are only set by pagecache operations.
+
+ * ``addr`` describes the device address, in bytes.
+
+ * ``bdev`` describes the block device for this mapping.
+ This only needs to be set for mapped or unwritten operations.
+
+ * ``dax_dev`` describes the DAX device for this mapping.
+ This only needs to be set for mapped or unwritten operations, and
+ only for a fsdax operation.
+
+ * ``inline_data`` points to a memory buffer for I/O involving
+ ``IOMAP_INLINE`` mappings.
+ This value is ignored for all other mapping types.
+
+ * ``private`` is a pointer to `filesystem-private information
+ <https://lore.kernel.org/all/20180619164137.13720-7-hch@lst.de/>`_.
+ This value will be passed unchanged to ``->iomap_end``.
+
+ * ``folio_ops`` will be covered in the section on pagecache operations.
+
+ * ``validity_cookie`` is a magic freshness value set by the filesystem
+ that should be used to detect stale mappings.
+ For pagecache operations this is critical for correct operation
+ because page faults can occur, which implies that filesystem locks
+ should not be held between ``->iomap_begin`` and ``->iomap_end``.
+ Filesystems with completely static mappings need not set this value.
+ Only pagecache operations revalidate mappings; see the section about
+ ``iomap_valid`` for details.
+
+``struct iomap_ops``
+--------------------
+
+Every iomap function requires the filesystem to pass an operations
+structure to obtain a mapping and (optionally) to release the mapping:
+
+.. code-block:: c
+
+ struct iomap_ops {
+ int (*iomap_begin)(struct inode *inode, loff_t pos, loff_t length,
+ unsigned flags, struct iomap *iomap,
+ struct iomap *srcmap);
+
+ int (*iomap_end)(struct inode *inode, loff_t pos, loff_t length,
+ ssize_t written, unsigned flags,
+ struct iomap *iomap);
+ };
+
+``->iomap_begin``
+~~~~~~~~~~~~~~~~~
+
+iomap operations call ``->iomap_begin`` to obtain one file mapping for
+the range of bytes specified by ``pos`` and ``length`` for the file
+``inode``.
+This mapping should be returned through the ``iomap`` pointer.
+The mapping must cover at least the first byte of the supplied file
+range, but it does not need to cover the entire requested range.
+
+Each iomap operation describes the requested operation through the
+``flags`` argument.
+The exact value of ``flags`` will be documented in the
+operation-specific sections below.
+These flags can, at least in principle, apply generally to iomap
+operations:
+
+ * ``IOMAP_DIRECT`` is set when the caller wishes to issue file I/O to
+ block storage.
+
+ * ``IOMAP_DAX`` is set when the caller wishes to issue file I/O to
+ memory-like storage.
+
+ * ``IOMAP_NOWAIT`` is set when the caller wishes to perform a best
+ effort attempt to avoid any operation that would result in blocking
+ the submitting task.
+ This is similar in intent to ``O_NONBLOCK`` for network APIs - it is
+ intended for asynchronous applications to keep doing other work
+ instead of waiting for the specific unavailable filesystem resource
+ to become available.
+ Filesystems implementing ``IOMAP_NOWAIT`` semantics need to use
+ trylock algorithms.
+ They need to be able to satisfy the entire I/O request range with a
+ single iomap mapping.
+ They need to avoid reading or writing metadata synchronously.
+ They need to avoid blocking memory allocations.
+ They need to avoid waiting on transaction reservations to allow
+ modifications to take place.
+ They probably should not be allocating new space.
+ And so on.
+ If there is any doubt in the filesystem developer's mind as to
+ whether any specific ``IOMAP_NOWAIT`` operation may end up blocking,
+ then they should return ``-EAGAIN`` as early as possible rather than
+ start the operation and force the submitting task to block.
+ ``IOMAP_NOWAIT`` is often set on behalf of ``IOCB_NOWAIT`` or
+ ``RWF_NOWAIT``.
+
+If it is necessary to read existing file contents from a `different
+<https://lore.kernel.org/all/20191008071527.29304-9-hch@lst.de/>`_
+device or address range on a device, the filesystem should return that
+information via ``srcmap``.
+Only pagecache and fsdax operations support reading from one mapping and
+writing to another.
+
+``->iomap_end``
+~~~~~~~~~~~~~~~
+
+After the operation completes, the ``->iomap_end`` function, if present,
+is called to signal that iomap is finished with a mapping.
+Typically, implementations will use this function to tear down any
+context that was set up in ``->iomap_begin``.
+For example, a write might wish to commit the reservations for the bytes
+that were operated upon and unreserve any space that was not operated
+upon.
+``written`` might be zero if no bytes were touched.
+``flags`` will contain the same value passed to ``->iomap_begin``.
+iomap ops for reads are not likely to need to supply this function.
+
+Both functions should return a negative errno code on error, or zero on
+success.
+
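+To make the calling convention concrete, below is a minimal sketch of a
+read-oriented ``->iomap_begin``.
+The ``xxx_`` prefix, ``struct xxx_extent``, and ``xxx_lookup_extent`` are
+hypothetical placeholders for a filesystem's own extent lookup, not part
+of the iomap API; a write mapping function would also have to allocate
+space rather than return a hole.
+
+.. code-block:: c
+
+ /* Sketch only: struct xxx_extent and xxx_lookup_extent are hypothetical. */
+ static int xxx_iomap_begin(struct inode *inode, loff_t pos, loff_t length,
+                            unsigned flags, struct iomap *iomap,
+                            struct iomap *srcmap)
+ {
+         struct xxx_extent ext;
+         int error;
+
+         /* Sample the mapping under the filesystem's mapping lock. */
+         error = xxx_lookup_extent(inode, pos, length, &ext);
+         if (error)
+                 return error;
+
+         iomap->offset = ext.file_offset;
+         iomap->length = ext.len;
+         if (ext.is_hole) {
+                 iomap->type = IOMAP_HOLE;
+                 iomap->addr = IOMAP_NULL_ADDR;
+         } else {
+                 iomap->type = ext.unwritten ? IOMAP_UNWRITTEN : IOMAP_MAPPED;
+                 iomap->addr = ext.disk_addr;    /* device address in bytes */
+                 iomap->bdev = inode->i_sb->s_bdev;
+         }
+         return 0;
+ }
+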
+Preparing for File Operations
+=============================
+
+iomap only handles mapping and I/O.
+Filesystems must still call out to the VFS to check input parameters
+and file state before initiating an I/O operation.
+It does not handle obtaining filesystem freeze protection, updating of
+timestamps, stripping privileges, or access control.
+
+Locking Hierarchy
+=================
+
+iomap requires that filesystems supply their own locking model.
+There are three categories of synchronization primitives, as far as
+iomap is concerned:
+
+ * The **upper** level primitive is provided by the filesystem to
+ coordinate access to different iomap operations.
+ The exact primitive is specific to the filesystem and operation,
+ but is often a VFS inode, pagecache invalidation, or folio lock.
+ For example, a filesystem might take ``i_rwsem`` before calling
+ ``iomap_file_buffered_write`` and ``iomap_file_unshare`` to prevent
+ these two file operations from clobbering each other.
+ Pagecache writeback may lock a folio to prevent other threads from
+ accessing the folio until writeback is underway.
+
+ * The **lower** level primitive is taken by the filesystem in the
+ ``->iomap_begin`` and ``->iomap_end`` functions to coordinate
+ access to the file space mapping information.
+ The fields of the iomap object should be filled out while holding
+ this primitive.
+ The upper level synchronization primitive, if any, remains held
+ while acquiring the lower level synchronization primitive.
+ For example, XFS takes ``ILOCK_EXCL`` and ext4 takes ``i_data_sem``
+ while sampling mappings.
+ Filesystems with immutable mapping information may not require
+ synchronization here.
+
+ * The **operation** primitive is taken by an iomap operation to
+ coordinate access to its own internal data structures.
+ The upper level synchronization primitive, if any, remains held
+ while acquiring this primitive.
+ The lower level primitive is not held while acquiring this
+ primitive.
+ For example, pagecache write operations will obtain a file mapping,
+ then grab and lock a folio to copy new contents.
+ It may also lock an internal folio state object to update metadata.
+
+The exact locking requirements are specific to the filesystem; for
+certain operations, some of these locks can be elided.
+All further mentions of locking are *recommendations*, not mandates.
+Each filesystem author must figure out the locking for themselves.
+
+Bugs and Limitations
+====================
+
+ * No support for fscrypt.
+ * No support for compression.
+ * No support for fsverity yet.
+ * Strong assumptions that IO should work the way it does on XFS.
+ * Does iomap *actually* work for non-regular file data?
+
+Patches welcome!
diff --git a/Documentation/filesystems/iomap/index.rst b/Documentation/filesystems/iomap/index.rst
new file mode 100644
index 000000000000..3c6a52440250
--- /dev/null
+++ b/Documentation/filesystems/iomap/index.rst
@@ -0,0 +1,13 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=======================
+VFS iomap Documentation
+=======================
+
+.. toctree::
+ :maxdepth: 2
+ :numbered:
+
+ design
+ operations
+ porting
diff --git a/Documentation/filesystems/iomap/operations.rst b/Documentation/filesystems/iomap/operations.rst
new file mode 100644
index 000000000000..2c7f5df9d8b0
--- /dev/null
+++ b/Documentation/filesystems/iomap/operations.rst
@@ -0,0 +1,728 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. _iomap_operations:
+
+..
+ Dumb style notes to maintain the author's sanity:
+ Please try to start sentences on separate lines so that
+ sentence changes don't bleed colors in diff.
+ Heading decorations are documented in sphinx.rst.
+
+=========================
+Supported File Operations
+=========================
+
+.. contents:: Table of Contents
+ :local:
+
+Below is a discussion of the high-level file operations that iomap
+implements.
+
+Buffered I/O
+============
+
+Buffered I/O is the default file I/O path in Linux.
+File contents are cached in memory ("pagecache") to satisfy reads and
+writes.
+Dirty cache will be written back to disk at some point that can be
+forced via ``fsync`` and variants.
+
+iomap implements nearly all the folio and pagecache management that
+filesystems have to implement themselves under the legacy I/O model.
+This means that the filesystem need not know the details of allocating,
+mapping, managing uptodate and dirty state, or writeback of pagecache
+folios.
+Under the legacy I/O model, this was managed very inefficiently with
+linked lists of buffer heads instead of the per-folio bitmaps that iomap
+uses.
+Unless the filesystem explicitly opts in to buffer heads, they will not
+be used, which makes buffered I/O much more efficient, and the pagecache
+maintainer much happier.
+
+``struct address_space_operations``
+-----------------------------------
+
+The following iomap functions can be referenced directly from the
+address space operations structure:
+
+ * ``iomap_dirty_folio``
+ * ``iomap_release_folio``
+ * ``iomap_invalidate_folio``
+ * ``iomap_is_partially_uptodate``
+
+The following address space operations can be wrapped easily:
+
+ * ``read_folio``
+ * ``readahead``
+ * ``writepages``
+ * ``bmap``
+ * ``swap_activate``
+
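+As an illustration, the buffered I/O wiring often ends up looking like
+the sketch below.
+All ``xxx_`` symbols (including ``xxx_read_iomap_ops`` and
+``xxx_writeback_ops``) are hypothetical placeholders for a filesystem's
+own ops structures, not part of the iomap API.
+
+.. code-block:: c
+
+ /* Sketch only: all xxx_* symbols are hypothetical. */
+ static int xxx_read_folio(struct file *file, struct folio *folio)
+ {
+         return iomap_read_folio(folio, &xxx_read_iomap_ops);
+ }
+
+ static void xxx_readahead(struct readahead_control *rac)
+ {
+         iomap_readahead(rac, &xxx_read_iomap_ops);
+ }
+
+ static int xxx_writepages(struct address_space *mapping,
+                           struct writeback_control *wbc)
+ {
+         struct iomap_writepage_ctx wpc = { };
+
+         return iomap_writepages(mapping, wbc, &wpc, &xxx_writeback_ops);
+ }
+
+ static const struct address_space_operations xxx_aops = {
+         .read_folio             = xxx_read_folio,
+         .readahead              = xxx_readahead,
+         .writepages             = xxx_writepages,
+         .dirty_folio            = iomap_dirty_folio,
+         .release_folio          = iomap_release_folio,
+         .invalidate_folio       = iomap_invalidate_folio,
+         .is_partially_uptodate  = iomap_is_partially_uptodate,
+ };
+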
+``struct iomap_folio_ops``
+--------------------------
+
+The ``->iomap_begin`` function for pagecache operations may set the
+``struct iomap::folio_ops`` field to an ops structure to override
+default behaviors of iomap:
+
+.. code-block:: c
+
+ struct iomap_folio_ops {
+ struct folio *(*get_folio)(struct iomap_iter *iter, loff_t pos,
+ unsigned len);
+ void (*put_folio)(struct inode *inode, loff_t pos, unsigned copied,
+ struct folio *folio);
+ bool (*iomap_valid)(struct inode *inode, const struct iomap *iomap);
+ };
+
+iomap calls these functions:
+
+ - ``get_folio``: Called to allocate and return an active reference to
+ a locked folio prior to starting a write.
+ If this function is not provided, iomap will call
+ ``iomap_get_folio``.
+ This could be used to `set up per-folio filesystem state
+ <https://lore.kernel.org/all/20190429220934.10415-5-agruenba@redhat.com/>`_
+ for a write.
+
+ - ``put_folio``: Called to unlock and put a folio after a pagecache
+ operation completes.
+ If this function is not provided, iomap will ``folio_unlock`` and
+ ``folio_put`` on its own.
+ This could be used to `commit per-folio filesystem state
+ <https://lore.kernel.org/all/20180619164137.13720-6-hch@lst.de/>`_
+ that was set up by ``->get_folio``.
+
+ - ``iomap_valid``: The filesystem may not hold locks between
+ ``->iomap_begin`` and ``->iomap_end`` because pagecache operations
+ can take folio locks, fault on userspace pages, initiate writeback
+ for memory reclamation, or engage in other time-consuming actions.
+ If a file's space mapping data are mutable, it is possible that the
+ mapping for a particular pagecache folio can `change in the time it
+ takes
+ <https://lore.kernel.org/all/20221123055812.747923-8-david@fromorbit.com/>`_
+ to allocate, install, and lock that folio.
+
+ For the pagecache, races can happen if writeback doesn't take
+ ``i_rwsem`` or ``invalidate_lock`` and updates mapping information.
+ Races can also happen if the filesystem allows concurrent writes.
+ For such files, the mapping *must* be revalidated after the folio
+ lock has been taken so that iomap can manage the folio correctly.
+
+ fsdax does not need this revalidation because there's no writeback
+ and no support for unwritten extents.
+
+ Filesystems subject to this kind of race must provide a
+ ``->iomap_valid`` function to decide if the mapping is still valid.
+ If the mapping is not valid, the mapping will be sampled again.
+
+ To support making the validity decision, the filesystem's
+ ``->iomap_begin`` function may set ``struct iomap::validity_cookie``
+ at the same time that it populates the other iomap fields.
+ A simple validation cookie implementation is a sequence counter.
+ If the filesystem bumps the sequence counter every time it modifies
+ the inode's extent map, it can be placed in the ``struct
+ iomap::validity_cookie`` during ``->iomap_begin``.
+   If the value in the cookie is found to be different from the value
+   the filesystem holds when the mapping is passed back to
+   ``->iomap_valid``, then the iomap should be considered stale and the
+   validation has failed; see the sketch after this list.
+
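+Here is a minimal sketch of a sequence-counter based ``->iomap_valid``.
+``XXX_I()`` and ``i_map_seq`` are hypothetical; the only assumption is
+that the filesystem bumps the counter whenever the extent map changes.
+
+.. code-block:: c
+
+ /* Sketch only: XXX_I() and i_map_seq are hypothetical. */
+ static bool xxx_iomap_valid(struct inode *inode, const struct iomap *iomap)
+ {
+         return iomap->validity_cookie == READ_ONCE(XXX_I(inode)->i_map_seq);
+ }
+
+ static const struct iomap_folio_ops xxx_iomap_folio_ops = {
+         .iomap_valid    = xxx_iomap_valid,
+ };
+
+The matching ``->iomap_begin`` would sample the counter into
+``iomap->validity_cookie`` and point ``iomap->folio_ops`` at this
+structure while filling out the rest of the mapping.
+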
+These ``struct kiocb`` flags are significant for buffered I/O with iomap:
+
+ * ``IOCB_NOWAIT``: Turns on ``IOMAP_NOWAIT``.
+
+Internal per-Folio State
+------------------------
+
+If the fsblock size matches the size of a pagecache folio, it is assumed
+that all disk I/O operations will operate on the entire folio.
+The uptodate (memory contents are at least as new as what's on disk) and
+dirty (memory contents are newer than what's on disk) status of the
+folio are all that's needed for this case.
+
+If the fsblock size is less than the size of a pagecache folio, iomap
+tracks the per-fsblock uptodate and dirty state itself.
+This enables iomap to handle both "bs < ps" `filesystems
+<https://lore.kernel.org/all/20230725122932.144426-1-ritesh.list@gmail.com/>`_
+and large folios in the pagecache.
+
+iomap internally tracks two state bits per fsblock:
+
+ * ``uptodate``: iomap will try to keep folios fully up to date.
+ If there are read(ahead) errors, those fsblocks will not be marked
+ uptodate.
+ The folio itself will be marked uptodate when all fsblocks within the
+ folio are uptodate.
+
+ * ``dirty``: iomap will set the per-block dirty state when programs
+ write to the file.
+ The folio itself will be marked dirty when any fsblock within the
+ folio is dirty.
+
+iomap also tracks the number of read and write disk I/Os that are in
+flight.
+This structure is much lighter weight than ``struct buffer_head``
+because there is only one per folio, and the per-fsblock overhead is two
+bits vs. 104 bytes.
+
+Filesystems wishing to turn on large folios in the pagecache should call
+``mapping_set_large_folios`` when initializing the incore inode.
+
+Buffered Readahead and Reads
+----------------------------
+
+The ``iomap_readahead`` function initiates readahead to the pagecache.
+The ``iomap_read_folio`` function reads one folio's worth of data into
+the pagecache.
+The ``flags`` argument to ``->iomap_begin`` will be set to zero.
+The pagecache takes whatever locks it needs before calling the
+filesystem.
+
+Buffered Writes
+---------------
+
+The ``iomap_file_buffered_write`` function writes an ``iocb`` to the
+pagecache.
+``IOMAP_WRITE`` or ``IOMAP_WRITE`` | ``IOMAP_NOWAIT`` will be passed as
+the ``flags`` argument to ``->iomap_begin``.
+Callers commonly take ``i_rwsem`` in either shared or exclusive mode
+before calling this function.
+
+mmap Write Faults
+~~~~~~~~~~~~~~~~~
+
+The ``iomap_page_mkwrite`` function handles a write fault to a folio in
+the pagecache.
+``IOMAP_WRITE | IOMAP_FAULT`` will be passed as the ``flags`` argument
+to ``->iomap_begin``.
+Callers commonly take the mmap ``invalidate_lock`` in shared or
+exclusive mode before calling this function.
+
+Buffered Write Failures
+~~~~~~~~~~~~~~~~~~~~~~~
+
+After a short write to the pagecache, the areas not written will not
+become marked dirty.
+The filesystem must arrange to `cancel
+<https://lore.kernel.org/all/20221123055812.747923-6-david@fromorbit.com/>`_
+such `reservations
+<https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@dread.disaster.area/>`_
+because writeback will not consume the reservation.
+The ``iomap_write_delalloc_release`` function can be called from a
+``->iomap_end`` function to find all the clean areas of the folios
+caching a fresh (``IOMAP_F_NEW``) delalloc mapping.
+It takes the ``invalidate_lock``.
+
+The filesystem must supply a function ``punch`` to be called for
+each file range in this state.
+This function must *only* remove delayed allocation reservations, in
+case another thread racing with the current thread writes successfully
+to the same region and triggers writeback to flush the dirty data out to
+disk.
+
+Zeroing for File Operations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Filesystems can call ``iomap_zero_range`` to perform zeroing of the
+pagecache for non-truncation file operations that are not aligned to
+the fsblock size.
+``IOMAP_ZERO`` will be passed as the ``flags`` argument to
+``->iomap_begin``.
+Callers typically hold ``i_rwsem`` and ``invalidate_lock`` in exclusive
+mode before calling this function.
+
+Unsharing Reflinked File Data
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Filesystems can call ``iomap_file_unshare`` to force a file sharing
+storage with another file to preemptively copy the shared data to newly
+allocated storage.
+``IOMAP_WRITE | IOMAP_UNSHARE`` will be passed as the ``flags`` argument
+to ``->iomap_begin``.
+Callers typically hold ``i_rwsem`` and ``invalidate_lock`` in exclusive
+mode before calling this function.
+
+Truncation
+----------
+
+Filesystems can call ``iomap_truncate_page`` to zero the bytes in the
+pagecache from EOF to the end of the fsblock during a file truncation
+operation.
+``truncate_setsize`` or ``truncate_pagecache`` will take care of
+everything after the EOF block.
+``IOMAP_ZERO`` will be passed as the ``flags`` argument to
+``->iomap_begin``.
+Callers typically hold ``i_rwsem`` and ``invalidate_lock`` in exclusive
+mode before calling this function.
+
+Pagecache Writeback
+-------------------
+
+Filesystems can call ``iomap_writepages`` to respond to a request to
+write dirty pagecache folios to disk.
+The ``mapping`` and ``wbc`` parameters should be passed unchanged.
+The ``wpc`` pointer should be allocated by the filesystem and must
+be initialized to zero.
+
+The pagecache will lock each folio before trying to schedule it for
+writeback.
+It does not lock ``i_rwsem`` or ``invalidate_lock``.
+
+The dirty bit will be cleared for all folios run through the
+``->map_blocks`` machinery described below even if the writeback fails.
+This is to prevent dirty folio clots when storage devices fail; an
+``-EIO`` is recorded for userspace to collect via ``fsync``.
+
+The ``ops`` structure must be specified and is as follows:
+
+``struct iomap_writeback_ops``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+ struct iomap_writeback_ops {
+ int (*map_blocks)(struct iomap_writepage_ctx *wpc, struct inode *inode,
+ loff_t offset, unsigned len);
+ int (*prepare_ioend)(struct iomap_ioend *ioend, int status);
+ void (*discard_folio)(struct folio *folio, loff_t pos);
+ };
+
+The fields are as follows:
+
+ - ``map_blocks``: Sets ``wpc->iomap`` to the space mapping of the file
+ range (in bytes) given by ``offset`` and ``len``.
+ iomap calls this function for each dirty fs block in each dirty folio,
+ though it will `reuse mappings
+ <https://lore.kernel.org/all/20231207072710.176093-15-hch@lst.de/>`_
+ for runs of contiguous dirty fsblocks within a folio.
+ Do not return ``IOMAP_INLINE`` mappings here; the ``->iomap_end``
+ function must deal with persisting written data.
+ Do not return ``IOMAP_DELALLOC`` mappings here; iomap currently
+ requires mapping to allocated space.
+ Filesystems can skip a potentially expensive mapping lookup if the
+ mappings have not changed.
+ This revalidation must be open-coded by the filesystem; it is
+ unclear if ``iomap::validity_cookie`` can be reused for this
+ purpose.
+   This function must be supplied by the filesystem; a sketch follows
+   this list.
+
+ - ``prepare_ioend``: Enables filesystems to transform the writeback
+ ioend or perform any other preparatory work before the writeback I/O
+ is submitted.
+ This might include pre-write space accounting updates, or installing
+ a custom ``->bi_end_io`` function for internal purposes, such as
+ deferring the ioend completion to a workqueue to run metadata update
+ transactions from process context.
+ This function is optional.
+
+ - ``discard_folio``: iomap calls this function after ``->map_blocks``
+ fails to schedule I/O for any part of a dirty folio.
+ The function should throw away any reservations that may have been
+ made for the write.
+ The folio will be marked clean and an ``-EIO`` recorded in the
+ pagecache.
+ Filesystems can use this callback to `remove
+ <https://lore.kernel.org/all/20201029163313.1766967-1-bfoster@redhat.com/>`_
+ delalloc reservations to avoid having delalloc reservations for
+ clean pagecache.
+ This function is optional.
+
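+Here is a hedged sketch of a ``->map_blocks`` implementation that reuses
+a still-valid mapping; ``xxx_mapping_valid`` and ``xxx_map_dirty_range``
+are hypothetical stand-ins for filesystem-specific revalidation and
+lookup code.
+
+.. code-block:: c
+
+ /* Sketch only: the xxx_* helpers are hypothetical. */
+ static int xxx_map_blocks(struct iomap_writepage_ctx *wpc,
+                           struct inode *inode, loff_t offset, unsigned len)
+ {
+         /* Reuse the previous mapping if it still covers this range. */
+         if (offset >= wpc->iomap.offset &&
+             offset < wpc->iomap.offset + wpc->iomap.length &&
+             xxx_mapping_valid(inode, &wpc->iomap))
+                 return 0;
+
+         /* Otherwise sample a new mapping for the dirty range. */
+         return xxx_map_dirty_range(inode, offset, len, &wpc->iomap);
+ }
+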
+Pagecache Writeback Completion
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+To handle the bookkeeping that must happen after disk I/O for writeback
+completes, iomap creates chains of ``struct iomap_ioend`` objects that
+wrap the ``bio`` that is used to write pagecache data to disk.
+By default, iomap finishes writeback ioends by clearing the writeback
+bit on the folios attached to the ``ioend``.
+If the write failed, it will also set the error bits on the folios and
+the address space.
+This can happen in interrupt or process context, depending on the
+storage device.
+
+Filesystems that need to update internal bookkeeping (e.g. unwritten
+extent conversions) should provide a ``->prepare_ioend`` function to
+set ``struct iomap_ioend::bio::bi_end_io`` to its own function.
+This function should call ``iomap_finish_ioends`` after finishing its
+own work (e.g. unwritten extent conversion).
+
+Some filesystems may wish to `amortize the cost of running metadata
+transactions
+<https://lore.kernel.org/all/20220120034733.221737-1-david@fromorbit.com/>`_
+for post-writeback updates by batching them.
+They may also require transactions to run from process context, which
+implies punting batches to a workqueue.
+iomap ioends contain a ``list_head`` to enable batching.
+
+Given a batch of ioends, iomap has a few helpers to assist with
+amortization:
+
+ * ``iomap_sort_ioends``: Sort all the ioends in the list by file
+ offset.
+
+ * ``iomap_ioend_try_merge``: Given an ioend that is not in any list and
+ a separate list of sorted ioends, merge as many of the ioends from
+ the head of the list into the given ioend.
+ ioends can only be merged if the file range and storage addresses are
+ contiguous; the unwritten and shared status are the same; and the
+ write I/O outcome is the same.
+ The merged ioends become their own list.
+
+ * ``iomap_finish_ioends``: Finish an ioend that possibly has other
+ ioends linked to it.
+
+Direct I/O
+==========
+
+In Linux, direct I/O is defined as file I/O that is issued directly to
+storage, bypassing the pagecache.
+The ``iomap_dio_rw`` function implements O_DIRECT (direct I/O) reads and
+writes for files.
+
+.. code-block:: c
+
+ ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
+ const struct iomap_ops *ops,
+ const struct iomap_dio_ops *dops,
+ unsigned int dio_flags, void *private,
+ size_t done_before);
+
+The filesystem can provide the ``dops`` parameter if it needs to perform
+extra work before or after the I/O is issued to storage.
+The ``done_before`` parameter tells iomap how much of the request has
+already been transferred.
+It is used to continue a request asynchronously when `part of the
+request
+<https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c03098d4b9ad76bca2966a8769dcfe59f7f85103>`_
+has already been completed synchronously.
+
+The ``done_before`` parameter should be set if writes for the ``iocb``
+have been initiated prior to the call.
+The direction of the I/O is determined from the ``iocb`` passed in.
+
+The ``dio_flags`` argument can be set to any combination of the
+following values:
+
+ * ``IOMAP_DIO_FORCE_WAIT``: Wait for the I/O to complete even if the
+ kiocb is not synchronous.
+
+ * ``IOMAP_DIO_OVERWRITE_ONLY``: Perform a pure overwrite for this range
+ or fail with ``-EAGAIN``.
+ This can be used by filesystems with complex unaligned I/O
+ write paths to provide an optimised fast path for unaligned writes.
+ If a pure overwrite can be performed, then serialisation against
+ other I/Os to the same filesystem block(s) is unnecessary as there is
+ no risk of stale data exposure or data loss.
+ If a pure overwrite cannot be performed, then the filesystem can
+ perform the serialisation steps needed to provide exclusive access
+ to the unaligned I/O range so that it can perform allocation and
+ sub-block zeroing safely.
+ Filesystems can use this flag to try to reduce locking contention,
+ but a lot of `detailed checking
+ <https://lore.kernel.org/linux-ext4/20230314130759.642710-1-bfoster@redhat.com/>`_
+ is required to do it `correctly
+ <https://lore.kernel.org/linux-ext4/20230810165559.946222-1-bfoster@redhat.com/>`_.
+
+ * ``IOMAP_DIO_PARTIAL``: If a page fault occurs, return whatever
+ progress has already been made.
+ The caller may deal with the page fault and retry the operation.
+ If the caller decides to retry the operation, it should pass the
+ accumulated return values of all previous calls as the
+ ``done_before`` parameter to the next call.
+
+These ``struct kiocb`` flags are significant for direct I/O with iomap:
+
+ * ``IOCB_NOWAIT``: Turns on ``IOMAP_NOWAIT``.
+
+ * ``IOCB_SYNC``: Ensure that the device has persisted data to disk
+ before completing the call.
+ In the case of pure overwrites, the I/O may be issued with FUA
+ enabled.
+
+ * ``IOCB_HIPRI``: Poll for I/O completion instead of waiting for an
+ interrupt.
+ Only meaningful for asynchronous I/O, and only if the entire I/O can
+ be issued as a single ``struct bio``.
+
+ * ``IOCB_DIO_CALLER_COMP``: Try to run I/O completion from the caller's
+ process context.
+ See ``linux/fs.h`` for more details.
+
+Filesystems should call ``iomap_dio_rw`` from ``->read_iter`` and
+``->write_iter``, and set ``FMODE_CAN_ODIRECT`` in the ``->open``
+function for the file.
+They should not set ``->direct_IO``, which is deprecated.
+
+If a filesystem wishes to perform its own work before direct I/O
+completion, it should call ``__iomap_dio_rw``.
+If its return value is not an error pointer or a NULL pointer, the
+filesystem should pass the return value to ``iomap_dio_complete`` after
+finishing its internal work.
+
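+For illustration, a ``->read_iter`` built on ``iomap_dio_rw`` might look
+like the sketch below.
+``xxx_read_iomap_ops`` is hypothetical, and fallback to buffered I/O on
+``-ENOTBLK`` is omitted for brevity.
+
+.. code-block:: c
+
+ /* Sketch only: xxx_read_iomap_ops is hypothetical. */
+ static ssize_t xxx_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
+ {
+         struct inode *inode = file_inode(iocb->ki_filp);
+         ssize_t ret;
+
+         if (!(iocb->ki_flags & IOCB_DIRECT))
+                 return generic_file_read_iter(iocb, to);
+
+         /* Upper level lock; trylock under IOCB_NOWAIT. */
+         if (iocb->ki_flags & IOCB_NOWAIT) {
+                 if (!inode_trylock_shared(inode))
+                         return -EAGAIN;
+         } else {
+                 inode_lock_shared(inode);
+         }
+         ret = iomap_dio_rw(iocb, to, &xxx_read_iomap_ops, NULL, 0, NULL, 0);
+         inode_unlock_shared(inode);
+         return ret;
+ }
+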
+Return Values
+-------------
+
+``iomap_dio_rw`` can return one of the following:
+
+ * A non-negative number of bytes transferred.
+
+ * ``-ENOTBLK``: Fall back to buffered I/O.
+ iomap itself will return this value if it cannot invalidate the page
+ cache before issuing the I/O to storage.
+ The ``->iomap_begin`` or ``->iomap_end`` functions may also return
+ this value.
+
+ * ``-EIOCBQUEUED``: The asynchronous direct I/O request has been
+ queued and will be completed separately.
+
+ * Any of the other negative error codes.
+
+Direct Reads
+------------
+
+A direct I/O read initiates a read I/O from the storage device to the
+caller's buffer.
+Dirty parts of the pagecache are flushed to storage before initiating
+the read I/O.
+The ``flags`` value for ``->iomap_begin`` will be ``IOMAP_DIRECT`` with
+any combination of the following enhancements:
+
+ * ``IOMAP_NOWAIT``, as defined previously.
+
+Callers commonly hold ``i_rwsem`` in shared mode before calling this
+function.
+
+Direct Writes
+-------------
+
+A direct I/O write initiates a write I/O to the storage device from the
+caller's buffer.
+Dirty parts of the pagecache are flushed to storage before initiating
+the write I/O.
+The pagecache is invalidated both before and after the write I/O.
+The ``flags`` value for ``->iomap_begin`` will be ``IOMAP_DIRECT |
+IOMAP_WRITE`` with any combination of the following enhancements:
+
+ * ``IOMAP_NOWAIT``, as defined previously.
+
+ * ``IOMAP_OVERWRITE_ONLY``: Allocating blocks and zeroing partial
+ blocks is not allowed.
+ The entire file range must map to a single written or unwritten
+ extent.
+ The file I/O range must be aligned to the filesystem block size
+ if the mapping is unwritten and the filesystem cannot handle zeroing
+ the unaligned regions without exposing stale contents.
+
+ * ``IOMAP_ATOMIC``: This write is being issued with torn-write
+ protection.
+ Only a single bio can be created for the write, and the write must
+ not be split into multiple I/O requests, i.e. flag REQ_ATOMIC must be
+ set.
+ The file range to write must be aligned to satisfy the requirements
+ of both the filesystem and the underlying block device's atomic
+ commit capabilities.
+ If filesystem metadata updates are required (e.g. unwritten extent
+ conversion or copy on write), all updates for the entire file range
+ must be committed atomically as well.
+ Only one space mapping is allowed per untorn write.
+ Untorn writes must be aligned to, and must not be longer than, a
+ single file block.
+
+Callers commonly hold ``i_rwsem`` in shared or exclusive mode before
+calling this function.
+
+``struct iomap_dio_ops``
+-------------------------
+
+.. code-block:: c
+
+ struct iomap_dio_ops {
+ void (*submit_io)(const struct iomap_iter *iter, struct bio *bio,
+ loff_t file_offset);
+ int (*end_io)(struct kiocb *iocb, ssize_t size, int error,
+ unsigned flags);
+ struct bio_set *bio_set;
+ };
+
+The fields of this structure are as follows:
+
+ - ``submit_io``: iomap calls this function when it has constructed a
+ ``struct bio`` object for the I/O requested, and wishes to submit it
+ to the block device.
+ If no function is provided, ``submit_bio`` will be called directly.
+   Filesystems that would like to perform additional work before
+   submission (e.g. data replication for btrfs) should implement this
+   function.
+
+ - ``end_io``: This is called after the ``struct bio`` completes.
+ This function should perform post-write conversions of unwritten
+ extent mappings, handle write failures, etc.
+ The ``flags`` argument may be set to a combination of the following:
+
+ * ``IOMAP_DIO_UNWRITTEN``: The mapping was unwritten, so the ioend
+ should mark the extent as written.
+
+ * ``IOMAP_DIO_COW``: Writing to the space in the mapping required a
+ copy on write operation, so the ioend should switch mappings.
+
+ - ``bio_set``: This allows the filesystem to provide a custom bio_set
+ for allocating direct I/O bios.
+ This enables filesystems to `stash additional per-bio information
+ <https://lore.kernel.org/all/20220505201115.937837-3-hch@lst.de/>`_
+ for private use.
+ If this field is NULL, generic ``struct bio`` objects will be used.
+
+Filesystems that want to perform extra work after an I/O completion
+should set a custom ``->bi_end_io`` function via ``->submit_io``.
+Afterwards, the custom endio function must call
+``iomap_dio_bio_end_io`` to finish the direct I/O.
+
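+For example, a filesystem that needs to convert unwritten extents after
+a successful direct write might use an ``->end_io`` along the lines of
+the sketch below; ``xxx_convert_unwritten`` is a hypothetical helper.
+
+.. code-block:: c
+
+ /* Sketch only: xxx_convert_unwritten is hypothetical. */
+ static int xxx_dio_write_end_io(struct kiocb *iocb, ssize_t size, int error,
+                                 unsigned flags)
+ {
+         struct inode *inode = file_inode(iocb->ki_filp);
+
+         if (error || !size)
+                 return error;
+
+         /* ki_pos has not yet been advanced past this I/O at ->end_io time. */
+         if (flags & IOMAP_DIO_UNWRITTEN)
+                 return xxx_convert_unwritten(inode, iocb->ki_pos, size);
+         return 0;
+ }
+
+ static const struct iomap_dio_ops xxx_dio_write_ops = {
+         .end_io         = xxx_dio_write_end_io,
+ };
+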
+DAX I/O
+=======
+
+Some storage devices can be directly mapped as memory.
+These devices support a new access mode known as "fsdax" that allows
+loads and stores through the CPU and memory controller.
+
+fsdax Reads
+-----------
+
+A fsdax read performs a memcpy from the storage device to the caller's
+buffer.
+The ``flags`` value for ``->iomap_begin`` will be ``IOMAP_DAX`` with any
+combination of the following enhancements:
+
+ * ``IOMAP_NOWAIT``, as defined previously.
+
+Callers commonly hold ``i_rwsem`` in shared mode before calling this
+function.
+
+fsdax Writes
+------------
+
+A fsdax write initiates a memcpy to the storage device from the caller's
+buffer.
+The ``flags`` value for ``->iomap_begin`` will be ``IOMAP_DAX |
+IOMAP_WRITE`` with any combination of the following enhancements:
+
+ * ``IOMAP_NOWAIT``, as defined previously.
+
+ * ``IOMAP_OVERWRITE_ONLY``: The caller requires a pure overwrite to be
+ performed from this mapping.
+ This requires the filesystem extent mapping to already exist as an
+ ``IOMAP_MAPPED`` type and span the entire range of the write I/O
+ request.
+ If the filesystem cannot map this request in a way that allows the
+ iomap infrastructure to perform a pure overwrite, it must fail the
+ mapping operation with ``-EAGAIN``.
+
+Callers commonly hold ``i_rwsem`` in exclusive mode before calling this
+function.
+
+fsdax mmap Faults
+~~~~~~~~~~~~~~~~~
+
+The ``dax_iomap_fault`` function handles read and write faults to fsdax
+storage.
+For a read fault, ``IOMAP_DAX | IOMAP_FAULT`` will be passed as the
+``flags`` argument to ``->iomap_begin``.
+For a write fault, ``IOMAP_DAX | IOMAP_FAULT | IOMAP_WRITE`` will be
+passed as the ``flags`` argument to ``->iomap_begin``.
+
+Callers commonly hold the same locks as they do to call their iomap
+pagecache counterparts.
+
+fsdax Truncation, fallocate, and Unsharing
+------------------------------------------
+
+For fsdax files, the following functions are provided to replace their
+iomap pagecache I/O counterparts.
+The ``flags`` argument to ``->iomap_begin`` is the same as the
+pagecache counterparts, with ``IOMAP_DAX`` added.
+
+ * ``dax_file_unshare``
+ * ``dax_zero_range``
+ * ``dax_truncate_page``
+
+Callers commonly hold the same locks as they do to call their iomap
+pagecache counterparts.
+
+fsdax Deduplication
+-------------------
+
+Filesystems implementing the ``FIDEDUPERANGE`` ioctl must call the
+``dax_remap_file_range_prep`` function with their own iomap read ops.
+
+Seeking Files
+=============
+
+iomap implements the two iterating whence modes of the ``llseek`` system
+call.
+
+SEEK_DATA
+---------
+
+The ``iomap_seek_data`` function implements the SEEK_DATA "whence" value
+for llseek.
+``IOMAP_REPORT`` will be passed as the ``flags`` argument to
+``->iomap_begin``.
+
+For unwritten mappings, the pagecache will be searched.
+Regions of the pagecache with a folio mapped and uptodate fsblocks
+within those folios will be reported as data areas.
+
+Callers commonly hold ``i_rwsem`` in shared mode before calling this
+function.
+
+SEEK_HOLE
+---------
+
+The ``iomap_seek_hole`` function implements the SEEK_HOLE "whence" value
+for llseek.
+``IOMAP_REPORT`` will be passed as the ``flags`` argument to
+``->iomap_begin``.
+
+For unwritten mappings, the pagecache will be searched.
+Regions of the pagecache with no folio mapped, or a !uptodate fsblock
+within a folio will be reported as sparse hole areas.
+
+Callers commonly hold ``i_rwsem`` in shared mode before calling this
+function.
+
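+Wiring both modes into a filesystem's ``->llseek`` method typically
+looks like the sketch below; ``xxx_read_iomap_ops`` is a hypothetical
+read-only ops structure, and locking is elided.
+
+.. code-block:: c
+
+ /* Sketch only: xxx_read_iomap_ops is hypothetical; locking elided. */
+ static loff_t xxx_file_llseek(struct file *file, loff_t offset, int whence)
+ {
+         struct inode *inode = file_inode(file);
+
+         switch (whence) {
+         case SEEK_HOLE:
+                 offset = iomap_seek_hole(inode, offset, &xxx_read_iomap_ops);
+                 break;
+         case SEEK_DATA:
+                 offset = iomap_seek_data(inode, offset, &xxx_read_iomap_ops);
+                 break;
+         default:
+                 return generic_file_llseek(file, offset, whence);
+         }
+
+         if (offset < 0)
+                 return offset;
+         return vfs_setpos(file, offset, inode->i_sb->s_maxbytes);
+ }
+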
+Swap File Activation
+====================
+
+The ``iomap_swapfile_activate`` function finds all the base-page aligned
+regions in a file and sets them up as swap space.
+The file will be ``fsync()``'d before activation.
+``IOMAP_REPORT`` will be passed as the ``flags`` argument to
+``->iomap_begin``.
+All mappings must be mapped or unwritten; they cannot be dirty or shared,
+and cannot span multiple block devices.
+cannot span multiple block devices.
+Callers must hold ``i_rwsem`` in exclusive mode; this is already
+provided by ``swapon``.
+
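+The ``->swap_activate`` method usually reduces to a thin wrapper, as in
+this sketch (``xxx_read_iomap_ops`` is hypothetical):
+
+.. code-block:: c
+
+ static int xxx_swap_activate(struct swap_info_struct *sis,
+                              struct file *swap_file, sector_t *span)
+ {
+         /*
+          * Filesystems whose file data can live on a device other than
+          * sb->s_bdev also need to set sis->bdev here, as XFS does.
+          */
+         return iomap_swapfile_activate(sis, swap_file, span,
+                                        &xxx_read_iomap_ops);
+ }
+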
+File Space Mapping Reporting
+============================
+
+iomap implements two of the file space mapping system calls.
+
+FS_IOC_FIEMAP
+-------------
+
+The ``iomap_fiemap`` function exports file extent mappings to userspace
+in the format specified by the ``FS_IOC_FIEMAP`` ioctl.
+``IOMAP_REPORT`` will be passed as the ``flags`` argument to
+``->iomap_begin``.
+Callers commonly hold ``i_rwsem`` in shared mode before calling this
+function.
+
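+A typical ``->fiemap`` inode operation is a thin wrapper, sketched below
+with a hypothetical ``xxx_read_iomap_ops``.
+
+.. code-block:: c
+
+ /* Sketch only: xxx_read_iomap_ops is hypothetical. */
+ static int xxx_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
+                       u64 start, u64 len)
+ {
+         int error;
+
+         /* Validate and clamp the request before walking mappings. */
+         error = fiemap_prep(inode, fieinfo, start, &len, 0);
+         if (error)
+                 return error;
+
+         return iomap_fiemap(inode, fieinfo, start, len, &xxx_read_iomap_ops);
+ }
+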
+FIBMAP (deprecated)
+-------------------
+
+``iomap_bmap`` implements FIBMAP.
+The calling conventions are the same as for FIEMAP.
+This function is only provided to maintain compatibility for filesystems
+that implemented FIBMAP prior to conversion.
+This ioctl is deprecated; do **not** add a FIBMAP implementation to
+filesystems that do not have it.
+Callers should probably hold ``i_rwsem`` in shared mode before calling
+this function, but this is unclear.
diff --git a/Documentation/filesystems/iomap/porting.rst b/Documentation/filesystems/iomap/porting.rst
new file mode 100644
index 000000000000..3d49a32c0fff
--- /dev/null
+++ b/Documentation/filesystems/iomap/porting.rst
@@ -0,0 +1,120 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. _iomap_porting:
+
+..
+ Dumb style notes to maintain the author's sanity:
+ Please try to start sentences on separate lines so that
+ sentence changes don't bleed colors in diff.
+ Heading decorations are documented in sphinx.rst.
+
+=======================
+Porting Your Filesystem
+=======================
+
+.. contents:: Table of Contents
+ :local:
+
+Why Convert?
+============
+
+There are several reasons to convert a filesystem to iomap:
+
+ 1. The classic Linux I/O path is not terribly efficient.
+ Pagecache operations lock a single base page at a time and then call
+ into the filesystem to return a mapping for only that page.
+ Direct I/O operations build I/O requests a single file block at a
+ time.
+ This worked well enough for direct/indirect-mapped filesystems such
+ as ext2, but is very inefficient for extent-based filesystems such
+ as XFS.
+
+ 2. Large folios are only supported via iomap; there are no plans to
+ convert the old buffer_head path to use them.
+
+ 3. Direct access to storage on memory-like devices (fsdax) is only
+ supported via iomap.
+
+ 4. Lower maintenance overhead for individual filesystem maintainers.
+ iomap handles common pagecache related operations itself, such as
+ allocating, instantiating, locking, and unlocking of folios.
+ No ->write_begin(), ->write_end() or direct_IO
+ address_space_operations are required to be implemented by
+    filesystems using iomap.
+
+How Do I Convert a Filesystem?
+==============================
+
+First, add ``#include <linux/iomap.h>`` to your source code and add
+``select FS_IOMAP`` to your filesystem's Kconfig option.
+Build the kernel and run fstests with the ``-g all`` option across a wide
+variety of your filesystem's supported configurations to build a
+baseline of which tests pass and which ones fail.
+
+The recommended approach is first to implement ``->iomap_begin`` (and
+``->iomap_end`` if necessary) to allow iomap to obtain a read-only
+mapping of a file range.
+In most cases, this is a relatively trivial conversion of the existing
+``get_block()`` function for read-only mappings.
+``FS_IOC_FIEMAP`` is a good first target because it is trivial to
+implement support for it and then to determine that the extent map
+iteration is correct from userspace.
+If FIEMAP is returning the correct information, it's a good sign that
+other read-only mapping operations will do the right thing.
+
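+For a block-mapped filesystem, that first read-only ``->iomap_begin``
+can often be built directly on top of the existing ``get_block()``
+helper, roughly as in the sketch below.
+``xxx_get_block`` is a stand-in for your filesystem's existing helper,
+and error handling is kept minimal.
+
+.. code-block:: c
+
+ /* Sketch only: xxx_get_block is the filesystem's existing helper. */
+ static int xxx_read_iomap_begin(struct inode *inode, loff_t pos,
+                                 loff_t length, unsigned flags,
+                                 struct iomap *iomap, struct iomap *srcmap)
+ {
+         struct buffer_head bh = { .b_size = i_blocksize(inode) };
+         sector_t lblk = pos >> inode->i_blkbits;
+         int error;
+
+         error = xxx_get_block(inode, lblk, &bh, 0);
+         if (error)
+                 return error;
+
+         iomap->offset = (loff_t)lblk << inode->i_blkbits;
+         iomap->length = bh.b_size;
+         if (buffer_mapped(&bh)) {
+                 iomap->type = IOMAP_MAPPED;
+                 iomap->addr = (u64)bh.b_blocknr << inode->i_blkbits;
+                 iomap->bdev = bh.b_bdev;
+         } else {
+                 iomap->type = IOMAP_HOLE;
+                 iomap->addr = IOMAP_NULL_ADDR;
+         }
+         return 0;
+ }
+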
+Next, modify the filesystem's ``get_block(create = false)``
+implementation to use the new ``->iomap_begin`` implementation to map
+file space for selected read operations.
+Hide behind a debugging knob the ability to switch on the iomap mapping
+functions for selected call paths.
+It is necessary to write some code to fill out the bufferhead-based
+mapping information from the ``iomap`` structure, but the new functions
+can be tested without needing to implement any iomap APIs.
+
+Once the read-only functions are working like this, convert each high
+level file operation one by one to use iomap native APIs instead of
+going through ``get_block()``.
+Done one at a time, regressions should be self evident.
+You *do* have a regression test baseline for fstests, right?
+It is suggested to convert swap file activation, ``SEEK_DATA``, and
+``SEEK_HOLE`` before tackling the I/O paths.
+A likely complexity at this point will be converting the buffered read
+I/O path because of bufferheads.
+The buffered read I/O path doesn't need to be converted yet, though the
+direct I/O read path should be converted in this phase.
+
+At this point, you should look over your ``->iomap_begin`` function.
+If it switches between large blocks of code based on dispatching of the
+``flags`` argument, you should consider breaking it up into
+per-operation iomap ops with smaller, more cohesive functions.
+XFS is a good example of this.
+
+The next thing to do is implement ``get_block(create = true)``
+functionality in the ``->iomap_begin``/``->iomap_end`` methods.
+It is strongly recommended to create separate mapping functions and
+iomap ops for write operations.
+Then convert the direct I/O write path to iomap, and start running fsx
+with direct I/O enabled in earnest on the filesystem.
+This will flush out lots of data integrity corner case bugs that the new
+write mapping implementation introduces.
+
+Now, convert any remaining file operations to call the iomap functions.
+This will get the entire filesystem using the new mapping functions, and
+they should largely be debugged and working correctly after this step.
+
+Most likely at this point, the buffered read and write paths will still
+need to be converted.
+The mapping functions should all work correctly, so all that needs to be
+done is rewriting all the code that interfaces with bufferheads to
+interface with iomap and folios.
+It is much easier first to get regular file I/O (without any fancy
+features like fscrypt, fsverity, compression, or data=journaling)
+converted to use iomap.
+Some of those fancy features (fscrypt and compression) aren't
+implemented yet in iomap.
+For unjournalled filesystems that use the pagecache for symbolic links
+and directories, you might also try converting their handling to iomap.
+
+The rest is left as an exercise for the reader, as it will be different
+for every filesystem.
+If you encounter problems, email the people and lists reported by
+``scripts/get_maintainer.pl`` for help.
diff --git a/Documentation/filesystems/journalling.rst b/Documentation/filesystems/journalling.rst
index e18f90ffc6fd..0254f7d57429 100644
--- a/Documentation/filesystems/journalling.rst
+++ b/Documentation/filesystems/journalling.rst
@@ -137,7 +137,7 @@ Fast commits
JBD2 to also allows you to perform file-system specific delta commits known as
fast commits. In order to use fast commits, you will need to set following
-callbacks that perform correspodning work:
+callbacks that perform corresponding work:
`journal->j_fc_cleanup_cb`: Cleanup function called after every full commit and
fast commit.
@@ -149,7 +149,7 @@ File system is free to perform fast commits as and when it wants as long as it
gets permission from JBD2 to do so by calling the function
:c:func:`jbd2_fc_begin_commit()`. Once a fast commit is done, the client
file system should tell JBD2 about it by calling
-:c:func:`jbd2_fc_end_commit()`. If file system wants JBD2 to perform a full
+:c:func:`jbd2_fc_end_commit()`. If the file system wants JBD2 to perform a full
commit immediately after stopping the fast commit it can do so by calling
:c:func:`jbd2_fc_end_commit_fallback()`. This is useful if fast commit operation
fails for some reason and the only way to guarantee consistency is for JBD2 to
@@ -199,7 +199,7 @@ Journal Level
.. kernel-doc:: fs/jbd2/recovery.c
:internal:
-Transasction Level
+Transaction Level
~~~~~~~~~~~~~~~~~~
.. kernel-doc:: fs/jbd2/transaction.c
diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst
index d5bf4b6b7509..d20a32b77b60 100644
--- a/Documentation/filesystems/locking.rst
+++ b/Documentation/filesystems/locking.rst
@@ -17,7 +17,8 @@ dentry_operations
prototypes::
- int (*d_revalidate)(struct dentry *, unsigned int);
+ int (*d_revalidate)(struct inode *, const struct qstr *,
+ struct dentry *, unsigned int);
int (*d_weak_revalidate)(struct dentry *, unsigned int);
int (*d_hash)(const struct dentry *, struct qstr *);
int (*d_compare)(const struct dentry *,
@@ -29,7 +30,9 @@ prototypes::
char *(*d_dname)((struct dentry *dentry, char *buffer, int buflen);
struct vfsmount *(*d_automount)(struct path *path);
int (*d_manage)(const struct path *, bool);
- struct dentry *(*d_real)(struct dentry *, const struct inode *);
+ struct dentry *(*d_real)(struct dentry *, enum d_real_type type);
+ bool (*d_unalias_trylock)(const struct dentry *);
+ void (*d_unalias_unlock)(const struct dentry *);
locking rules:
@@ -49,6 +52,8 @@ d_dname: no no no no
d_automount: no no yes no
d_manage: no no yes (ref-walk) maybe
d_real no no yes no
+d_unalias_trylock yes no no no
+d_unalias_unlock yes no no no
================== =========== ======== ============== ========
inode_operations
@@ -251,10 +256,10 @@ prototypes::
void (*readahead)(struct readahead_control *);
int (*write_begin)(struct file *, struct address_space *mapping,
loff_t pos, unsigned len,
- struct page **pagep, void **fsdata);
+ struct folio **foliop, void **fsdata);
int (*write_end)(struct file *, struct address_space *mapping,
loff_t pos, unsigned len, unsigned copied,
- struct page *page, void *fsdata);
+ struct folio *folio, void *fsdata);
sector_t (*bmap)(struct address_space *, sector_t);
void (*invalidate_folio) (struct folio *, size_t start, size_t len);
bool (*release_folio)(struct folio *, gfp_t);
@@ -280,7 +285,7 @@ read_folio: yes, unlocks shared
writepages:
dirty_folio: maybe
readahead: yes, unlocks shared
-write_begin: locks the page exclusive
+write_begin: locks the folio exclusive
write_end: yes, unlocks exclusive
bmap:
invalidate_folio: yes exclusive
diff --git a/Documentation/filesystems/mount_api.rst b/Documentation/filesystems/mount_api.rst
index 9aaf6ef75eb5..d92c276f1575 100644
--- a/Documentation/filesystems/mount_api.rst
+++ b/Documentation/filesystems/mount_api.rst
@@ -645,6 +645,8 @@ The members are as follows:
fs_param_is_blockdev Blockdev path * Needs lookup
fs_param_is_path Path * Needs lookup
fs_param_is_fd File descriptor result->int_32
+ fs_param_is_uid User ID (u32) result->uid
+ fs_param_is_gid Group ID (u32) result->gid
======================= ======================= =====================
Note that if the value is of fs_param_is_bool type, fs_parse() will try
@@ -678,6 +680,8 @@ The members are as follows:
fsparam_bdev() fs_param_is_blockdev
fsparam_path() fs_param_is_path
fsparam_fd() fs_param_is_fd
+ fsparam_uid() fs_param_is_uid
+ fsparam_gid() fs_param_is_gid
======================= ===============================================
all of which take two arguments, name string and option number - for
@@ -766,7 +770,8 @@ process the parameters it is given.
* ::
- bool fs_validate_description(const struct fs_parameter_description *desc);
+ bool fs_validate_description(const char *name,
+ const struct fs_parameter_description *desc);
This performs some validation checks on a parameter description. It
returns true if the description is good and false if it is not. It will
@@ -784,8 +789,9 @@ process the parameters it is given.
option number (which it returns).
If successful, and if the parameter type indicates the result is a
- boolean, integer or enum type, the value is converted by this function and
- the result stored in result->{boolean,int_32,uint_32,uint_64}.
+ boolean, integer, enum, uid, or gid type, the value is converted by this
+ function and the result stored in
+ result->{boolean,int_32,uint_32,uint_64,uid,gid}.
If a match isn't initially made, the key is prefixed with "no" and no
value is present then an attempt will be made to look up the key with the
diff --git a/Documentation/filesystems/multigrain-ts.rst b/Documentation/filesystems/multigrain-ts.rst
new file mode 100644
index 000000000000..c779e47284e8
--- /dev/null
+++ b/Documentation/filesystems/multigrain-ts.rst
@@ -0,0 +1,125 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+Multigrain Timestamps
+=====================
+
+Introduction
+============
+Historically, the kernel has always used coarse time values to stamp inodes.
+This value is updated every jiffy, so any change that happens within that jiffy
+will end up with the same timestamp.
+
+When the kernel goes to stamp an inode (due to a read or write), it first gets
+the current time and then compares it to the existing timestamp(s) to see
+whether anything will change. If nothing changed, then it can avoid updating
+the inode's metadata.
+
+Coarse timestamps are therefore good from a performance standpoint, since they
+reduce the need for metadata updates, but bad from the standpoint of
+determining whether anything has changed, since a lot of things can happen in a
+jiffy.
+
+They are particularly troublesome with NFSv3, where unchanging timestamps can
+make it difficult to tell whether to invalidate caches. NFSv4 provides a
+dedicated change attribute that should always show a visible change, but not
+all filesystems implement this properly, causing the NFS server to substitute
+the ctime in many cases.
+
+Multigrain timestamps aim to remedy this by selectively using fine-grained
+timestamps when a file has had its timestamps queried recently, and the current
+coarse-grained time does not cause a change.
+
+Inode Timestamps
+================
+There are currently 3 timestamps in the inode that are updated to the current
+wallclock time on different activity:
+
+ctime:
+ The inode change time. This is stamped with the current time whenever
+ the inode's metadata is changed. Note that this value is not settable
+ from userland.
+
+mtime:
+ The inode modification time. This is stamped with the current time
+ any time a file's contents change.
+
+atime:
+ The inode access time. This is stamped whenever an inode's contents are
+ read. Widely considered to be a terrible mistake. Usually avoided with
+ options like noatime or relatime.
+
+Updating the mtime always implies a change to the ctime, but updating the
+atime due to a read request does not.
+
+Multigrain timestamps are only tracked for the ctime and the mtime. atimes are
+not affected and always use the coarse-grained value (subject to the floor).
+
+Inode Timestamp Ordering
+========================
+
+In addition to just providing info about changes to individual files, file
+timestamps also serve an important purpose in applications like "make". These
+programs measure timestamps in order to determine whether source files might be
+newer than cached objects.
+
+Userland applications like make can only determine ordering based on
+operational boundaries. For a syscall those are the syscall entry and exit
+points. For io_uring or nfsd operations, that's the request submission and
+response. In the case of concurrent operations, userland can make no
+determination about the order in which things will occur.
+
+For instance, if a single thread modifies one file, and then another file in
+sequence, the second file must show an equal or later mtime than the first. The
+same is true if two threads are issuing similar operations that do not overlap
+in time.
+
+If however, two threads have racing syscalls that overlap in time, then there
+is no such guarantee, and the second file may appear to have been modified
+before, after or at the same time as the first, regardless of which one was
+submitted first.
+
+Note that the above assumes that the system doesn't experience a backward jump
+of the realtime clock. If that occurs at an inopportune time, then timestamps
+can appear to go backward, even on a properly functioning system.
+
+Multigrain Timestamp Implementation
+===================================
+Multigrain timestamps are aimed at ensuring that changes to a single file are
+always recognizable, without violating the ordering guarantees when multiple
+different files are modified. This affects the mtime and the ctime, but the
+atime will always use coarse-grained timestamps.
+
+It uses an unused bit in the i_ctime_nsec field to indicate whether the mtime
+or ctime has been queried. If either or both have, then the kernel takes
+special care to ensure the next timestamp update will display a visible change.
+This ensures tight cache coherency for use-cases like NFS, without sacrificing
+the benefits of reduced metadata updates when files aren't being watched.
+
+The Ctime Floor Value
+=====================
+It's not sufficient to simply use fine-grained or coarse-grained timestamps
+based on whether the mtime or ctime has been queried. A file could get a
+fine-grained timestamp, and then a second file modified later could get a
+coarse-grained one that appears earlier than the first, which would break the
+kernel's timestamp ordering guarantees.
+
+To mitigate this problem, the kernel maintains a global floor value that
+ensures this can't happen.
+modified at the same time in such a case, but they will never show the reverse
+order. To avoid problems with realtime clock jumps, the floor is managed as a
+monotonic ktime_t, and the values are converted to realtime clock values as
+needed.
+
+Implementation Notes
+====================
+Multigrain timestamps are intended for use by local filesystems that get
+ctime values from the local clock. This is in contrast to network filesystems
+and the like that just mirror timestamp values from a server.
+
+For most filesystems, it's sufficient to just set the FS_MGTIME flag in the
+fstype->fs_flags in order to opt in, provided the ctime is only ever set via
+inode_set_ctime_current(). If the filesystem has a ->getattr routine that
+doesn't call generic_fillattr, then it should call fill_mg_cmtime() to
+fill those values. For setattr, it should use setattr_copy() to update the
+timestamps, or otherwise mimic its behavior.
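+
+For illustration, a hypothetical "myfs" that meets those requirements might
+opt in like this (the init_fs_context and kill_sb callbacks are simply
+placeholders)::
+
+  static struct file_system_type myfs_fs_type = {
+          .owner           = THIS_MODULE,
+          .name            = "myfs",
+          .init_fs_context = myfs_init_fs_context,
+          .kill_sb         = kill_block_super,
+          /* FS_MGTIME opts this filesystem in to multigrain timestamps */
+          .fs_flags        = FS_REQUIRES_DEV | FS_MGTIME,
+  };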
diff --git a/Documentation/filesystems/netfs_library.rst b/Documentation/filesystems/netfs_library.rst
index 4cc657d743f7..73f0bfd7e903 100644
--- a/Documentation/filesystems/netfs_library.rst
+++ b/Documentation/filesystems/netfs_library.rst
@@ -116,7 +116,7 @@ The following services are provided:
* Handle local caching, allowing cached data and server-read data to be
interleaved for a single request.
- * Handle clearing of bufferage that aren't on the server.
+ * Handle clearing of bufferage that isn't on the server.
* Handle retrying of reads that failed, switching reads from the cache to the
server as necessary.
@@ -592,4 +592,3 @@ API Function Reference
.. kernel-doc:: include/linux/netfs.h
.. kernel-doc:: fs/netfs/buffered_read.c
-.. kernel-doc:: fs/netfs/io.c
diff --git a/Documentation/filesystems/nfs/exporting.rst b/Documentation/filesystems/nfs/exporting.rst
index f04ce1215a03..de64d2d002a2 100644
--- a/Documentation/filesystems/nfs/exporting.rst
+++ b/Documentation/filesystems/nfs/exporting.rst
@@ -238,10 +238,3 @@ following flags are defined:
all of an inode's dirty data on last close. Exports that behave this
way should set EXPORT_OP_FLUSH_ON_CLOSE so that NFSD knows to skip
waiting for writeback when closing such files.
-
- EXPORT_OP_ASYNC_LOCK - Indicates a capable filesystem to do async lock
- requests from lockd. Only set EXPORT_OP_ASYNC_LOCK if the filesystem has
- it's own ->lock() functionality as core posix_lock_file() implementation
- has no async lock request handling yet. For more information about how to
- indicate an async lock request from a ->lock() file_operations struct, see
- fs/locks.c and comment for the function vfs_lock_file().
diff --git a/Documentation/filesystems/nfs/index.rst b/Documentation/filesystems/nfs/index.rst
index 8536134f31fd..95c2c009874c 100644
--- a/Documentation/filesystems/nfs/index.rst
+++ b/Documentation/filesystems/nfs/index.rst
@@ -8,6 +8,7 @@ NFS
client-identifier
exporting
+ localio
pnfs
rpc-cache
rpc-server-gss
diff --git a/Documentation/filesystems/nfs/localio.rst b/Documentation/filesystems/nfs/localio.rst
new file mode 100644
index 000000000000..79808b37d745
--- /dev/null
+++ b/Documentation/filesystems/nfs/localio.rst
@@ -0,0 +1,357 @@
+===========
+NFS LOCALIO
+===========
+
+Overview
+========
+
+The LOCALIO auxiliary RPC protocol allows the Linux NFS client and
+server to reliably handshake to determine if they are on the same
+host. Select "NFS client and server support for LOCALIO auxiliary
+protocol" in menuconfig to enable CONFIG_NFS_LOCALIO in the kernel
+config (both CONFIG_NFS_FS and CONFIG_NFSD must also be enabled).
+
+Once an NFS client and server handshake as "local", the client will
+bypass the network RPC protocol for read, write and commit operations.
+Due to this XDR and RPC bypass, these operations will operate faster.
+
+The LOCALIO auxiliary protocol's implementation, which uses the same
+connection as NFS traffic, follows the pattern established by the NFS
+ACL protocol extension.
+
+The LOCALIO auxiliary protocol is needed to allow robust discovery of
+clients local to their servers. In a private implementation that
+preceded use of this LOCALIO protocol, a fragile sockaddr network
+address based match against all local network interfaces was attempted.
+But unlike the LOCALIO protocol, the sockaddr-based matching didn't
+handle use of iptables or containers.
+
+The robust handshake between local client and server is just the
+beginning; the ultimate use case this locality makes possible is that
+the client can open files and issue reads, writes and commits directly
+to the server without having to go over the network. The requirement is
+to perform these loopback NFS operations as efficiently as possible;
+this is particularly useful for container use cases (e.g. kubernetes)
+where it is possible to run an IO job local to the server.
+
+The performance advantage realized from LOCALIO's ability to bypass
+using XDR and RPC for reads, writes and commits can be extreme, e.g.:
+
+fio for 20 secs with directio, qd of 8, 16 libaio threads:
+ - With LOCALIO:
+ 4K read: IOPS=979k, BW=3825MiB/s (4011MB/s)(74.7GiB/20002msec)
+ 4K write: IOPS=165k, BW=646MiB/s (678MB/s)(12.6GiB/20002msec)
+ 128K read: IOPS=402k, BW=49.1GiB/s (52.7GB/s)(982GiB/20002msec)
+ 128K write: IOPS=11.5k, BW=1433MiB/s (1503MB/s)(28.0GiB/20004msec)
+
+ - Without LOCALIO:
+ 4K read: IOPS=79.2k, BW=309MiB/s (324MB/s)(6188MiB/20003msec)
+ 4K write: IOPS=59.8k, BW=234MiB/s (245MB/s)(4671MiB/20002msec)
+ 128K read: IOPS=33.9k, BW=4234MiB/s (4440MB/s)(82.7GiB/20004msec)
+ 128K write: IOPS=11.5k, BW=1434MiB/s (1504MB/s)(28.0GiB/20011msec)
+
+fio for 20 secs with directio, qd of 8, 1 libaio thread:
+ - With LOCALIO:
+ 4K read: IOPS=230k, BW=898MiB/s (941MB/s)(17.5GiB/20001msec)
+ 4K write: IOPS=22.6k, BW=88.3MiB/s (92.6MB/s)(1766MiB/20001msec)
+ 128K read: IOPS=38.8k, BW=4855MiB/s (5091MB/s)(94.8GiB/20001msec)
+ 128K write: IOPS=11.4k, BW=1428MiB/s (1497MB/s)(27.9GiB/20001msec)
+
+ - Without LOCALIO:
+ 4K read: IOPS=77.1k, BW=301MiB/s (316MB/s)(6022MiB/20001msec)
+ 4K write: IOPS=32.8k, BW=128MiB/s (135MB/s)(2566MiB/20001msec)
+ 128K read: IOPS=24.4k, BW=3050MiB/s (3198MB/s)(59.6GiB/20001msec)
+ 128K write: IOPS=11.4k, BW=1430MiB/s (1500MB/s)(27.9GiB/20001msec)
+
+FAQ
+===
+
+1. What are the use cases for LOCALIO?
+
+ a. Workloads where the NFS client and server are on the same host
+ realize improved IO performance. In particular, it is common when
+ running containerised workloads for jobs to find themselves
+ running on the same host as the knfsd server being used for
+ storage.
+
+2. What are the requirements for LOCALIO?
+
+ a. Bypass use of the network RPC protocol as much as possible. This
+ includes bypassing XDR and RPC for open, read, write and commit
+ operations.
+ b. Allow client and server to autonomously discover if they are
+ running local to each other without making any assumptions about
+ the local network topology.
+ c. Support the use of containers by being compatible with relevant
+ namespaces (e.g. network, user, mount).
+ d. Support all versions of NFS. NFSv3 is of particular importance
+ because it has wide enterprise usage and pNFS flexfiles makes use
+ of it for the data path.
+
+3. Why doesn’t LOCALIO just compare IP addresses or hostnames when
+ deciding if the NFS client and server are co-located on the same
+ host?
+
+ Since one of the main use cases is containerised workloads, we cannot
+ assume that IP addresses will be shared between the client and
+ server. This sets up a requirement for a handshake protocol that
+ needs to go over the same connection as the NFS traffic in order to
+ identify that the client and the server really are running on the
+ same host. The handshake uses a secret that is sent over the wire,
+ and can be verified by both parties by comparing with a value stored
+ in shared kernel memory if they are truly co-located.
+
+4. Does LOCALIO improve pNFS flexfiles?
+
+ Yes, LOCALIO complements pNFS flexfiles by allowing it to take
+ advantage of NFS client and server locality. Policy that initiates
+ client IO as closely to the server where the data is stored naturally
+ benefits from the data path optimization LOCALIO provides.
+
+5. Why not develop a new pNFS layout to enable LOCALIO?
+
+ A new pNFS layout could be developed, but doing so would put the
+ onus on the server to somehow discover that the client is co-located
+ when deciding to hand out the layout.
+ There is value in a simpler approach (as provided by LOCALIO) that
+ allows the NFS client to negotiate and leverage locality without
+ requiring more elaborate modeling and discovery of such locality in a
+ more centralized manner.
+
+6. Why is having the client perform a server-side file OPEN, without
+ using RPC, beneficial? Is the benefit pNFS specific?
+
+ Avoiding the use of XDR and RPC for file opens is beneficial to
+ performance regardless of whether pNFS is used. Especially when
+ dealing with small files it's best to avoid going over the wire
+ whenever possible; otherwise it could reduce or even negate the
+ benefits of avoiding the wire for doing the small file I/O itself.
+ Given LOCALIO's requirements the current approach of having the
+ client perform a server-side file open, without using RPC, is ideal.
+ If in the future requirements change then we can adapt accordingly.
+
+7. Why is LOCALIO only supported with UNIX Authentication (AUTH_UNIX)?
+
+ Strong authentication is usually tied to the connection itself. It
+ works by establishing a context that is cached by the server, and
+ that acts as the key for discovering the authorisation token, which
+ can then be passed to rpc.mountd to complete the authentication
+ process. On the other hand, in the case of AUTH_UNIX, the credential
+ that was passed over the wire is used directly as the key in the
+ upcall to rpc.mountd. This simplifies the authentication process, and
+ so makes AUTH_UNIX easier to support.
+
+8. How do export options that translate RPC user IDs behave for LOCALIO
+ operations (e.g. root_squash, all_squash)?
+
+ Export options that translate user IDs are managed by nfsd_setuser()
+ which is called by nfsd_setuser_and_check_port() which is called by
+ __fh_verify(). So they get handled exactly the same way for LOCALIO
+ as they do for non-LOCALIO.
+
+9. How does LOCALIO make certain that object lifetimes are managed
+ properly given NFSD and NFS operate in different contexts?
+
+ See the detailed "NFS Client and Server Interlock" section below.
+
+RPC
+===
+
+The LOCALIO auxiliary RPC protocol consists of a single "UUID_IS_LOCAL"
+RPC method that allows the Linux NFS client to verify the local Linux
+NFS server can see the nonce (single-use UUID) the client generated and
+made available in nfs_common. This protocol isn't part of an IETF
+standard, nor does it need to be, considering it is a Linux-to-Linux
+auxiliary RPC protocol that amounts to an implementation detail.
+
+The UUID_IS_LOCAL method encodes the client generated uuid_t in terms of
+the fixed UUID_SIZE (16 bytes). The fixed size opaque encode and decode
+XDR methods are used instead of the less efficient variable sized
+methods.
+
+The RPC program number for the NFS_LOCALIO_PROGRAM is 400122 (as assigned
+by IANA, see https://www.iana.org/assignments/rpc-program-numbers/ ):
+Linux Kernel Organization 400122 nfslocalio
+
+The LOCALIO protocol spec in rpcgen syntax is::
+
+ /* raw RFC 9562 UUID */
+ #define UUID_SIZE 16
+ typedef u8 uuid_t<UUID_SIZE>;
+
+ program NFS_LOCALIO_PROGRAM {
+ version LOCALIO_V1 {
+ void
+ NULL(void) = 0;
+
+ void
+ UUID_IS_LOCAL(uuid_t) = 1;
+ } = 1;
+ } = 400122;
+
+LOCALIO uses the same transport connection as NFS traffic. As such,
+LOCALIO is not registered with rpcbind.
+
+NFS Common and Client/Server Handshake
+======================================
+
+fs/nfs_common/nfslocalio.c provides interfaces that enable an NFS client
+to generate a nonce (single-use UUID) and associated short-lived
+nfs_uuid_t struct, and to register it with nfs_common for subsequent
+lookup and verification by the NFS server; if matched, the NFS server
+populates members in the nfs_uuid_t struct. The NFS client then uses
+nfs_common to transfer the nfs_uuid_t from nfs_common's uuids_list to
+the nn->nfsd_serv clients_list. See:
+fs/nfs/localio.c:nfs_local_probe()
+
+nfs_common's nfs_uuids list is the basis for LOCALIO enablement; as such,
+it has members that point to nfsd memory for direct use by the client
+(e.g. 'net' is the server's network namespace, and through it the client
+can access nn->nfsd_serv with proper RCU read access). It is this client
+and server synchronization that enables advanced usage and allows the
+lifetime of objects to span from the host kernel's nfsd to per-container
+knfsd instances that are connected to nfs clients running on the same
+local host.
+
+NFS Client and Server Interlock
+===============================
+
+LOCALIO provides the nfs_uuid_t object and associated interfaces to
+allow proper network namespace (net-ns) and NFSD object refcounting.
+
+LOCALIO required the introduction and use of NFSD's percpu nfsd_net_ref
+to interlock nfsd_shutdown_net() and nfsd_open_local_fh(), to ensure
+each net-ns is not destroyed while in use by nfsd_open_local_fh(), and
+warrants a more detailed explanation:
+
+ nfsd_open_local_fh() uses nfsd_net_try_get() before opening its
+ nfsd_file handle and then the caller (NFS client) must drop the
+ reference for the nfsd_file and associated net-ns using
+ nfsd_file_put_local() once it has completed its IO.
+
+ This interlock working relies heavily on nfsd_open_local_fh() being
+ afforded the ability to safely deal with the possibility that the
+ NFSD's net-ns (and nfsd_net by association) may have been destroyed
+ by nfsd_destroy_serv() via nfsd_shutdown_net().
+
+This interlock of the NFS client and server has been verified to fix an
+easy-to-hit crash that would occur if an NFSD instance running in a
+container, with a LOCALIO client mounted, is shut down. Upon restart of
+the container and associated NFSD, the client would go on to crash due
+to a NULL pointer dereference caused by the LOCALIO client attempting to
+call nfsd_open_local_fh() without having a proper reference on NFSD's
+net-ns.
+
+NFS Client issues IO instead of Server
+======================================
+
+Because LOCALIO is focused on protocol bypass to achieve improved IO
+performance, alternatives to the traditional NFS wire protocol (SUNRPC
+with XDR) must be provided to access the backing filesystem.
+
+See fs/nfs/localio.c:nfs_local_open_fh() and
+fs/nfsd/localio.c:nfsd_open_local_fh() for the interface that makes
+focused use of select nfs server objects to allow a client local to a
+server to open a file pointer without needing to go over the network.
+
+The client's fs/nfs/localio.c:nfs_local_open_fh() will call into the
+server's fs/nfsd/localio.c:nfsd_open_local_fh() and carefully access
+both the associated nfsd network namespace and nn->nfsd_serv in terms of
+RCU. If nfsd_open_local_fh() finds that the client no longer sees valid
+nfsd objects (be it struct net or nn->nfsd_serv) it returns -ENXIO
+to nfs_local_open_fh() and the client will try to reestablish the
+LOCALIO resources needed by calling nfs_local_probe() again. This
+recovery is needed if/when an nfsd instance running in a container were
+to reboot while a LOCALIO client is connected to it.
+
+Once the client has an open nfsd_file pointer it will issue reads,
+writes and commits directly to the underlying local filesystem (normally
+done by the nfs server). As such, for these operations, the NFS client
+is issuing IO to the underlying local filesystem that it is sharing with
+the NFS server. See: fs/nfs/localio.c:nfs_local_doio() and
+fs/nfs/localio.c:nfs_local_commit().
+
+With normal NFS that makes use of RPC to issue IO to the server, if an
+application uses O_DIRECT the NFS client will bypass the pagecache but
+the NFS server will not. The NFS server's use of buffered IO allows
+applications to be less precise with their alignment when issuing IO to
+the NFS client. But if all applications properly align their IO, LOCALIO
+can be configured to use end-to-end O_DIRECT semantics from the NFS
+client to the underlying local filesystem that it shares with the NFS
+server, by setting the 'localio_O_DIRECT_semantics' nfs module
+parameter to Y, e.g.::
+
+ echo Y > /sys/module/nfs/parameters/localio_O_DIRECT_semantics
+
+Once enabled, it will cause LOCALIO to use end-to-end O_DIRECT semantics
+(but again, this may cause IO to fail if applications do not properly
+align their IO).
+
+Security
+========
+
+LOCALIO is only supported when UNIX-style authentication (AUTH_UNIX, aka
+AUTH_SYS) is used.
+
+Care is taken to ensure the same NFS security mechanisms are used
+(authentication, etc) regardless of whether LOCALIO or regular NFS
+access is used. The auth_domain established as part of the traditional
+NFS client access to the NFS server is also used for LOCALIO.
+
+Relative to containers, LOCALIO gives the client access to the network
+namespace the server has. This is required to allow the client to access
+the server's per-namespace nfsd_net struct. With traditional NFS, the
+client is afforded this same level of access (albeit in terms of the NFS
+protocol via SUNRPC). No other namespaces (user, mount, etc) have been
+altered or purposely extended from the server to the client.
+
+Module Parameters
+=================
+
+/sys/module/nfs/parameters/localio_enabled (bool)
+controls if LOCALIO is enabled, defaults to Y. If client and server are
+local but 'localio_enabled' is set to N then LOCALIO will not be used.
+
+/sys/module/nfs/parameters/localio_O_DIRECT_semantics (bool)
+controls if O_DIRECT extends down to the underlying filesystem, defaults
+to N. Application IO must be logical blocksize aligned, otherwise
+O_DIRECT will fail.
+
+/sys/module/nfsv3/parameters/nfs3_localio_probe_throttle (uint)
+controls if NFSv3 read and write IOs will trigger (re)enabling of
+LOCALIO every N (nfs3_localio_probe_throttle) IOs, defaults to 0
+(disabled). Must be a power of 2; the admin keeps all the pieces if they
+misconfigure it (too low a value or a non-power-of-2).
+
+Testing
+=======
+
+The LOCALIO auxiliary protocol and associated NFS LOCALIO read, write
+and commit access have proven stable against various test scenarios:
+
+- Client and server both on the same host.
+
+- All permutations of client and server support enablement for both
+ local and remote client and server.
+
+- Testing against NFS storage products that don't support the LOCALIO
+ protocol was also performed.
+
+- Client on host, server within a container (for both v3 and v4.2).
+ The container testing was in terms of podman managed containers and
+ includes successful container stop/restart scenario.
+
+- Formalizing these test scenarios in terms of existing test
+ infrastructure is on-going. Initial regular coverage is provided in
+ terms of ktest running xfstests against a LOCALIO-enabled NFS loopback
+ mount configuration, and includes lockdep and KASAN coverage, see:
+ https://evilpiepirate.org/~testdashboard/ci?user=snitzer&branch=snitm-nfs-next
+ https://github.com/koverstreet/ktest
+
+- Various kdevops testing (in terms of "Chuck's BuildBot") has been
+ performed to regularly verify the LOCALIO changes haven't caused any
+ regressions to non-LOCALIO NFS use cases.
+
+- All of Hammerspace's various sanity tests pass with LOCALIO enabled
+ (this includes numerous pNFS and flexfiles tests).
diff --git a/Documentation/filesystems/ntfs.rst b/Documentation/filesystems/ntfs.rst
deleted file mode 100644
index 5bb093a26485..000000000000
--- a/Documentation/filesystems/ntfs.rst
+++ /dev/null
@@ -1,466 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-================================
-The Linux NTFS filesystem driver
-================================
-
-
-.. Table of contents
-
- - Overview
- - Web site
- - Features
- - Supported mount options
- - Known bugs and (mis-)features
- - Using NTFS volume and stripe sets
- - The Device-Mapper driver
- - The Software RAID / MD driver
- - Limitations when using the MD driver
-
-
-Overview
-========
-
-Linux-NTFS comes with a number of user-space programs known as ntfsprogs.
-These include mkntfs, a full-featured ntfs filesystem format utility,
-ntfsundelete used for recovering files that were unintentionally deleted
-from an NTFS volume and ntfsresize which is used to resize an NTFS partition.
-See the web site for more information.
-
-To mount an NTFS 1.2/3.x (Windows NT4/2000/XP/2003) volume, use the file
-system type 'ntfs'. The driver currently supports read-only mode (with no
-fault-tolerance, encryption or journalling) and very limited, but safe, write
-support.
-
-For fault tolerance and raid support (i.e. volume and stripe sets), you can
-use the kernel's Software RAID / MD driver. See section "Using Software RAID
-with NTFS" for details.
-
-
-Web site
-========
-
-There is plenty of additional information on the linux-ntfs web site
-at http://www.linux-ntfs.org/
-
-The web site has a lot of additional information, such as a comprehensive
-FAQ, documentation on the NTFS on-disk format, information on the Linux-NTFS
-userspace utilities, etc.
-
-
-Features
-========
-
-- This is a complete rewrite of the NTFS driver that used to be in the 2.4 and
- earlier kernels. This new driver implements NTFS read support and is
- functionally equivalent to the old ntfs driver and it also implements limited
- write support. The biggest limitation at present is that files/directories
- cannot be created or deleted. See below for the list of write features that
- are so far supported. Another limitation is that writing to compressed files
- is not implemented at all. Also, neither read nor write access to encrypted
- files is so far implemented.
-- The new driver has full support for sparse files on NTFS 3.x volumes which
- the old driver isn't happy with.
-- The new driver supports execution of binaries due to mmap() now being
- supported.
-- The new driver supports loopback mounting of files on NTFS which is used by
- some Linux distributions to enable the user to run Linux from an NTFS
- partition by creating a large file while in Windows and then loopback
- mounting the file while in Linux and creating a Linux filesystem on it that
- is used to install Linux on it.
-- A comparison of the two drivers using::
-
- time find . -type f -exec md5sum "{}" \;
-
- run three times in sequence with each driver (after a reboot) on a 1.4GiB
- NTFS partition, showed the new driver to be 20% faster in total time elapsed
- (from 9:43 minutes on average down to 7:53). The time spent in user space
- was unchanged but the time spent in the kernel was decreased by a factor of
- 2.5 (from 85 CPU seconds down to 33).
-- The driver does not support short file names in general. For backwards
- compatibility, we implement access to files using their short file names if
- they exist. The driver will not create short file names however, and a
- rename will discard any existing short file name.
-- The new driver supports exporting of mounted NTFS volumes via NFS.
-- The new driver supports async io (aio).
-- The new driver supports fsync(2), fdatasync(2), and msync(2).
-- The new driver supports readv(2) and writev(2).
-- The new driver supports access time updates (including mtime and ctime).
-- The new driver supports truncate(2) and open(2) with O_TRUNC. But at present
- only very limited support for highly fragmented files, i.e. ones which have
- their data attribute split across multiple extents, is included. Another
- limitation is that at present truncate(2) will never create sparse files,
- since to mark a file sparse we need to modify the directory entry for the
- file and we do not implement directory modifications yet.
-- The new driver supports write(2) which can both overwrite existing data and
- extend the file size so that you can write beyond the existing data. Also,
- writing into sparse regions is supported and the holes are filled in with
- clusters. But at present only limited support for highly fragmented files,
- i.e. ones which have their data attribute split across multiple extents, is
- included. Another limitation is that write(2) will never create sparse
- files, since to mark a file sparse we need to modify the directory entry for
- the file and we do not implement directory modifications yet.
-
-Supported mount options
-=======================
-
-In addition to the generic mount options described by the manual page for the
-mount command (man 8 mount, also see man 5 fstab), the NTFS driver supports the
-following mount options:
-
-======================= =======================================================
-iocharset=name Deprecated option. Still supported but please use
- nls=name in the future. See description for nls=name.
-
-nls=name Character set to use when returning file names.
- Unlike VFAT, NTFS suppresses names that contain
- unconvertible characters. Note that most character
- sets contain insufficient characters to represent all
- possible Unicode characters that can exist on NTFS.
- To be sure you are not missing any files, you are
- advised to use nls=utf8 which is capable of
- representing all Unicode characters.
-
-utf8=<bool> Option no longer supported. Currently mapped to
- nls=utf8 but please use nls=utf8 in the future and
- make sure utf8 is compiled either as module or into
- the kernel. See description for nls=name.
-
-uid=
-gid=
-umask= Provide default owner, group, and access mode mask.
- These options work as documented in mount(8). By
- default, the files/directories are owned by root and
- he/she has read and write permissions, as well as
- browse permission for directories. No one else has any
- access permissions. I.e. the mode on all files is by
- default rw------- and for directories rwx------, a
- consequence of the default fmask=0177 and dmask=0077.
- Using a umask of zero will grant all permissions to
- everyone, i.e. all files and directories will have mode
- rwxrwxrwx.
-
-fmask=
-dmask= Instead of specifying umask which applies both to
- files and directories, fmask applies only to files and
- dmask only to directories.
-
-sloppy=<BOOL> If sloppy is specified, ignore unknown mount options.
- Otherwise the default behaviour is to abort mount if
- any unknown options are found.
-
-show_sys_files=<BOOL> If show_sys_files is specified, show the system files
- in directory listings. Otherwise the default behaviour
- is to hide the system files.
- Note that even when show_sys_files is specified, "$MFT"
- will not be visible due to bugs/mis-features in glibc.
- Further, note that irrespective of show_sys_files, all
- files are accessible by name, i.e. you can always do
- "ls -l \$UpCase" for example to specifically show the
- system file containing the Unicode upcase table.
-
-case_sensitive=<BOOL> If case_sensitive is specified, treat all file names as
- case sensitive and create file names in the POSIX
- namespace. Otherwise the default behaviour is to treat
- file names as case insensitive and to create file names
- in the WIN32/LONG name space. Note, the Linux NTFS
- driver will never create short file names and will
- remove them on rename/delete of the corresponding long
- file name.
- Note that files remain accessible via their short file
- name, if it exists. If case_sensitive, you will need
- to provide the correct case of the short file name.
-
-disable_sparse=<BOOL> If disable_sparse is specified, creation of sparse
- regions, i.e. holes, inside files is disabled for the
- volume (for the duration of this mount only). By
- default, creation of sparse regions is enabled, which
- is consistent with the behaviour of traditional Unix
- filesystems.
-
-errors=opt What to do when critical filesystem errors are found.
- Following values can be used for "opt":
-
- ======== =========================================
- continue DEFAULT, try to clean-up as much as
- possible, e.g. marking a corrupt inode as
- bad so it is no longer accessed, and then
- continue.
- recover At present only supported is recovery of
- the boot sector from the backup copy.
- If read-only mount, the recovery is done
- in memory only and not written to disk.
- ======== =========================================
-
- Note that the options are additive, i.e. specifying::
-
- errors=continue,errors=recover
-
- means the driver will attempt to recover and if that
- fails it will clean-up as much as possible and
- continue.
-
-mft_zone_multiplier= Set the MFT zone multiplier for the volume (this
- setting is not persistent across mounts and can be
- changed from mount to mount but cannot be changed on
- remount). Values of 1 to 4 are allowed, 1 being the
- default. The MFT zone multiplier determines how much
- space is reserved for the MFT on the volume. If all
- other space is used up, then the MFT zone will be
- shrunk dynamically, so this has no impact on the
- amount of free space. However, it can have an impact
- on performance by affecting fragmentation of the MFT.
- In general use the default. If you have a lot of small
- files then use a higher value. The values have the
- following meaning:
-
- ===== =================================
- Value MFT zone size (% of volume size)
- ===== =================================
- 1 12.5%
- 2 25%
- 3 37.5%
- 4 50%
- ===== =================================
-
- Note this option is irrelevant for read-only mounts.
-======================= =======================================================
-
-
-Known bugs and (mis-)features
-=============================
-
-- The link count on each directory inode entry is set to 1, due to Linux not
- supporting directory hard links. This may well confuse some user space
- applications, since the directory names will have the same inode numbers.
- This also speeds up ntfs_read_inode() immensely. And we haven't found any
- problems with this approach so far. If you find a problem with this, please
- let us know.
-
-
-Please send bug reports/comments/feedback/abuse to the Linux-NTFS development
-list at sourceforge: linux-ntfs-dev@lists.sourceforge.net
-
-
-Using NTFS volume and stripe sets
-=================================
-
-For support of volume and stripe sets, you can either use the kernel's
-Device-Mapper driver or the kernel's Software RAID / MD driver. The former is
-the recommended one to use for linear raid. But the latter is required for
-raid level 5. For striping and mirroring, either driver should work fine.
-
-
-The Device-Mapper driver
-------------------------
-
-You will need to create a table of the components of the volume/stripe set and
-how they fit together and load this into the kernel using the dmsetup utility
-(see man 8 dmsetup).
-
-Linear volume sets, i.e. linear raid, has been tested and works fine. Even
-though untested, there is no reason why stripe sets, i.e. raid level 0, and
-mirrors, i.e. raid level 1 should not work, too. Stripes with parity, i.e.
-raid level 5, unfortunately cannot work yet because the current version of the
-Device-Mapper driver does not support raid level 5. You may be able to use the
-Software RAID / MD driver for raid level 5, see the next section for details.
-
-To create the table describing your volume you will need to know each of its
-components and their sizes in sectors, i.e. multiples of 512-byte blocks.
-
-For NT4 fault tolerant volumes you can obtain the sizes using fdisk. So for
-example if one of your partitions is /dev/hda2 you would do::
-
- $ fdisk -ul /dev/hda
-
- Disk /dev/hda: 81.9 GB, 81964302336 bytes
- 255 heads, 63 sectors/track, 9964 cylinders, total 160086528 sectors
- Units = sectors of 1 * 512 = 512 bytes
-
- Device Boot Start End Blocks Id System
- /dev/hda1 * 63 4209029 2104483+ 83 Linux
- /dev/hda2 4209030 37768814 16779892+ 86 NTFS
- /dev/hda3 37768815 46170809 4200997+ 83 Linux
-
-And you would know that /dev/hda2 has a size of 37768814 - 4209030 + 1 =
-33559785 sectors.
-
-For Win2k and later dynamic disks, you can for example use the ldminfo utility
-which is part of the Linux LDM tools (the latest version at the time of
-writing is linux-ldm-0.0.8.tar.bz2). You can download it from:
-
- http://www.linux-ntfs.org/
-
-Simply extract the downloaded archive (tar xvjf linux-ldm-0.0.8.tar.bz2), go
-into it (cd linux-ldm-0.0.8) and change to the test directory (cd test). You
-will find the precompiled (i386) ldminfo utility there. NOTE: You will not be
-able to compile this yourself easily so use the binary version!
-
-Then you would use ldminfo in dump mode to obtain the necessary information::
-
- $ ./ldminfo --dump /dev/hda
-
-This would dump the LDM database found on /dev/hda which describes all of your
-dynamic disks and all the volumes on them. At the bottom you will see the
-VOLUME DEFINITIONS section which is all you really need. You may need to look
-further above to determine which of the disks in the volume definitions is
-which device in Linux. Hint: Run ldminfo on each of your dynamic disks and
-look at the Disk Id close to the top of the output for each (the PRIVATE HEADER
-section). You can then find these Disk Ids in the VBLK DATABASE section in the
-<Disk> components where you will get the LDM Name for the disk that is found in
-the VOLUME DEFINITIONS section.
-
-Note you will also need to enable the LDM driver in the Linux kernel. If your
-distribution did not enable it, you will need to recompile the kernel with it
-enabled. This will create the LDM partitions on each device at boot time. You
-would then use those devices (for /dev/hda they would be /dev/hda1, 2, 3, etc)
-in the Device-Mapper table.
-
-You can also bypass using the LDM driver by using the main device (e.g.
-/dev/hda) and then using the offsets of the LDM partitions into this device as
-the "Start sector of device" when creating the table. Once again ldminfo would
-give you the correct information to do this.
-
-Assuming you know all your devices and their sizes things are easy.
-
-For a linear raid the table would look like this (note all values are in
-512-byte sectors)::
-
- # Offset into Size of this Raid type Device Start sector
- # volume device of device
- 0 1028161 linear /dev/hda1 0
- 1028161 3903762 linear /dev/hdb2 0
- 4931923 2103211 linear /dev/hdc1 0
-
-For a striped volume, i.e. raid level 0, you will need to know the chunk size
-you used when creating the volume. Windows uses 64kiB as the default, so it
-will probably be this unless you changes the defaults when creating the array.
-
-For a raid level 0 the table would look like this (note all values are in
-512-byte sectors)::
-
- # Offset Size Raid Number Chunk 1st Start 2nd Start
- # into of the type of size Device in Device in
- # volume volume stripes device device
- 0 2056320 striped 2 128 /dev/hda1 0 /dev/hdb1 0
-
-If there are more than two devices, just add each of them to the end of the
-line.
-
-Finally, for a mirrored volume, i.e. raid level 1, the table would look like
-this (note all values are in 512-byte sectors)::
-
- # Ofs Size Raid Log Number Region Should Number Source Start Target Start
- # in of the type type of log size sync? of Device in Device in
- # vol volume params mirrors Device Device
- 0 2056320 mirror core 2 16 nosync 2 /dev/hda1 0 /dev/hdb1 0
-
-If you are mirroring to multiple devices you can specify further targets at the
-end of the line.
-
-Note the "Should sync?" parameter "nosync" means that the two mirrors are
-already in sync which will be the case on a clean shutdown of Windows. If the
-mirrors are not clean, you can specify the "sync" option instead of "nosync"
-and the Device-Mapper driver will then copy the entirety of the "Source Device"
-to the "Target Device" or if you specified multiple target devices to all of
-them.
-
-Once you have your table, save it in a file somewhere (e.g. /etc/ntfsvolume1),
-and hand it over to dmsetup to work with, like so::
-
- $ dmsetup create myvolume1 /etc/ntfsvolume1
-
-You can obviously replace "myvolume1" with whatever name you like.
-
-If it all worked, you will now have the device /dev/device-mapper/myvolume1
-which you can then just use as an argument to the mount command as usual to
-mount the ntfs volume. For example::
-
- $ mount -t ntfs -o ro /dev/device-mapper/myvolume1 /mnt/myvol1
-
-(You need to create the directory /mnt/myvol1 first and of course you can use
-anything you like instead of /mnt/myvol1 as long as it is an existing
-directory.)
-
-It is advisable to do the mount read-only to see if the volume has been setup
-correctly to avoid the possibility of causing damage to the data on the ntfs
-volume.
-
-
-The Software RAID / MD driver
------------------------------
-
-An alternative to using the Device-Mapper driver is to use the kernel's
-Software RAID / MD driver. For which you need to set up your /etc/raidtab
-appropriately (see man 5 raidtab).
-
-Linear volume sets, i.e. linear raid, as well as stripe sets, i.e. raid level
-0, have been tested and work fine (though see section "Limitations when using
-the MD driver with NTFS volumes" especially if you want to use linear raid).
-Even though untested, there is no reason why mirrors, i.e. raid level 1, and
-stripes with parity, i.e. raid level 5, should not work, too.
-
-You have to use the "persistent-superblock 0" option for each raid-disk in the
-NTFS volume/stripe you are configuring in /etc/raidtab as the persistent
-superblock used by the MD driver would damage the NTFS volume.
-
-Windows by default uses a stripe chunk size of 64k, so you probably want the
-"chunk-size 64k" option for each raid-disk, too.
-
-For example, if you have a stripe set consisting of two partitions /dev/hda5
-and /dev/hdb1 your /etc/raidtab would look like this::
-
- raiddev /dev/md0
- raid-level 0
- nr-raid-disks 2
- nr-spare-disks 0
- persistent-superblock 0
- chunk-size 64k
- device /dev/hda5
- raid-disk 0
- device /dev/hdb1
- raid-disk 1
-
-For linear raid, just change the raid-level above to "raid-level linear", for
-mirrors, change it to "raid-level 1", and for stripe sets with parity, change
-it to "raid-level 5".
-
-Note for stripe sets with parity you will also need to tell the MD driver
-which parity algorithm to use by specifying the option "parity-algorithm
-which", where you need to replace "which" with the name of the algorithm to
-use (see man 5 raidtab for available algorithms) and you will have to try the
-different available algorithms until you find one that works. Make sure you
-are working read-only when playing with this as you may damage your data
-otherwise. If you find which algorithm works please let us know (email the
-linux-ntfs developers list linux-ntfs-dev@lists.sourceforge.net or drop in on
-IRC in channel #ntfs on the irc.freenode.net network) so we can update this
-documentation.
-
-Once the raidtab is setup, run for example raid0run -a to start all devices or
-raid0run /dev/md0 to start a particular md device, in this case /dev/md0.
-
-Then just use the mount command as usual to mount the ntfs volume using for
-example::
-
- mount -t ntfs -o ro /dev/md0 /mnt/myntfsvolume
-
-It is advisable to do the mount read-only to see if the md volume has been
-setup correctly to avoid the possibility of causing damage to the data on the
-ntfs volume.
-
-
-Limitations when using the Software RAID / MD driver
------------------------------------------------------
-
-Using the md driver will not work properly if any of your NTFS partitions have
-an odd number of sectors. This is especially important for linear raid as all
-data after the first partition with an odd number of sectors will be offset by
-one or more sectors so if you mount such a partition with write support you
-will cause massive damage to the data on the volume which will only become
-apparent when you try to use the volume again under Windows.
-
-So when using linear raid, make sure that all your partitions have an even
-number of sectors BEFORE attempting to use it. You have been warned!
-
-Even better is to simply use the Device-Mapper for linear raid and then you do
-not have this problem with odd numbers of sectors.
diff --git a/Documentation/filesystems/overlayfs.rst b/Documentation/filesystems/overlayfs.rst
index 165514401441..6245b67ae9e0 100644
--- a/Documentation/filesystems/overlayfs.rst
+++ b/Documentation/filesystems/overlayfs.rst
@@ -156,7 +156,7 @@ A directory is made opaque by setting the xattr "trusted.overlay.opaque"
to "y". Where the upper filesystem contains an opaque directory, any
directory in the lower filesystem with the same name is ignored.
-An opaque directory should not conntain any whiteouts, because they do not
+An opaque directory should not contain any whiteouts, because they do not
serve any purpose. A merge directory containing regular files with the xattr
"trusted.overlay.whiteout", should be additionally marked by setting the xattr
"trusted.overlay.opaque" to "x" on the merge directory itself.
@@ -266,7 +266,7 @@ Non-directories
Objects that are not directories (files, symlinks, device-special
files etc.) are presented either from the upper or lower filesystem as
appropriate. When a file in the lower filesystem is accessed in a way
-the requires write-access, such as opening for write access, changing
+that requires write-access, such as opening for write access, changing
some metadata etc., the file is first copied from the lower filesystem
to the upper filesystem (copy_up). Note that creating a hard-link
also requires copy_up, though of course creation of a symlink does
@@ -367,8 +367,11 @@ Metadata only copy up
When the "metacopy" feature is enabled, overlayfs will only copy
up metadata (as opposed to whole file), when a metadata specific operation
-like chown/chmod is performed. Full file will be copied up later when
-file is opened for WRITE operation.
+like chown/chmod is performed. An upper file in this state is marked with
+the "trusted.overlay.metacopy" xattr, which indicates that the upper file
+contains no data. The data will be copied up later when the file is opened
+for a WRITE operation. After the lower file's data is copied up,
+the "trusted.overlay.metacopy" xattr is removed from the upper file.
In other words, this is delayed data copy up operation and data is copied
up when there is a need to actually modify data.
@@ -437,6 +440,23 @@ For example::
fsconfig(fs_fd, FSCONFIG_SET_STRING, "datadir+", "/do2", 0);
+Specifying layers via file descriptors
+--------------------------------------
+
+Since kernel v6.13, overlayfs supports specifying layers via file descriptors in
+addition to specifying them as paths. This feature is available for the
+"datadir+", "lowerdir+", "upperdir", and "workdir+" mount options with the
+fsconfig syscall from the new mount api::
+
+ fsconfig(fs_fd, FSCONFIG_SET_FD, "lowerdir+", NULL, fd_lower1);
+ fsconfig(fs_fd, FSCONFIG_SET_FD, "lowerdir+", NULL, fd_lower2);
+ fsconfig(fs_fd, FSCONFIG_SET_FD, "lowerdir+", NULL, fd_lower3);
+ fsconfig(fs_fd, FSCONFIG_SET_FD, "datadir+", NULL, fd_data1);
+ fsconfig(fs_fd, FSCONFIG_SET_FD, "datadir+", NULL, fd_data2);
+ fsconfig(fs_fd, FSCONFIG_SET_FD, "workdir", NULL, fd_work);
+ fsconfig(fs_fd, FSCONFIG_SET_FD, "upperdir", NULL, fd_upper);
+
+
fs-verity support
-----------------
@@ -529,8 +549,8 @@ Nesting overlayfs mounts
It is possible to use a lower directory that is stored on an overlayfs
mount. For regular files this does not need any special care. However, files
-that have overlayfs attributes, such as whiteouts or "overlay.*" xattrs will be
-interpreted by the underlying overlayfs mount and stripped out. In order to
+that have overlayfs attributes, such as whiteouts or "overlay.*" xattrs, will
+be interpreted by the underlying overlayfs mount and stripped out. In order to
allow the second overlayfs mount to see the attributes they must be escaped.
Overlayfs specific xattrs are escaped by using a special prefix of
diff --git a/Documentation/filesystems/path-lookup.rst b/Documentation/filesystems/path-lookup.rst
index 2b2df6aa5432..9ced1135608e 100644
--- a/Documentation/filesystems/path-lookup.rst
+++ b/Documentation/filesystems/path-lookup.rst
@@ -531,7 +531,7 @@ this retry process in the next article.
Automount points are locations in the filesystem where an attempt to
lookup a name can trigger changes to how that lookup should be
handled, in particular by mounting a filesystem there. These are
-covered in greater detail in autofs.txt in the Linux documentation
+covered in greater detail in autofs.rst in the Linux documentation
tree, but a few notes specifically related to path lookup are in order
here.
diff --git a/Documentation/filesystems/path-lookup.txt b/Documentation/filesystems/path-lookup.txt
index 1aa7ce099f6f..d2cf2852e1f8 100644
--- a/Documentation/filesystems/path-lookup.txt
+++ b/Documentation/filesystems/path-lookup.txt
@@ -379,4 +379,4 @@ Papers and other documentation on dcache locking
2. http://lse.sourceforge.net/locking/dcache/dcache.html
-3. path-lookup.md in this directory.
+3. path-lookup.rst in this directory.
diff --git a/Documentation/filesystems/porting.rst b/Documentation/filesystems/porting.rst
index 1be76ef117b3..1639e78e3146 100644
--- a/Documentation/filesystems/porting.rst
+++ b/Documentation/filesystems/porting.rst
@@ -177,7 +177,7 @@ settles down a bit.
**mandatory**
s_export_op is now required for exporting a filesystem.
-isofs, ext2, ext3, reiserfs, fat
+isofs, ext2, ext3, fat
can be used as examples of very different filesystems.
---
@@ -313,7 +313,7 @@ done.
**mandatory**
-block truncatation on error exit from ->write_begin, and ->direct_IO
+block truncation on error exit from ->write_begin, and ->direct_IO
moved from generic methods (block_write_begin, cont_write_begin,
nobh_write_begin, blockdev_direct_IO*) to callers. Take a look at
ext2_write_failed and callers for an example.
@@ -858,7 +858,7 @@ be misspelled d_alloc_anon().
**mandatory**
-[should've been added in 2016] stale comment in finish_open() nonwithstanding,
+[should've been added in 2016] stale comment in finish_open() notwithstanding,
failure exits in ->atomic_open() instances should *NOT* fput() the file,
no matter what. Everything is handled by the caller.
@@ -989,7 +989,7 @@ This mechanism would only work for a single device so the block layer couldn't
find the owning superblock of any additional devices.
In the old mechanism reusing or creating a superblock for a racing mount(2) and
-umount(2) relied on the file_system_type as the holder. This was severly
+umount(2) relied on the file_system_type as the holder. This was severely
underdocumented however:
(1) Any concurrent mounter that managed to grab an active reference on an
@@ -1134,3 +1134,26 @@ superblock of the main block device, i.e., the one stored in sb->s_bdev. Block
device freezing now works for any block device owned by a given superblock, not
just the main block device. The get_active_super() helper and bd_fsfreeze_sb
pointer are gone.
+
+---
+
+**mandatory**
+
+set_blocksize() takes opened struct file instead of struct block_device now
+and it *must* be opened exclusive.
+
+---
+
+**mandatory**
+
+->d_revalidate() gets two extra arguments - the inode of the parent directory
+and the name our dentry is expected to have. Both are stable (dir is pinned in
+the non-RCU case and will stay around during the call in the RCU case, and the
+name is guaranteed to stay unchanging). Your instance doesn't have to use
+either, but it often helps to avoid a lot of painful boilerplate.
+Note that while name->name is stable and NUL-terminated, it may (and
+often will) have name->name[name->len] equal to '/' rather than '\0' -
+in the normal case it points into the pathname being looked up.
+NOTE: if you need something like the full path from the root of the
+filesystem, you are still on your own - this assists with simple cases, but
+it's not magic.
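+
+A purely illustrative sketch (the examplefs_* names are made up and not part
+of any real filesystem)::
+
+	static int examplefs_d_revalidate(struct inode *dir,
+					  const struct qstr *name,
+					  struct dentry *dentry,
+					  unsigned int flags)
+	{
+		if (flags & LOOKUP_RCU)
+			return -ECHILD;
+
+		/* compare by name->len; name->name[name->len] may be '/' */
+		if (!examplefs_entry_matches(dir, name->name, name->len,
+					     d_inode(dentry)))
+			return 0;	/* stale */
+		return 1;		/* still valid */
+	}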
diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 104c6d047d9b..09f0aed5a08b 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -48,6 +48,7 @@ fixes/update part 1.1 Stefani Seibold <stefani@seibold.net> June 9 2009
3.11 /proc/<pid>/patch_state - Livepatch patch operation state
3.12 /proc/<pid>/arch_status - Task architecture specific information
3.13 /proc/<pid>/fd - List of symlinks to open files
+ 3.14 /proc/<pid>/ksm_stat - Information about the process's ksm status
4 Configuring procfs
4.1 Mount options
@@ -443,6 +444,15 @@ is not associated with a file:
or if empty, the mapping is anonymous.
+Starting with the 6.11 kernel, /proc/PID/maps provides an alternative
+ioctl()-based API that gives the ability to flexibly and efficiently query and
+filter individual VMAs. This interface is binary and is meant for more
+efficient and easy programmatic use. `struct procmap_query`, defined in the
+linux/fs.h UAPI header, serves as an input/output argument to the
+`PROCMAP_QUERY` ioctl() command. See the comments in the linux/fs.h UAPI
+header for details on query semantics, supported flags, data returned, and
+general API usage information.
+
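+A rough illustration only (the field names below are taken from the 6.11
+linux/fs.h definition as understood here; the header itself remains the
+authoritative reference), querying the VMA that covers a given address through
+an open /proc/<pid>/maps file descriptor::
+
+	#include <stdio.h>
+	#include <string.h>
+	#include <sys/ioctl.h>
+	#include <linux/fs.h>
+
+	static int query_vma(int maps_fd, unsigned long addr)
+	{
+		struct procmap_query q;
+
+		memset(&q, 0, sizeof(q));
+		q.size = sizeof(q);
+		q.query_addr = addr;
+		q.query_flags = 0;	/* plain "covering VMA" lookup */
+
+		if (ioctl(maps_fd, PROCMAP_QUERY, &q) < 0)
+			return -1;
+
+		printf("vma: %llx-%llx\n",
+		       (unsigned long long)q.vma_start,
+		       (unsigned long long)q.vma_end);
+		return 0;
+	}
+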
The /proc/PID/smaps is an extension based on maps, showing the memory
consumption for each of the process's mappings. For each mapping (aka Virtual
Memory Area, or VMA) there is a series of lines such as the following::
@@ -475,14 +485,15 @@ Memory Area, or VMA) there is a series of lines such as the following::
THPeligible: 0
VmFlags: rd ex mr mw me dw
-The first of these lines shows the same information as is displayed for the
-mapping in /proc/PID/maps. Following lines show the size of the mapping
-(size); the size of each page allocated when backing a VMA (KernelPageSize),
-which is usually the same as the size in the page table entries; the page size
-used by the MMU when backing a VMA (in most cases, the same as KernelPageSize);
-the amount of the mapping that is currently resident in RAM (RSS); the
-process' proportional share of this mapping (PSS); and the number of clean and
-dirty shared and private pages in the mapping.
+The first of these lines shows the same information as is displayed for
+the mapping in /proc/PID/maps. Following lines show the size of the
+mapping (size); the size of each page allocated when backing a VMA
+(KernelPageSize), which is usually the same as the size in the page table
+entries; the page size used by the MMU when backing a VMA (in most cases,
+the same as KernelPageSize); the amount of the mapping that is currently
+resident in RAM (RSS); the process's proportional share of this mapping
+(PSS); and the number of clean and dirty shared and private pages in the
+mapping.
The "proportional set size" (PSS) of a process is the count of pages it has
in memory, where each page is divided by the number of processes sharing it.
@@ -570,7 +581,8 @@ encoded manner. The codes are the following:
mt arm64 MTE allocation tags are enabled
um userfaultfd missing tracking
uw userfaultfd wr-protect tracking
- ss shadow stack page
+ ss shadow/guarded control stack page
+ sl sealed
== =======================================
Note that there is no guarantee that every flag and associated mnemonic will
@@ -688,6 +700,7 @@ files are there, and which are missing.
============ ===============================================================
File Content
============ ===============================================================
+ allocinfo Memory allocations profiling information
apm Advanced power management info
bootconfig Kernel command line obtained from boot config,
and, if there were kernel parameters from the
@@ -953,6 +966,35 @@ also be allocatable although a lot of filesystem metadata may have to be
reclaimed to achieve this.
+allocinfo
+~~~~~~~~~
+
+Provides information about memory allocations at all locations in the code
+base. Each allocation in the code is identified by its source file, line
+number, module (if it originates from a loadable module) and the function
+calling the allocation. The number of bytes allocated and the number of calls
+at each location are reported. The first line indicates the version of the
+file; the second line is the header listing the fields in the file.
+
+Example output.
+
+::
+
+ > tail -n +3 /proc/allocinfo | sort -rn
+ 127664128 31168 mm/page_ext.c:270 func:alloc_page_ext
+ 56373248 4737 mm/slub.c:2259 func:alloc_slab_page
+ 14880768 3633 mm/readahead.c:247 func:page_cache_ra_unbounded
+ 14417920 3520 mm/mm_init.c:2530 func:alloc_large_system_hash
+ 13377536 234 block/blk-mq.c:3421 func:blk_mq_alloc_rqs
+ 11718656 2861 mm/filemap.c:1919 func:__filemap_get_folio
+ 9192960 2800 kernel/fork.c:307 func:alloc_thread_stack_node
+ 4206592 4 net/netfilter/nf_conntrack_core.c:2567 func:nf_ct_alloc_hashtable
+ 4136960 1010 drivers/staging/ctagmod/ctagmod.c:20 [ctagmod] func:ctagmod_start
+ 3940352 962 mm/memory.c:4214 func:alloc_anon_folio
+ 2894464 22613 fs/kernfs/dir.c:615 func:__kernfs_new_node
+ ...
+
+
meminfo
~~~~~~~
@@ -1110,8 +1152,8 @@ KernelStack
PageTables
Memory consumed by userspace page tables
SecPageTables
- Memory consumed by secondary page tables, this currently
- currently includes KVM mmu allocations on x86 and arm64.
+ Memory consumed by secondary page tables; this currently includes
+ KVM mmu and IOMMU allocations on x86 and arm64.
NFS_Unstable
Always zero. Previous counted pages which had been written to
the server, but has not been committed to stable storage.
@@ -1899,8 +1941,8 @@ For more information on mount propagation see:
These files provide a method to access a task's comm value. It also allows for
a task to set its own or one of its thread siblings comm value. The comm value
is limited in size compared to the cmdline value, so writing anything longer
-then the kernel's TASK_COMM_LEN (currently 16 chars) will result in a truncated
-comm value.
+than the kernel's TASK_COMM_LEN (currently 16 chars, including the NUL
+terminator) will result in a truncated comm value.
3.7 /proc/<pid>/task/<tid>/children - Information about task children
@@ -2192,6 +2234,74 @@ The number of open files for the process is stored in 'size' member
of stat() output for /proc/<pid>/fd for fast access.
-------------------------------------------------------
+3.14 /proc/<pid>/ksm_stat - Information about the process's ksm status
+-----------------------------------------------------------------------
+When CONFIG_KSM is enabled, each process has this file, which shows
+information about its KSM merging status.
+
+Example
+~~~~~~~
+
+::
+
+ / # cat /proc/self/ksm_stat
+ ksm_rmap_items 0
+ ksm_zero_pages 0
+ ksm_merging_pages 0
+ ksm_process_profit 0
+ ksm_merge_any: no
+ ksm_mergeable: no
+
+Description
+~~~~~~~~~~~
+
+ksm_rmap_items
+^^^^^^^^^^^^^^
+
+The number of ksm_rmap_item structures in use. The structure
+ksm_rmap_item stores the reverse mapping information for virtual
+addresses. KSM will generate a ksm_rmap_item for each ksm-scanned page of
+the process.
+
+ksm_zero_pages
+^^^^^^^^^^^^^^
+
+When /sys/kernel/mm/ksm/use_zero_pages is enabled, it represents how many
+empty pages are merged with kernel zero pages by KSM.
+
+ksm_merging_pages
+^^^^^^^^^^^^^^^^^
+
+It represents how many pages of this process are involved in KSM merging
+(not including ksm_zero_pages). It is the same value that
+/proc/<pid>/ksm_merging_pages shows.
+
+ksm_process_profit
+^^^^^^^^^^^^^^^^^^
+
+The profit that KSM brings (saved bytes). KSM can save memory by merging
+identical pages, but it can also consume additional memory, because it needs
+to generate a number of rmap_items to save each scanned page's brief rmap
+information. Some of these pages may be merged, but some may never be merged
+even after being checked several times; for such pages the memory consumed is
+unprofitable.
+
+ksm_merge_any
+^^^^^^^^^^^^^
+
+It specifies whether the process's mm has been added by prctl() into the
+candidate list of KSM, and whether KSM scanning is fully enabled at the
+process level.
+
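+A process opts in for itself with, for instance (illustrative only; the
+prctl flag has been available since Linux 6.4)::
+
+	#include <sys/prctl.h>
+
+	prctl(PR_SET_MEMORY_MERGE, 1, 0, 0, 0);
+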
+ksm_mergeable
+^^^^^^^^^^^^^
+
+It specifies whether any VMAs of the process's mm are currently applicable
+to KSM.
+
+More information about KSM can be found in
+Documentation/admin-guide/mm/ksm.rst.
+
Chapter 4: Configuring procfs
=============================
@@ -2221,7 +2331,7 @@ arguments are now protected against local eavesdroppers.
hidepid=invisible or hidepid=2 means hidepid=1 plus all /proc/<pid>/ will be
fully invisible to other users. It doesn't mean that it hides a fact whether a
process with a specific pid value exists (it can be learned by other means, e.g.
-by "kill -0 $PID"), but it hides process' uid and gid, which may be learned by
+by "kill -0 $PID"), but it hides process's uid and gid, which may be learned by
stat()'ing /proc/<pid>/ otherwise. It greatly complicates an intruder's task of
gathering information about running processes, whether some daemon runs with
elevated privileges, whether other user runs some sensitive program, whether
diff --git a/Documentation/filesystems/ramfs-rootfs-initramfs.rst b/Documentation/filesystems/ramfs-rootfs-initramfs.rst
index 447f767c6462..fa4f81099cb4 100644
--- a/Documentation/filesystems/ramfs-rootfs-initramfs.rst
+++ b/Documentation/filesystems/ramfs-rootfs-initramfs.rst
@@ -315,7 +315,7 @@ the above threads) is:
2) The cpio archive format chosen by the kernel is simpler and cleaner (and
thus easier to create and parse) than any of the (literally dozens of)
various tar archive formats. The complete initramfs archive format is
- explained in buffer-format.txt, created in usr/gen_init_cpio.c, and
+ explained in buffer-format.rst, created in usr/gen_init_cpio.c, and
extracted in init/initramfs.c. All three together come to less than 26k
total of human-readable text.
diff --git a/Documentation/filesystems/smb/ksmbd.rst b/Documentation/filesystems/smb/ksmbd.rst
index 6b30e43a0d11..67cb68ea6e68 100644
--- a/Documentation/filesystems/smb/ksmbd.rst
+++ b/Documentation/filesystems/smb/ksmbd.rst
@@ -13,7 +13,7 @@ KSMBD architecture
The subset of performance related operations belong in kernelspace and
the other subset which belong to operations which are not really related with
performance in userspace. So, DCE/RPC management that has historically resulted
-into number of buffer overflow issues and dangerous security bugs and user
+into a number of buffer overflow issues and dangerous security bugs and user
account management are implemented in user space as ksmbd.mountd.
File operations that are related with performance (open/read/write/close etc.)
in kernel space (ksmbd). This also allows for easier integration with VFS
@@ -24,8 +24,8 @@ ksmbd (kernel daemon)
When the server daemon is started, It starts up a forker thread
(ksmbd/interface name) at initialization time and open a dedicated port 445
-for listening to SMB requests. Whenever new clients make request, Forker
-thread will accept the client connection and fork a new thread for dedicated
+for listening to SMB requests. Whenever new clients make a request, the Forker
+thread will accept the client connection and fork a new thread for a dedicated
communication channel between the client and the server. It allows for parallel
processing of SMB requests(commands) from clients as well as allowing for new
clients to make new connections. Each instance is named ksmbd/1~n(port number)
@@ -34,12 +34,12 @@ thread can decide to pass through the commands to the user space (ksmbd.mountd),
currently DCE/RPC commands are identified to be handled through the user space.
To further utilize the linux kernel, it has been chosen to process the commands
as workitems and to be executed in the handlers of the ksmbd-io kworker threads.
-It allows for multiplexing of the handlers as the kernel take care of initiating
+It allows for multiplexing of the handlers as the kernel takes care of initiating
extra worker threads if the load is increased and vice versa, if the load is
-decreased it destroys the extra worker threads. So, after connection is
-established with client. Dedicated ksmbd/1..n(port number) takes complete
+decreased, it destroys the extra worker threads. So, after the connection is
+established with the client, the dedicated ksmbd/1..n(port number) takes complete
ownership of receiving/parsing of SMB commands. Each received command is worked
-in parallel i.e., There can be multiple clients commands which are worked in
+in parallel i.e., there can be multiple client commands which are worked in
parallel. After receiving each command a separated kernel workitem is prepared
for each command which is further queued to be handled by ksmbd-io kworkers.
So, each SMB workitem is queued to the kworkers. This allows the benefit of load
@@ -49,9 +49,9 @@ performance by handling client commands in parallel.
ksmbd.mountd (user space daemon)
--------------------------------
-ksmbd.mountd is userspace process to, transfer user account and password that
+ksmbd.mountd is a userspace process that transfers the user account and password that
are registered using ksmbd.adduser (part of utils for user space). Further it
-allows sharing information parameters that parsed from smb.conf to ksmbd in
+allows passing the share information parameters that are parsed from smb.conf to ksmbd in the
kernel. For the execution part it has a daemon which is continuously running
and connected to the kernel interface using netlink socket, it waits for the
requests (dcerpc and share/user info). It handles RPC calls (at a minimum few
@@ -124,7 +124,7 @@ How to run
1. Download ksmbd-tools(https://github.com/cifsd-team/ksmbd-tools/releases) and
compile them.
- - Refer README(https://github.com/cifsd-team/ksmbd-tools/blob/master/README.md)
+ - Refer to README(https://github.com/cifsd-team/ksmbd-tools/blob/master/README.md)
to know how to use ksmbd.mountd/adduser/addshare/control utils
$ ./autogen.sh
@@ -133,7 +133,7 @@ How to run
2. Create /usr/local/etc/ksmbd/ksmbd.conf file, add SMB share in ksmbd.conf file.
- - Refer ksmbd.conf.example in ksmbd-utils, See ksmbd.conf manpage
+ - Refer to ksmbd.conf.example in ksmbd-utils; see the ksmbd.conf manpage
for details to configure shares.
$ man ksmbd.conf
@@ -145,7 +145,7 @@ How to run
$ man ksmbd.adduser
$ sudo ksmbd.adduser -a <Enter USERNAME for SMB share access>
-4. Insert ksmbd.ko module after build your kernel. No need to load module
+4. Insert the ksmbd.ko module after you build your kernel. No need to load the module
if ksmbd is built into the kernel.
- Set ksmbd in menuconfig(e.g. $ make menuconfig)
@@ -175,7 +175,7 @@ Each layer
1. Enable all component prints
# sudo ksmbd.control -d "all"
-2. Enable one of components (smb, auth, vfs, oplock, ipc, conn, rdma)
+2. Enable one of the components (smb, auth, vfs, oplock, ipc, conn, rdma)
# sudo ksmbd.control -d "smb"
3. Show what prints are enabled.
diff --git a/Documentation/filesystems/squashfs.rst b/Documentation/filesystems/squashfs.rst
index 4af8d6207509..45653b3228f9 100644
--- a/Documentation/filesystems/squashfs.rst
+++ b/Documentation/filesystems/squashfs.rst
@@ -6,7 +6,7 @@ Squashfs 4.0 Filesystem
Squashfs is a compressed read-only filesystem for Linux.
-It uses zlib, lz4, lzo, or xz compression to compress files, inodes and
+It uses zlib, lz4, lzo, xz or zstd compression to compress files, inodes and
directories. Inodes in the system are very small and all blocks are packed to
minimise data overhead. Block sizes greater than 4K are supported up to a
maximum of 1Mbytes (default block size 128K).
@@ -16,8 +16,8 @@ use (i.e. in cases where a .tar.gz file may be used), and in constrained
block device/memory systems (e.g. embedded systems) where low overhead is
needed.
-Mailing list: squashfs-devel@lists.sourceforge.net
-Web site: www.squashfs.org
+Mailing list (kernel code): linux-fsdevel@vger.kernel.org
+Web site: github.com/plougher/squashfs-tools
1. Filesystem Features
----------------------
@@ -58,11 +58,9 @@ inodes have different sizes).
As squashfs is a read-only filesystem, the mksquashfs program must be used to
create populated squashfs filesystems. This and other squashfs utilities
-can be obtained from http://www.squashfs.org. Usage instructions can be
-obtained from this site also.
-
-The squashfs-tools development tree is now located on kernel.org
- git://git.kernel.org/pub/scm/fs/squashfs/squashfs-tools.git
+are very likely packaged by your Linux distribution (as squashfs-tools).
+The source code can be obtained from github.com/plougher/squashfs-tools.
+Usage instructions can also be obtained from this site.
2.1 Mount options
-----------------
diff --git a/Documentation/filesystems/tmpfs.rst b/Documentation/filesystems/tmpfs.rst
index 56a26c843dbe..d677e0428c3f 100644
--- a/Documentation/filesystems/tmpfs.rst
+++ b/Documentation/filesystems/tmpfs.rst
@@ -241,6 +241,28 @@ So 'mount -t tmpfs -o size=10G,nr_inodes=10k,mode=700 tmpfs /mytmpfs'
will give you tmpfs instance on /mytmpfs which can allocate 10GB
RAM/SWAP in 10240 inodes and it is only accessible by root.
+tmpfs has the following mounting options for case-insensitive lookup support:
+
+================= ==============================================================
+casefold Enable casefold support at this mount point using the given
+ argument as the encoding standard. Currently only UTF-8
+ encodings are supported. If no argument is used, it will load
+ the latest UTF-8 encoding available.
+strict_encoding Enable strict encoding at this mount point (disabled by
+ default). In this mode, the filesystem refuses to create files
+ and directories with names containing invalid UTF-8 characters.
+================= ==============================================================
+
+This option doesn't render the entire filesystem case-insensitive. One still
+needs to set the casefold flag per directory, by flipping the +F attribute on
+an empty directory; new directories will then inherit the attribute. The
+mountpoint itself cannot be made case-insensitive.
+
+Example::
+
+ $ mount -t tmpfs -o casefold=utf8-12.1.0,strict_encoding fs_name /mytmpfs
+ $ mount -t tmpfs -o casefold fs_name /mytmpfs
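+
+A directory on such a mount is then made case-insensitive by setting the
+casefold (+F) attribute on it while it is still empty, for example with
+chattr(1), or programmatically via the inode flags ioctl as in this
+illustrative sketch (path and helper name are made up)::
+
+	#include <fcntl.h>
+	#include <sys/ioctl.h>
+	#include <unistd.h>
+	#include <linux/fs.h>
+
+	static int make_dir_casefolded(const char *path)
+	{
+		int fd = open(path, O_RDONLY | O_DIRECTORY);
+		int flags = 0;
+		int ret;
+
+		if (fd < 0)
+			return -1;
+		ioctl(fd, FS_IOC_GETFLAGS, &flags);
+		flags |= FS_CASEFOLD_FL;
+		ret = ioctl(fd, FS_IOC_SETFLAGS, &flags);
+		close(fd);
+		return ret;
+	}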
+
:Author:
Christoph Rohland <cr@sap.com>, 1.12.01
@@ -250,3 +272,5 @@ RAM/SWAP in 10240 inodes and it is only accessible by root.
KOSAKI Motohiro, 16 Mar 2010
:Updated:
Chris Down, 13 July 2020
+:Updated:
+ André Almeida, 23 Aug 2024
diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
index eebcc0f9e2bc..31eea688609a 100644
--- a/Documentation/filesystems/vfs.rst
+++ b/Documentation/filesystems/vfs.rst
@@ -810,7 +810,7 @@ cache in your filesystem. The following members are defined:
struct page **pagep, void **fsdata);
int (*write_end)(struct file *, struct address_space *mapping,
loff_t pos, unsigned len, unsigned copied,
- struct page *page, void *fsdata);
+ struct folio *folio, void *fsdata);
sector_t (*bmap)(struct address_space *, sector_t);
void (*invalidate_folio) (struct folio *, size_t start, size_t len);
bool (*release_folio)(struct folio *, gfp_t);
@@ -913,8 +913,7 @@ cache in your filesystem. The following members are defined:
stop attempting I/O, it can simply return. The caller will
remove the remaining pages from the address space, unlock them
and decrement the page refcount. Set PageUptodate if the I/O
- completes successfully. Setting PageError on any page will be
- ignored; simply unlock the page if an I/O error occurs.
+ completes successfully.
``write_begin``
Called by the generic buffered write code to ask the filesystem
@@ -926,12 +925,12 @@ cache in your filesystem. The following members are defined:
(if they haven't been read already) so that the updated blocks
can be written out properly.
- The filesystem must return the locked pagecache page for the
- specified offset, in ``*pagep``, for the caller to write into.
+ The filesystem must return the locked pagecache folio for the
+ specified offset, in ``*foliop``, for the caller to write into.
It must be able to cope with short writes (where the length
passed to write_begin is greater than the number of bytes copied
- into the page).
+ into the folio).
A void * may be returned in fsdata, which then gets passed into
write_end.
@@ -944,8 +943,8 @@ cache in your filesystem. The following members are defined:
called. len is the original len passed to write_begin, and
copied is the amount that was able to be copied.
- The filesystem must take care of unlocking the page and
- releasing it refcount, and updating i_size.
+ The filesystem must take care of unlocking the folio,
+ decrementing its refcount, and updating i_size.
Returns < 0 on failure, otherwise the number of bytes (<=
'copied') that were able to be copied into pagecache.
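+
+    A purely illustrative sketch of a minimal folio-based instance
+    (simplified: it ignores short copies into folios that are not yet
+    uptodate, and the examplefs name is made up)::
+
+        static int examplefs_write_end(struct file *file,
+                        struct address_space *mapping,
+                        loff_t pos, unsigned len, unsigned copied,
+                        struct folio *folio, void *fsdata)
+        {
+                struct inode *inode = mapping->host;
+
+                folio_mark_dirty(folio);
+                folio_unlock(folio);
+                folio_put(folio);
+                if (pos + copied > i_size_read(inode))
+                        i_size_write(inode, pos + copied);
+                return copied;
+        }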
@@ -1252,7 +1251,8 @@ defined:
.. code-block:: c
struct dentry_operations {
- int (*d_revalidate)(struct dentry *, unsigned int);
+ int (*d_revalidate)(struct inode *, const struct qstr *,
+ struct dentry *, unsigned int);
int (*d_weak_revalidate)(struct dentry *, unsigned int);
int (*d_hash)(const struct dentry *, struct qstr *);
int (*d_compare)(const struct dentry *,
@@ -1264,7 +1264,9 @@ defined:
char *(*d_dname)(struct dentry *, char *, int);
struct vfsmount *(*d_automount)(struct path *);
int (*d_manage)(const struct path *, bool);
- struct dentry *(*d_real)(struct dentry *, const struct inode *);
+ struct dentry *(*d_real)(struct dentry *, enum d_real_type type);
+ bool (*d_unalias_trylock)(const struct dentry *);
+ void (*d_unalias_unlock)(const struct dentry *);
};
``d_revalidate``
@@ -1419,16 +1421,33 @@ defined:
the dentry being transited from.
``d_real``
- overlay/union type filesystems implement this method to return
- one of the underlying dentries hidden by the overlay. It is
- used in two different modes:
+ overlay/union type filesystems implement this method to return one
+ of the underlying dentries of a regular file hidden by the overlay.
- Called from file_dentry() it returns the real dentry matching
- the inode argument. The real dentry may be from a lower layer
- already copied up, but still referenced from the file. This
- mode is selected with a non-NULL inode argument.
+ The 'type' argument takes the values D_REAL_DATA or D_REAL_METADATA
+ for returning the real underlying dentry that refers to the inode
+ hosting the file's data or metadata respectively.
+
+ For non-regular files, the 'dentry' argument is returned.
+
+``d_unalias_trylock``
+ if present, will be called by d_splice_alias() before moving a
+ preexisting attached alias. Returning false prevents __d_move(),
+ making d_splice_alias() fail with -ESTALE.
+
+ Rationale: setting FS_RENAME_DOES_D_MOVE will prevent d_move()
+ and d_exchange() calls from the outside of filesystem methods;
+ however, it does not guarantee that attached dentries won't
+ be renamed or moved by d_splice_alias() finding a preexisting
+ alias for a directory inode. Normally we would not care;
+ however, something that wants to stabilize the entire path to
+ root over a blocking operation might need that. See 9p for one
+ (and hopefully only) example.
+
+``d_unalias_unlock``
+ should be paired with ``d_unalias_trylock``; that one is called after
+ __d_move() call in __d_unalias().
- With NULL inode the topmost real underlying dentry is returned.
Each dentry has a pointer to its parent dentry, as well as a hash list
of child dentries. Child dentries are basically like files in a
diff --git a/Documentation/filesystems/xfs/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs/xfs-online-fsck-design.rst
index 352516feef6f..12aa63840830 100644
--- a/Documentation/filesystems/xfs/xfs-online-fsck-design.rst
+++ b/Documentation/filesystems/xfs/xfs-online-fsck-design.rst
@@ -1915,19 +1915,13 @@ four of those five higher level data structures.
The fifth use case is discussed in the :ref:`realtime summary <rtsummary>` case
study.
-The most general storage interface supported by the xfile enables the reading
-and writing of arbitrary quantities of data at arbitrary offsets in the xfile.
-This capability is provided by ``xfile_pread`` and ``xfile_pwrite`` functions,
-which behave similarly to their userspace counterparts.
XFS is very record-based, which suggests that the ability to load and store
complete records is important.
-To support these cases, a pair of ``xfile_obj_load`` and ``xfile_obj_store``
-functions are provided to read and persist objects into an xfile.
-They are internally the same as pread and pwrite, except that they treat any
-error as an out of memory error.
-For online repair, squashing error conditions in this manner is an acceptable
-behavior because the only reaction is to abort the operation back to userspace.
-All five xfile usecases can be serviced by these four functions.
+To support these cases, a pair of functions, ``xfile_load`` and
+``xfile_store``, are provided to read and persist objects into an xfile; they
+treat any error as an out of memory error. For online repair, squashing error
+conditions in this manner is an acceptable behavior because the only reaction
+is to abort the operation back to userspace.
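+
+A purely illustrative sketch, under the assumption that both functions take a
+buffer, a byte count, and a byte offset (see fs/xfs/scrub/xfile.h for the
+authoritative prototypes); the record type and helper name are made up, and
+``xfile_load`` is used the same way to read the record back::
+
+	struct xrep_fakerec {
+		uint64_t	magic;
+	};
+
+	static int xrep_stash_fakerec(struct xfile *xf, uint64_t idx,
+				      const struct xrep_fakerec *rec)
+	{
+		loff_t	pos = idx * sizeof(*rec);
+
+		/* any failure is reported as an out of memory error */
+		return xfile_store(xf, rec, sizeof(*rec), pos);
+	}
+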
However, no discussion of file access idioms is complete without answering the
question, "But what about mmap?"
@@ -1939,15 +1933,14 @@ tmpfs can only push a pagecache folio to the swap cache if the folio is neither
pinned nor locked, which means the xfile must not pin too many folios.
Short term direct access to xfile contents is done by locking the pagecache
-folio and mapping it into kernel address space.
-Programmatic access (e.g. pread and pwrite) uses this mechanism.
-Folio locks are not supposed to be held for long periods of time, so long
-term direct access to xfile contents is done by bumping the folio refcount,
+folio and mapping it into kernel address space. Object loads and stores use this
+mechanism. Folio locks are not supposed to be held for long periods of time, so
+long term direct access to xfile contents is done by bumping the folio refcount,
mapping it into kernel address space, and dropping the folio lock.
These long term users *must* be responsive to memory reclaim by hooking into
the shrinker infrastructure to know when to release folios.
-The ``xfile_get_page`` and ``xfile_put_page`` functions are provided to
+The ``xfile_get_folio`` and ``xfile_put_folio`` functions are provided to
retrieve the (locked) folio that backs part of an xfile and to release it.
The only code to use these folio lease functions are the xfarray
:ref:`sorting<xfarray_sort>` algorithms and the :ref:`in-memory
@@ -2174,7 +2167,7 @@ The ``xfblob_free`` function frees a specific blob, and the ``xfblob_truncate``
function frees them all because compaction is not needed.
The details of repairing directories and extended attributes will be discussed
-in a subsequent section about atomic extent swapping.
+in a subsequent section about atomic file content exchanges.
However, it should be noted that these repair functions only use blob storage
to cache a small number of entries before adding them to a temporary ondisk
file, which is why compaction is not required.
@@ -2277,13 +2270,12 @@ follows:
pointing to the xfile.
3. Pass the buffer cache target, buffer ops, and other information to
- ``xfbtree_create`` to write an initial tree header and root block to the
- xfile.
+ ``xfbtree_init`` to initialize the passed in ``struct xfbtree`` and write an
+ initial root block to the xfile.
Each btree type should define a wrapper that passes necessary arguments to
the creation function.
For example, rmap btrees define ``xfs_rmapbt_mem_create`` to take care of
all the necessary details for callers.
- A ``struct xfbtree`` object will be returned.
4. Pass the xfbtree object to the btree cursor creation function for the
btree type.
@@ -2810,7 +2802,8 @@ follows this format:
Repairs for file-based metadata such as extended attributes, directories,
symbolic links, quota files and realtime bitmaps are performed by building a
-new structure attached to a temporary file and swapping the forks.
+new structure attached to a temporary file and exchanging all mappings in the
+file forks.
Afterward, the mappings in the old file fork are the candidate blocks for
disposal.
@@ -3859,8 +3852,8 @@ Because file forks can consume as much space as the entire filesystem, repairs
cannot be staged in memory, even when a paging scheme is available.
Therefore, online repair of file-based metadata createas a temporary file in
the XFS filesystem, writes a new structure at the correct offsets into the
-temporary file, and atomically swaps the fork mappings (and hence the fork
-contents) to commit the repair.
+temporary file, and atomically exchanges all file fork mappings (and hence the
+fork contents) to commit the repair.
Once the repair is complete, the old fork can be reaped as necessary; if the
system goes down during the reap, the iunlink code will delete the blocks
during log recovery.
@@ -3870,10 +3863,11 @@ consistent to use a temporary file safely!
This dependency is the reason why online repair can only use pageable kernel
memory to stage ondisk space usage information.
-Swapping metadata extents with a temporary file requires the owner field of the
-block headers to match the file being repaired and not the temporary file. The
-directory, extended attribute, and symbolic link functions were all modified to
-allow callers to specify owner numbers explicitly.
+Exchanging metadata file mappings with a temporary file requires the owner
+field of the block headers to match the file being repaired and not the
+temporary file.
+The directory, extended attribute, and symbolic link functions were all
+modified to allow callers to specify owner numbers explicitly.
There is a downside to the reaping process -- if the system crashes during the
reap phase and the fork extents are crosslinked, the iunlink processing will
@@ -3982,8 +3976,8 @@ The proposed patches are in the
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-tempfiles>`_
series.
-Atomic Extent Swapping
-----------------------
+Logged File Content Exchanges
+-----------------------------
Once repair builds a temporary file with a new data structure written into
it, it must commit the new changes into the existing file.
@@ -4018,17 +4012,21 @@ e. Old blocks in the file may be cross-linked with another structure and must
These problems are overcome by creating a new deferred operation and a new type
of log intent item to track the progress of an operation to exchange two file
ranges.
-The new deferred operation type chains together the same transactions used by
-the reverse-mapping extent swap code.
+The new exchange operation type chains together the same transactions used by
+the reverse-mapping extent swap code, but records intermediate progress in the
+log so that operations can be restarted after a crash.
+This new functionality is called the file contents exchange (xfs_exchrange)
+code.
+The underlying implementation exchanges file fork mappings (xfs_exchmaps).
The new log item records the progress of the exchange to ensure that once an
exchange begins, it will always run to completion, even there are
interruptions.
-The new ``XFS_SB_FEAT_INCOMPAT_LOG_ATOMIC_SWAP`` log-incompatible feature flag
+The new ``XFS_SB_FEAT_INCOMPAT_EXCHRANGE`` incompatible feature flag
in the superblock protects these new log item records from being replayed on
old kernels.
The proposed patchset is the
-`atomic extent swap
+`file contents exchange
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=atomic-file-updates>`_
series.
@@ -4055,9 +4053,6 @@ series.
| one ``struct rw_semaphore`` for each feature. |
| The log cleaning code tries to take this rwsem in exclusive mode to |
| clear the bit; if the lock attempt fails, the feature bit remains set. |
-| Filesystem code signals its intention to use a log incompat feature in a |
-| transaction by calling ``xlog_use_incompat_feat``, which takes the rwsem |
-| in shared mode. |
| The code supporting a log incompat feature should create wrapper |
| functions to obtain the log feature and call |
| ``xfs_add_incompat_log_feature`` to set the feature bits in the primary |
@@ -4072,72 +4067,73 @@ series.
| The feature bit will not be cleared from the superblock until the log |
| becomes clean. |
| |
-| Log-assisted extended attribute updates and atomic extent swaps both use |
-| log incompat features and provide convenience wrappers around the |
+| Log-assisted extended attribute updates and file content exchanges both  |
+| use log incompat features and provide convenience wrappers around the |
| functionality. |
+--------------------------------------------------------------------------+
-Mechanics of an Atomic Extent Swap
-``````````````````````````````````
+Mechanics of a Logged File Content Exchange
+```````````````````````````````````````````
-Swapping entire file forks is a complex task.
+Exchanging contents between file forks is a complex task.
The goal is to exchange all file fork mappings between two file fork offset
ranges.
There are likely to be many extent mappings in each fork, and the edges of
the mappings aren't necessarily aligned.
-Furthermore, there may be other updates that need to happen after the swap,
+Furthermore, there may be other updates that need to happen after the exchange,
such as exchanging file sizes, inode flags, or conversion of fork data to local
format.
-This is roughly the format of the new deferred extent swap work item:
+This is roughly the format of the new deferred mapping exchange work item:
.. code-block:: c
- struct xfs_swapext_intent {
+ struct xfs_exchmaps_intent {
/* Inodes participating in the operation. */
- struct xfs_inode *sxi_ip1;
- struct xfs_inode *sxi_ip2;
+ struct xfs_inode *xmi_ip1;
+ struct xfs_inode *xmi_ip2;
/* File offset range information. */
- xfs_fileoff_t sxi_startoff1;
- xfs_fileoff_t sxi_startoff2;
- xfs_filblks_t sxi_blockcount;
+ xfs_fileoff_t xmi_startoff1;
+ xfs_fileoff_t xmi_startoff2;
+ xfs_filblks_t xmi_blockcount;
/* Set these file sizes after the operation, unless negative. */
- xfs_fsize_t sxi_isize1;
- xfs_fsize_t sxi_isize2;
+ xfs_fsize_t xmi_isize1;
+ xfs_fsize_t xmi_isize2;
- /* XFS_SWAP_EXT_* log operation flags */
- uint64_t sxi_flags;
+ /* XFS_EXCHMAPS_* log operation flags */
+ uint64_t xmi_flags;
};
The new log intent item contains enough information to track two logical fork
offset ranges: ``(inode1, startoff1, blockcount)`` and ``(inode2, startoff2,
blockcount)``.
-Each step of a swap operation exchanges the largest file range mapping possible
-from one file to the other.
-After each step in the swap operation, the two startoff fields are incremented
-and the blockcount field is decremented to reflect the progress made.
-The flags field captures behavioral parameters such as swapping the attr fork
-instead of the data fork and other work to be done after the extent swap.
-The two isize fields are used to swap the file size at the end of the operation
-if the file data fork is the target of the swap operation.
-
-When the extent swap is initiated, the sequence of operations is as follows:
-
-1. Create a deferred work item for the extent swap.
- At the start, it should contain the entirety of the file ranges to be
- swapped.
+Each step of an exchange operation exchanges the largest file range mapping
+possible from one file to the other.
+After each step in the exchange operation, the two startoff fields are
+incremented and the blockcount field is decremented to reflect the progress
+made.
+The flags field captures behavioral parameters such as exchanging attr fork
+mappings instead of the data fork and other work to be done after the exchange.
+The two isize fields are used to exchange the file sizes at the end of the
+operation if the file data fork is the target of the operation.
+
+When the exchange is initiated, the sequence of operations is as follows:
+
+1. Create a deferred work item for the file mapping exchange.
+ At the start, it should contain the entirety of the file block ranges to be
+ exchanged.
2. Call ``xfs_defer_finish`` to process the exchange.
- This is encapsulated in ``xrep_tempswap_contents`` for scrub operations.
+ This is encapsulated in ``xrep_tempexch_contents`` for scrub operations.
This will log an extent swap intent item to the transaction for the deferred
- extent swap work item.
+ mapping exchange work item.
-3. Until ``sxi_blockcount`` of the deferred extent swap work item is zero,
+3. Until ``xmi_blockcount`` of the deferred mapping exchange work item is zero,
- a. Read the block maps of both file ranges starting at ``sxi_startoff1`` and
- ``sxi_startoff2``, respectively, and compute the longest extent that can
- be swapped in a single step.
+ a. Read the block maps of both file ranges starting at ``xmi_startoff1`` and
+ ``xmi_startoff2``, respectively, and compute the longest extent that can
+ be exchanged in a single step.
This is the minimum of the two ``br_blockcount`` s in the mappings.
Keep advancing through the file forks until at least one of the mappings
contains written blocks.
@@ -4159,20 +4155,20 @@ When the extent swap is initiated, the sequence of operations is as follows:
g. Extend the ondisk size of either file if necessary.
- h. Log an extent swap done log item for the extent swap intent log item
- that was read at the start of step 3.
+ h. Log a mapping exchange done log item for the mapping exchange intent log
+ item that was read at the start of step 3.
i. Compute the amount of file range that has just been covered.
This quantity is ``(map1.br_startoff + map1.br_blockcount -
- sxi_startoff1)``, because step 3a could have skipped holes.
+ xmi_startoff1)``, because step 3a could have skipped holes.
- j. Increase the starting offsets of ``sxi_startoff1`` and ``sxi_startoff2``
+ j. Increase the starting offsets of ``xmi_startoff1`` and ``xmi_startoff2``
by the number of blocks computed in the previous step, and decrease
- ``sxi_blockcount`` by the same quantity.
+ ``xmi_blockcount`` by the same quantity.
This advances the cursor.
- k. Log a new extent swap intent log item reflecting the advanced state of
- the work item.
+ k. Log a new mapping exchange intent log item reflecting the advanced state
+ of the work item.
l. Return the proper error code (EAGAIN) to the deferred operation manager
to inform it that there is more work to be done.
@@ -4183,22 +4179,23 @@ When the extent swap is initiated, the sequence of operations is as follows:
This will be discussed in more detail in subsequent sections.
If the filesystem goes down in the middle of an operation, log recovery will
-find the most recent unfinished extent swap log intent item and restart from
-there.
-This is how extent swapping guarantees that an outside observer will either see
-the old broken structure or the new one, and never a mismash of both.
+find the most recent unfinished mapping exchange log intent item and restart
+from there.
+This is how atomic file mapping exchanges guarantee that an outside observer
+will either see the old broken structure or the new one, and never a mishmash
+of both.
-Preparation for Extent Swapping
-```````````````````````````````
+Preparation for File Content Exchanges
+``````````````````````````````````````
There are a few things that need to be taken care of before initiating an
-atomic extent swap operation.
+atomic file mapping exchange operation.
First, regular files require the page cache to be flushed to disk before the
operation begins, and directio writes to be quiesced.
-Like any filesystem operation, extent swapping must determine the maximum
-amount of disk space and quota that can be consumed on behalf of both files in
-the operation, and reserve that quantity of resources to avoid an unrecoverable
-out of space failure once it starts dirtying metadata.
+Like any filesystem operation, file mapping exchanges must determine the
+maximum amount of disk space and quota that can be consumed on behalf of both
+files in the operation, and reserve that quantity of resources to avoid an
+unrecoverable out of space failure once it starts dirtying metadata.
The preparation step scans the ranges of both files to estimate:
- Data device blocks needed to handle the repeated updates to the fork
@@ -4212,56 +4209,59 @@ The preparation step scans the ranges of both files to estimate:
to different extents on the realtime volume, which could happen if the
operation fails to run to completion.
-The need for precise estimation increases the run time of the swap operation,
-but it is very important to maintain correct accounting.
-The filesystem must not run completely out of free space, nor can the extent
-swap ever add more extent mappings to a fork than it can support.
+The need for precise estimation increases the run time of the exchange
+operation, but it is very important to maintain correct accounting.
+The filesystem must not run completely out of free space, nor can the mapping
+exchange ever add more extent mappings to a fork than it can support.
Regular users are required to abide the quota limits, though metadata repairs
may exceed quota to resolve inconsistent metadata elsewhere.
-Special Features for Swapping Metadata File Extents
-```````````````````````````````````````````````````
+Special Features for Exchanging Metadata File Contents
+``````````````````````````````````````````````````````
Extended attributes, symbolic links, and directories can set the fork format to
"local" and treat the fork as a literal area for data storage.
Metadata repairs must take extra steps to support these cases:
- If both forks are in local format and the fork areas are large enough, the
- swap is performed by copying the incore fork contents, logging both forks,
- and committing.
- The atomic extent swap mechanism is not necessary, since this can be done
- with a single transaction.
+ exchange is performed by copying the incore fork contents, logging both
+ forks, and committing.
+ The atomic file mapping exchange mechanism is not necessary, since this can
+ be done with a single transaction.
-- If both forks map blocks, then the regular atomic extent swap is used.
+- If both forks map blocks, then the regular atomic file mapping exchange is
+ used.
- Otherwise, only one fork is in local format.
The contents of the local format fork are converted to a block to perform the
- swap.
+ exchange.
The conversion to block format must be done in the same transaction that
- logs the initial extent swap intent log item.
- The regular atomic extent swap is used to exchange the mappings.
- Special flags are set on the swap operation so that the transaction can be
- rolled one more time to convert the second file's fork back to local format
- so that the second file will be ready to go as soon as the ILOCK is dropped.
+ logs the initial mapping exchange intent log item.
+ The regular atomic mapping exchange is used to exchange the metadata file
+ mappings.
+ Special flags are set on the exchange operation so that the transaction can
+ be rolled one more time to convert the second file's fork back to local
+ format so that the second file will be ready to go as soon as the ILOCK is
+ dropped.
Extended attributes and directories stamp the owning inode into every block,
but the buffer verifiers do not actually check the inode number!
Although there is no verification, it is still important to maintain
-referential integrity, so prior to performing the extent swap, online repair
-builds every block in the new data structure with the owner field of the file
-being repaired.
+referential integrity, so prior to performing the mapping exchange, online
+repair builds every block in the new data structure with the owner field of the
+file being repaired.
-After a successful swap operation, the repair operation must reap the old fork
-blocks by processing each fork mapping through the standard :ref:`file extent
-reaping <reaping>` mechanism that is done post-repair.
+After a successful exchange operation, the repair operation must reap the old
+fork blocks by processing each fork mapping through the standard :ref:`file
+extent reaping <reaping>` mechanism that is done post-repair.
If the filesystem should go down during the reap part of the repair, the
iunlink processing at the end of recovery will free both the temporary file and
whatever blocks were not reaped.
However, this iunlink processing omits the cross-link detection of online
repair, and is not completely foolproof.
-Swapping Temporary File Extents
-```````````````````````````````
+Exchanging Temporary File Contents
+``````````````````````````````````
To repair a metadata file, online repair proceeds as follows:
@@ -4271,14 +4271,14 @@ To repair a metadata file, online repair proceeds as follows:
file.
The same fork must be written to as is being repaired.
-3. Commit the scrub transaction, since the swap estimation step must be
- completed before transaction reservations are made.
+3. Commit the scrub transaction, since the exchange resource estimation step
+ must be completed before transaction reservations are made.
-4. Call ``xrep_tempswap_trans_alloc`` to allocate a new scrub transaction with
+4. Call ``xrep_tempexch_trans_alloc`` to allocate a new scrub transaction with
the appropriate resource reservations, locks, and fill out a ``struct
- xfs_swapext_req`` with the details of the swap operation.
+ xfs_exchmaps_req`` with the details of the exchange operation.
-5. Call ``xrep_tempswap_contents`` to swap the contents.
+5. Call ``xrep_tempexch_contents`` to exchange the contents.
6. Commit the transaction to complete the repair.
@@ -4320,7 +4320,7 @@ To check the summary file against the bitmap:
3. Compare the contents of the xfile against the ondisk file.
To repair the summary file, write the xfile contents into the temporary file
-and use atomic extent swap to commit the new contents.
+and use atomic mapping exchange to commit the new contents.
The temporary file is then reaped.
The proposed patchset is the
@@ -4363,8 +4363,8 @@ Salvaging extended attributes is done as follows:
memory or there are no more attr fork blocks to examine, unlock the file and
add the staged extended attributes to the temporary file.
-3. Use atomic extent swapping to exchange the new and old extended attribute
- structures.
+3. Use atomic file mapping exchange to exchange the new and old extended
+ attribute structures.
The old attribute blocks are now attached to the temporary file.
4. Reap the temporary file.
@@ -4421,7 +4421,8 @@ salvaging directories is straightforward:
directory and add the staged dirents into the temporary directory.
Truncate the staging files.
-4. Use atomic extent swapping to exchange the new and old directory structures.
+4. Use atomic file mapping exchange to exchange the new and old directory
+ structures.
The old directory blocks are now attached to the temporary file.
5. Reap the temporary file.
@@ -4464,10 +4465,10 @@ reconstruction of filesystem space metadata.
The parent pointer feature, however, makes total directory reconstruction
possible.
-XFS parent pointers include the dirent name and location of the entry within
-the parent directory.
+XFS parent pointers contain the information needed to identify the
+corresponding directory entry in the parent directory.
In other words, child files use extended attributes to store pointers to
-parents in the form ``(parent_inum, parent_gen, dirent_pos) → (dirent_name)``.
+parents in the form ``(dirent_name) → (parent_inum, parent_gen)``.
The directory checking process can be strengthened to ensure that the target of
each dirent also contains a parent pointer pointing back to the dirent.
Likewise, each parent pointer can be checked by ensuring that the target of
@@ -4475,8 +4476,6 @@ each parent pointer is a directory and that it contains a dirent matching
the parent pointer.
Both online and offline repair can use this strategy.
-**Note**: The ondisk format of parent pointers is not yet finalized.
-
+--------------------------------------------------------------------------+
| **Historical Sidebar**: |
+--------------------------------------------------------------------------+
@@ -4518,8 +4517,58 @@ Both online and offline repair can use this strategy.
| Chandan increased the maximum extent counts of both data and attribute |
| forks, thereby ensuring that the extended attribute structure can grow |
| to handle the maximum hardlink count of any file. |
+| |
+| For this second effort, the ondisk parent pointer format as originally |
+| proposed was ``(parent_inum, parent_gen, dirent_pos) → (dirent_name)``. |
+| The format was changed during development to eliminate the requirement |
+| of repair tools needing to ensure that the ``dirent_pos`` field          |
+| always matched when reconstructing a directory. |
+| |
+| There were a few other ways to have solved that problem: |
+| |
+| 1. The field could be designated advisory, since the other three values |
+| are sufficient to find the entry in the parent. |
+| However, this makes indexed key lookup impossible while repairs are |
+| ongoing. |
+| |
+| 2. We could allow creating directory entries at specified offsets, which |
+| solves the referential integrity problem but runs the risk that |
+| dirent creation will fail due to conflicts with the free space in the |
+| directory. |
+| |
+| These conflicts could be resolved by appending the directory entry |
+| and amending the xattr code to support updating an xattr key and |
+| reindexing the dabtree, though this would have to be performed with |
+| the parent directory still locked. |
+| |
+| 3. Same as above, but remove the old parent pointer entry and add a new |
+| one atomically. |
+| |
+| 4. Change the ondisk xattr format to |
+| ``(parent_inum, name) → (parent_gen)``, which would provide the attr |
+| name uniqueness that we require, without forcing repair code to |
+| update the dirent position. |
+| Unfortunately, this requires changes to the xattr code to support |
+| attr names as long as 263 bytes. |
+| |
+| 5. Change the ondisk xattr format to ``(parent_inum, hash(name)) → |
+| (name, parent_gen)``. |
+| If the hash is sufficiently resistant to collisions (e.g. sha256) |
+| then this should provide the attr name uniqueness that we require. |
+| Names shorter than 247 bytes could be stored directly. |
+| |
+| 6. Change the ondisk xattr format to ``(dirent_name) → (parent_ino, |
+| parent_gen)``. This format doesn't require any of the complicated |
+| nested name hashing of the previous suggestions. However, it was |
+| discovered that multiple hardlinks to the same inode with the same |
+| filename caused performance problems with hashed xattr lookups, so |
+| the parent inumber is now xor'd into the hash index. |
+| |
+| In the end, it was decided that solution #6 was the most compact and the |
+| most performant. A new hash function was designed for parent pointers. |
+--------------------------------------------------------------------------+
+
Case Study: Repairing Directories with Parent Pointers
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -4527,8 +4576,9 @@ Directory rebuilding uses a :ref:`coordinated inode scan <iscan>` and
a :ref:`directory entry live update hook <liveupdate>` as follows:
1. Set up a temporary directory for generating the new directory structure,
- an xfblob for storing entry names, and an xfarray for stashing directory
- updates.
+ an xfblob for storing entry names, and an xfarray for stashing the fixed
+ size fields involved in a directory update: ``(child inumber, add vs.
+ remove, name cookie, ftype)``.
2. Set up an inode scanner and hook into the directory entry code to receive
updates on directory operations.
@@ -4537,73 +4587,36 @@ a :ref:`directory entry live update hook <liveupdate>` as follows:
pointer references the directory of interest.
If so:
- a. Stash an addname entry for this dirent in the xfarray for later.
+ a. Stash the parent pointer name and an addname entry for this dirent in the
+ xfblob and xfarray, respectively.
- b. When finished scanning that file, flush the stashed updates to the
- temporary directory.
+ b. When finished scanning that file, or when kernel memory consumption
+ exceeds a threshold, flush the stashed updates to the temporary directory.
4. For each live directory update received via the hook, decide if the child
has already been scanned.
If so:
- a. Stash an addname or removename entry for this dirent update in the
- xfarray for later.
+ a. Stash the parent pointer name and an addname or removename entry for this
+ dirent update in the xfblob and xfarray for later.
We cannot write directly to the temporary directory because hook
functions are not allowed to modify filesystem metadata.
Instead, we stash updates in the xfarray and rely on the scanner thread
to apply the stashed updates to the temporary directory.
-5. When the scan is complete, atomically swap the contents of the temporary
+5. When the scan is complete, replay any stashed entries in the xfarray.
+
+6. When the scan is complete, atomically exchange the contents of the temporary
directory and the directory being repaired.
The temporary directory now contains the damaged directory structure.
-6. Reap the temporary directory.
-
-7. Update the dirent position field of parent pointers as necessary.
- This may require the queuing of a substantial number of xattr log intent
- items.
+7. Reap the temporary directory.
The proposed patchset is the
`parent pointers directory repair
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=pptrs-online-dir-repair>`_
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=pptrs-fsck>`_
series.
-**Unresolved Question**: How will repair ensure that the ``dirent_pos`` fields
-match in the reconstructed directory?
-
-*Answer*: There are a few ways to solve this problem:
-
-1. The field could be designated advisory, since the other three values are
- sufficient to find the entry in the parent.
- However, this makes indexed key lookup impossible while repairs are ongoing.
-
-2. We could allow creating directory entries at specified offsets, which solves
- the referential integrity problem but runs the risk that dirent creation
- will fail due to conflicts with the free space in the directory.
-
- These conflicts could be resolved by appending the directory entry and
- amending the xattr code to support updating an xattr key and reindexing the
- dabtree, though this would have to be performed with the parent directory
- still locked.
-
-3. Same as above, but remove the old parent pointer entry and add a new one
- atomically.
-
-4. Change the ondisk xattr format to ``(parent_inum, name) → (parent_gen)``,
- which would provide the attr name uniqueness that we require, without
- forcing repair code to update the dirent position.
- Unfortunately, this requires changes to the xattr code to support attr
- names as long as 263 bytes.
-
-5. Change the ondisk xattr format to ``(parent_inum, hash(name)) →
- (name, parent_gen)``.
- If the hash is sufficiently resistant to collisions (e.g. sha256) then
- this should provide the attr name uniqueness that we require.
- Names shorter than 247 bytes could be stored directly.
-
-Discussion is ongoing under the `parent pointers patch deluge
-<https://www.spinics.net/lists/linux-xfs/msg69397.html>`_.
-
Case Study: Repairing Parent Pointers
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -4611,8 +4624,9 @@ Online reconstruction of a file's parent pointer information works similarly to
directory reconstruction:
1. Set up a temporary file for generating a new extended attribute structure,
- an `xfblob<xfblob>` for storing parent pointer names, and an xfarray for
- stashing parent pointer updates.
+ an xfblob for storing parent pointer names, and an xfarray for stashing the
+ fixed size fields involved in a parent pointer update: ``(parent inumber,
+ parent generation, add vs. remove, name cookie)``.
2. Set up an inode scanner and hook into the directory entry code to receive
updates on directory operations.
@@ -4621,34 +4635,36 @@ directory reconstruction:
dirent references the file of interest.
If so:
- a. Stash an addpptr entry for this parent pointer in the xfblob and xfarray
- for later.
+ a. Stash the dirent name and an addpptr entry for this parent pointer in the
+ xfblob and xfarray, respectively.
- b. When finished scanning the directory, flush the stashed updates to the
- temporary directory.
+ b. When finished scanning the directory or the kernel memory consumption
+ exceeds a threshold, flush the stashed updates to the temporary file.
4. For each live directory update received via the hook, decide if the parent
has already been scanned.
If so:
- a. Stash an addpptr or removepptr entry for this dirent update in the
- xfarray for later.
+ a. Stash the dirent name and an addpptr or removepptr entry for this dirent
+ update in the xfblob and xfarray for later.
We cannot write parent pointers directly to the temporary file because
hook functions are not allowed to modify filesystem metadata.
Instead, we stash updates in the xfarray and rely on the scanner thread
to apply the stashed parent pointer updates to the temporary file.
-5. Copy all non-parent pointer extended attributes to the temporary file.
+5. When the scan is complete, replay any stashed entries in the xfarray.
+
+6. Copy all non-parent pointer extended attributes to the temporary file.
-6. When the scan is complete, atomically swap the attribute fork of the
- temporary file and the file being repaired.
+7. When the scan is complete, atomically exchange the mappings of the attribute
+ forks of the temporary file and the file being repaired.
The temporary file now contains the damaged extended attribute structure.
-7. Reap the temporary file.
+8. Reap the temporary file.
The proposed patchset is the
`parent pointers repair
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=pptrs-online-parent-repair>`_
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=pptrs-fsck>`_
series.
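+
+As a self-contained model of the stash-and-replay idea, consider the sketch
+below; none of these names are the kernel's.  A flat array stands in for the
+xfarray of fixed-size records, and a byte buffer stands in for the xfblob,
+with each name addressed by its offset, which plays the role of the cookie:
+
+.. code-block:: c
+
+    #include <stdbool.h>
+    #include <stdint.h>
+    #include <stdio.h>
+
+    struct pptr_update {
+        uint64_t  name_cookie;  /* offset of the name in the blob */
+        uint64_t  parent_ino;   /* parent inode number */
+        uint32_t  parent_gen;   /* parent inode generation */
+        bool      add;          /* true: addpptr, false: removepptr */
+    };
+
+    /* Walk the stashed records in order and decode each update. */
+    static void replay_stashed(const struct pptr_update *recs, size_t nr,
+                               const char *name_blob)
+    {
+        for (size_t i = 0; i < nr; i++) {
+            const char *name = name_blob + recs[i].name_cookie;
+
+            /* A real replay would apply the update to the temp file. */
+            printf("%s pptr (ino %llu gen %u) name '%s'\n",
+                   recs[i].add ? "add" : "remove",
+                   (unsigned long long)recs[i].parent_ino,
+                   (unsigned int)recs[i].parent_gen, name);
+        }
+    }
+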
Digression: Offline Checking of Parent Pointers
@@ -4659,26 +4675,56 @@ files are erased long before directory tree connectivity checks are performed.
Parent pointer checks are therefore a second pass to be added to the existing
connectivity checks:
-1. After the set of surviving files has been established (i.e. phase 6),
+1. After the set of surviving files has been established (phase 6),
walk the surviving directories of each AG in the filesystem.
This is already performed as part of the connectivity checks.
-2. For each directory entry found, record the name in an xfblob, and store
- ``(child_ag_inum, parent_inum, parent_gen, dirent_pos)`` tuples in a
- per-AG in-memory slab.
+2. For each directory entry found,
+
+ a. If the name has already been stored in the xfblob, then use that cookie
+ and skip the next step.
+
+ b. Otherwise, record the name in an xfblob, and remember the xfblob cookie.
+ Unique mappings are critical for
+
+ 1. Deduplicating names to reduce memory usage, and
+
+ 2. Creating a stable sort key for the parent pointer indexes so that the
+ parent pointer validation described below will work.
+
+ c. Store ``(child_ag_inum, parent_inum, parent_gen, name_hash, name_len,
+ name_cookie)`` tuples in a per-AG in-memory slab. The ``name_hash``
+ referenced in this section is the regular directory entry name hash, not
+ the specialized one used for parent pointer xattrs.
3. For each AG in the filesystem,
- a. Sort the per-AG tuples in order of child_ag_inum, parent_inum, and
- dirent_pos.
+ a. Sort the per-AG tuple set in order of ``child_ag_inum``, ``parent_inum``,
+ ``name_hash``, and ``name_cookie``.
+ Having a single ``name_cookie`` for each ``name`` is critical for
+ handling the uncommon case of a directory containing multiple hardlinks
+ to the same file where all the names hash to the same value.
b. For each inode in the AG,
1. Scan the inode for parent pointers.
- Record the names in a per-file xfblob, and store ``(parent_inum,
- parent_gen, dirent_pos)`` tuples in a per-file slab.
+ For each parent pointer found,
+
+ a. Validate the ondisk parent pointer.
+ If validation fails, move on to the next parent pointer in the
+ file.
+
+ b. If the name has already been stored in the xfblob, then use that
+ cookie and skip the next step.
+
+ c. Record the name in a per-file xfblob, and remember the xfblob
+ cookie.
- 2. Sort the per-file tuples in order of parent_inum, and dirent_pos.
+ d. Store ``(parent_inum, parent_gen, name_hash, name_len,
+ name_cookie)`` tuples in a per-file slab.
+
+ 2. Sort the per-file tuples in order of ``parent_inum``, ``name_hash``,
+ and ``name_cookie``.
3. Position one slab cursor at the start of the inode's records in the
per-AG tuple slab.
@@ -4687,28 +4733,37 @@ connectivity checks:
4. Position a second slab cursor at the start of the per-file tuple slab.
- 5. Iterate the two cursors in lockstep, comparing the parent_ino and
- dirent_pos fields of the records under each cursor.
+ 5. Iterate the two cursors in lockstep, comparing the ``parent_inum``,
+ ``name_hash``, and ``name_cookie`` fields of the records under each
+ cursor:
- a. Tuples in the per-AG list but not the per-file list are missing and
- need to be written to the inode.
+ a. If the per-AG cursor is at a lower point in the keyspace than the
+ per-file cursor, then the per-AG cursor points to a missing parent
+ pointer.
+ Add the parent pointer to the inode and advance the per-AG
+ cursor.
- b. Tuples in the per-file list but not the per-AG list are dangling
- and need to be removed from the inode.
+ b. If the per-file cursor is at a lower point in the keyspace than
+ the per-AG cursor, then the per-file cursor points to a dangling
+ parent pointer.
+ Remove the parent pointer from the inode and advance the per-file
+ cursor.
- c. For tuples in both lists, update the parent_gen and name components
- of the parent pointer if necessary.
+ c. Otherwise, both cursors point at the same parent pointer.
+ Update the ``parent_gen`` component if necessary.
+ Advance both cursors.
4. Move on to examining link counts, as we do today.
The proposed patchset is the
`offline parent pointers repair
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=pptrs-repair>`_
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=pptrs-fsck>`_
series.
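+
+The lockstep cursor walk in step 3b5 above is an ordinary merge of two sorted
+lists.  The sketch below is illustrative only: the names are invented, and it
+assumes that identical names receive identical cookies in both lists (for
+example because they share one deduplicated name store):
+
+.. code-block:: c
+
+    #include <stddef.h>
+    #include <stdint.h>
+    #include <stdio.h>
+
+    struct pptr_key {
+        uint64_t  parent_inum;
+        uint32_t  name_hash;
+        uint64_t  name_cookie;
+        uint32_t  parent_gen;
+    };
+
+    /* Both slabs are sorted by (parent_inum, name_hash, name_cookie). */
+    static int key_cmp(const struct pptr_key *a, const struct pptr_key *b)
+    {
+        if (a->parent_inum != b->parent_inum)
+            return a->parent_inum < b->parent_inum ? -1 : 1;
+        if (a->name_hash != b->name_hash)
+            return a->name_hash < b->name_hash ? -1 : 1;
+        if (a->name_cookie != b->name_cookie)
+            return a->name_cookie < b->name_cookie ? -1 : 1;
+        return 0;
+    }
+
+    /* Compare what the directories say (ag[]) with what the file says. */
+    static void compare_pptrs(const struct pptr_key *ag, size_t nr_ag,
+                              const struct pptr_key *file, size_t nr_file)
+    {
+        size_t i = 0, j = 0;
+
+        while (i < nr_ag || j < nr_file) {
+            int cmp;
+
+            if (i == nr_ag)
+                cmp = 1;          /* only per-file records remain */
+            else if (j == nr_file)
+                cmp = -1;         /* only per-AG records remain */
+            else
+                cmp = key_cmp(&ag[i], &file[j]);
+
+            if (cmp < 0) {
+                printf("missing parent pointer: add it to the inode\n");
+                i++;
+            } else if (cmp > 0) {
+                printf("dangling parent pointer: remove it\n");
+                j++;
+            } else {
+                if (ag[i].parent_gen != file[j].parent_gen)
+                    printf("update stale parent_gen\n");
+                i++;
+                j++;
+            }
+        }
+    }
+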
-Rebuilding directories from parent pointers in offline repair is very
-challenging because it currently uses a single-pass scan of the filesystem
-during phase 3 to decide which files are corrupt enough to be zapped.
+Rebuilding directories from parent pointers in offline repair would be very
+challenging because xfs_repair currently uses two single-pass scans of the
+filesystem during phases 3 and 4 to decide which files are corrupt enough to be
+zapped.
This scan would have to be converted into a multi-pass scan:
1. The first pass of the scan zaps corrupt inodes, forks, and attributes
@@ -4730,6 +4785,130 @@ This scan would have to be converted into a multi-pass scan:
This code has not yet been constructed.
+.. _dirtree:
+
+Case Study: Directory Tree Structure
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+As mentioned earlier, the filesystem directory tree is supposed to be a
+directed acyclic graph structure.
+However, each node in this graph is a separate ``xfs_inode`` object with its
+own locks, which makes validating the graph structure difficult.
+Fortunately, non-directories are allowed to have multiple parents and cannot
+have children, so only directories need to be scanned.
+Directories typically constitute 5-10% of the files in a filesystem, which
+reduces the amount of work dramatically.
+
+If the directory tree could be frozen, it would be easy to discover cycles and
+disconnected regions by running a depth (or breadth) first search downwards
+from the root directory and marking a bitmap for each directory found.
+At any point in the walk, trying to set an already set bit means there is a
+cycle.
+After the scan completes, XORing the marked inode bitmap with the inode
+allocation bitmap reveals disconnected inodes.
+However, one of online repair's design goals is to avoid locking the entire
+filesystem unless it's absolutely necessary.
+Directory tree updates can move subtrees across the scanner wavefront on a live
+filesystem, so the bitmap algorithm cannot be applied.
+
+Directory parent pointers enable an incremental approach to validation of the
+tree structure.
+Instead of using one thread to scan the entire filesystem, multiple threads can
+walk from individual subdirectories upwards towards the root.
+For this to work, all directory entries and parent pointers must be internally
+consistent, each directory entry must have a parent pointer, and the link
+counts of all directories must be correct.
+Each scanner thread must be able to take the IOLOCK of an alleged parent
+directory while holding the IOLOCK of the child directory to prevent either
+directory from being moved within the tree.
+This is not possible since the VFS does not take the IOLOCK of a child
+subdirectory when moving that subdirectory, so instead the scanner stabilizes
+the parent -> child relationship by taking the ILOCKs and installing a dirent
+update hook to detect changes.
+
+The scanning process uses a dirent hook to detect changes to the directories
+mentioned in the scan data.
+The scan works as follows:
+
+1. For each subdirectory in the filesystem,
+
+ a. For each parent pointer of that subdirectory,
+
+ 1. Create a path object for that parent pointer, and mark the
+ subdirectory inode number in the path object's bitmap.
+
+ 2. Record the parent pointer name and inode number in a path structure.
+
+ 3. If the alleged parent is the subdirectory being scrubbed, the path is
+ a cycle.
+ Mark the path for deletion and repeat step 1a with the next
+ subdirectory parent pointer.
+
+ 4. Try to mark the alleged parent inode number in a bitmap in the path
+ object.
+ If the bit is already set, then there is a cycle in the directory
+ tree.
+ Mark the path as a cycle and repeat step 1a with the next subdirectory
+ parent pointer.
+
+ 5. Load the alleged parent.
+ If the alleged parent is not a linked directory, abort the scan
+ because the parent pointer information is inconsistent.
+
+ 6. For each parent pointer of this alleged ancestor directory,
+
+ a. Record the parent pointer name and inode number in the path object
+ if no parent has been set for that level.
+
+ b. If an ancestor has more than one parent, mark the path as corrupt.
+ Repeat step 1a with the next subdirectory parent pointer.
+
+ c. Repeat steps 1a3-1a6 for the ancestor identified in step 1a6a.
+ This repeats until the directory tree root is reached or no parents
+ are found.
+
+ 7. If the walk terminates at the root directory, mark the path as ok.
+
+ 8. If the walk terminates without reaching the root, mark the path as
+ disconnected.
+
+2. If the directory entry update hook triggers, check all paths already found
+ by the scan.
+ If the entry matches part of a path, mark that path and the scan stale.
+ When the scanner thread sees that the scan has been marked stale, it deletes
+ all scan data and starts over.
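+
+As a rough model of a single upward walk, consider the sketch below.  It is
+illustrative only: it assumes a toy inode-number space, collapses "choose the
+parent pointer for this level" into a caller-supplied function returning one
+parent (or zero for none), and ignores the locking and live-update handling
+described above:
+
+.. code-block:: c
+
+    #include <stdbool.h>
+    #include <stdint.h>
+
+    #define MAX_INO   4096   /* toy inode-number space */
+    #define ROOT_INO  1
+
+    enum path_status { PATH_OK, PATH_CYCLE, PATH_DISCONNECTED };
+
+    static enum path_status walk_to_root(uint64_t child,
+                                         uint64_t (*parent_of)(uint64_t))
+    {
+        bool seen[MAX_INO] = { false };
+        uint64_t ino = child;
+
+        seen[ino] = true;                 /* mark the starting directory */
+        while (ino != ROOT_INO) {
+            uint64_t parent = parent_of(ino);
+
+            if (parent == 0)              /* no parent pointer found */
+                return PATH_DISCONNECTED;
+            if (seen[parent])             /* already on this path: cycle */
+                return PATH_CYCLE;
+            seen[parent] = true;
+            ino = parent;
+        }
+        return PATH_OK;
+    }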
+
+Repairing the directory tree works as follows:
+
+1. Walk each path of the target subdirectory.
+
+ a. Corrupt paths and cycle paths are counted as suspect.
+
+ b. Paths already marked for deletion are counted as bad.
+
+ c. Paths that reached the root are counted as good.
+
+2. If the subdirectory is either the root directory or has zero link count,
+ delete all incoming directory entries in the immediate parents.
+ Repairs are complete.
+
+3. If the subdirectory has exactly one path, set the dotdot entry to the
+ parent and exit.
+
+4. If the subdirectory has at least one good path, delete all the other
+ incoming directory entries in the immediate parents.
+
+5. If the subdirectory has no good paths and more than one suspect path, delete
+ all the other incoming directory entries in the immediate parents.
+
+6. If the subdirectory has zero paths, attach it to the lost and found.
+
+The proposed patches are in the
+`directory tree repair
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-directory-tree>`_
+series.
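+
+To make the repair decision steps above concrete, the sketch below condenses
+them into one function; it is illustrative only, follows the steps literally,
+and leaves any combination of path counts the steps do not address alone:
+
+.. code-block:: c
+
+    #include <stdbool.h>
+
+    enum dirtree_fix {
+        FIX_SEVER_ALL,      /* delete every incoming dirent */
+        FIX_SET_DOTDOT,     /* one path: point dotdot at that parent */
+        FIX_KEEP_ONE_PATH,  /* delete the other incoming dirents */
+        FIX_ADOPT,          /* no paths: attach to the lost and found */
+        FIX_NONE,           /* combination not covered by the steps */
+    };
+
+    static enum dirtree_fix decide_fix(bool is_root, unsigned int nlink,
+                                       unsigned int good,
+                                       unsigned int suspect,
+                                       unsigned int bad)
+    {
+        unsigned int total = good + suspect + bad;
+
+        if (is_root || nlink == 0)        /* step 2 */
+            return FIX_SEVER_ALL;
+        if (total == 1)                   /* step 3 */
+            return FIX_SET_DOTDOT;
+        if (good > 0)                     /* step 4 */
+            return FIX_KEEP_ONE_PATH;
+        if (suspect > 1)                  /* step 5: no good paths left */
+            return FIX_KEEP_ONE_PATH;
+        if (total == 0)                   /* step 6 */
+            return FIX_ADOPT;
+        return FIX_NONE;
+    }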
+
+
.. _orphanage:
The Orphanage
@@ -4777,14 +4956,22 @@ Orphaned files are adopted by the orphanage as follows:
The ``xrep_orphanage_iolock_two`` function follows the inode locking
strategy discussed earlier.
-3. Call ``xrep_orphanage_compute_blkres`` and ``xrep_orphanage_compute_name``
- to compute the new name in the orphanage and the block reservation required.
-
-4. Use ``xrep_orphanage_adoption_prep`` to reserve resources to the repair
+3. Use ``xrep_adoption_trans_alloc`` to reserve resources to the repair
transaction.
-5. Call ``xrep_orphanage_adopt`` to reparent the orphaned file into the lost
- and found, and update the kernel dentry cache.
+4. Call ``xrep_orphanage_compute_name`` to compute the new name in the
+ orphanage.
+
+5. If the adoption is going to happen, call ``xrep_adoption_reparent`` to
+ reparent the orphaned file into the lost and found and invalidate the dentry
+ cache.
+
+6. Call ``xrep_adoption_finish`` to commit any filesystem updates, release the
+ orphanage ILOCK, and clean the scrub transaction. Call
+ ``xrep_adoption_commit`` to commit the updates and the scrub transaction.
+
+7. If a runtime error happens, call ``xrep_adoption_cancel`` to release all
+ resources.
The proposed patches are in the
`orphanage adoption
@@ -5116,18 +5303,18 @@ make it easier for code readers to understand what has been built, for whom it
has been built, and why.
Please feel free to contact the XFS mailing list with questions.
-FIEXCHANGE_RANGE
-----------------
+XFS_IOC_EXCHANGE_RANGE
+----------------------
-As discussed earlier, a second frontend to the atomic extent swap mechanism is
-a new ioctl call that userspace programs can use to commit updates to files
-atomically.
+As discussed earlier, a second frontend to the atomic file mapping exchange
+mechanism is a new ioctl call that userspace programs can use to commit updates
+to files atomically.
This frontend has been out for review for several years now, though the
necessary refinements to online repair and lack of customer demand mean that
the proposal has not been pushed very hard.
-Extent Swapping with Regular User Files
-```````````````````````````````````````
+File Content Exchanges with Regular User Files
+``````````````````````````````````````````````
As mentioned earlier, XFS has long had the ability to swap extents between
files, which is used almost exclusively by ``xfs_fsr`` to defragment files.
@@ -5142,12 +5329,12 @@ the consistency of the fork mappings with the reverse mapping index was to
develop an iterative mechanism that used deferred bmap and rmap operations to
swap mappings one at a time.
This mechanism is identical to steps 2-3 from the procedure above except for
-the new tracking items, because the atomic extent swap mechanism is an
-iteration of an existing mechanism and not something totally novel.
+the new tracking items, because the atomic file mapping exchange mechanism is
+an iteration of an existing mechanism and not something totally novel.
For the narrow case of file defragmentation, the file contents must be
identical, so the recovery guarantees are not much of a gain.
-Atomic extent swapping is much more flexible than the existing swapext
+Atomic file content exchange is much more flexible than the existing swapext
implementations because it can guarantee that the caller never sees a mix of
old and new contents even after a crash, and it can operate on two arbitrary
file fork ranges.
@@ -5158,11 +5345,11 @@ The extra flexibility enables several new use cases:
Next, it opens a temporary file and calls the file clone operation to reflink
the first file's contents into the temporary file.
Writes to the original file should instead be written to the temporary file.
- Finally, the process calls the atomic extent swap system call
- (``FIEXCHANGE_RANGE``) to exchange the file contents, thereby committing all
- of the updates to the original file, or none of them.
+ Finally, the process calls the atomic file mapping exchange system call
+ (``XFS_IOC_EXCHANGE_RANGE``) to exchange the file contents, thereby
+ committing all of the updates to the original file, or none of them.
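+ A sketch of this usage appears after this list of use cases.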
-.. _swapext_if_unchanged:
+.. _exchrange_if_unchanged:
- **Transactional file updates**: The same mechanism as above, but the caller
only wants the commit to occur if the original file's contents have not
@@ -5171,16 +5358,17 @@ The extra flexibility enables several new use cases:
change timestamps of the original file before reflinking its data to the
temporary file.
When the program is ready to commit the changes, it passes the timestamps
- into the kernel as arguments to the atomic extent swap system call.
+ into the kernel as arguments to the atomic file mapping exchange system call.
The kernel only commits the changes if the provided timestamps match the
original file.
+ A new ioctl (``XFS_IOC_COMMIT_RANGE``) is provided for this purpose.
- **Emulation of atomic block device writes**: Export a block device with a
logical sector size matching the filesystem block size to force all writes
to be aligned to the filesystem block size.
Stage all writes to a temporary file, and when that is complete, call the
- atomic extent swap system call with a flag to indicate that holes in the
- temporary file should be ignored.
+ atomic file mapping exchange system call with a flag to indicate that holes
+ in the temporary file should be ignored.
This emulates an atomic device write in software, and can support arbitrary
scattered writes.
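+
+As a rough userspace sketch of the first use case above (committing staged
+updates to a file all at once), the flow might look like the following.  It
+assumes a UAPI header providing ``struct xfs_exchange_range`` and
+``XFS_IOC_EXCHANGE_RANGE`` (for example ``xfs/xfs_fs.h`` from a sufficiently
+new xfsprogs); the structure field names shown are assumptions, and the
+sketch further assumes the staged update does not change the file size:
+
+.. code-block:: c
+
+    #include <fcntl.h>
+    #include <string.h>
+    #include <sys/ioctl.h>
+    #include <sys/stat.h>
+    #include <sys/types.h>
+    #include <unistd.h>
+    #include <linux/fs.h>     /* FICLONE */
+    #include <xfs/xfs_fs.h>   /* struct xfs_exchange_range (assumed) */
+
+    /* Stage an update in a cloned temp file, then commit it atomically. */
+    static int commit_update(const char *path, const char *tmppath,
+                             const void *buf, size_t len, off_t off)
+    {
+        struct xfs_exchange_range xchg;
+        struct stat sb;
+        int ret = -1;
+        int fd, tmpfd;
+
+        fd = open(path, O_RDWR);
+        if (fd < 0)
+            return -1;
+        tmpfd = open(tmppath, O_RDWR | O_CREAT | O_EXCL, 0600);
+        if (tmpfd < 0)
+            goto out_fd;
+
+        /* Reflink the original contents into the staging file. */
+        if (ioctl(tmpfd, FICLONE, fd) < 0)
+            goto out_tmpfd;
+
+        /* The update lands only in the staging file. */
+        if (pwrite(tmpfd, buf, len, off) != (ssize_t)len || fsync(tmpfd) < 0)
+            goto out_tmpfd;
+
+        /* Exchange the whole mapping: all of the update or none of it. */
+        if (fstat(fd, &sb) < 0)
+            goto out_tmpfd;
+        memset(&xchg, 0, sizeof(xchg));
+        xchg.file1_fd = tmpfd;    /* field names are assumptions */
+        xchg.file1_offset = 0;
+        xchg.file2_offset = 0;
+        xchg.length = sb.st_size;
+        if (ioctl(fd, XFS_IOC_EXCHANGE_RANGE, &xchg) < 0)
+            goto out_tmpfd;
+        ret = 0;
+
+    out_tmpfd:
+        close(tmpfd);
+        unlink(tmppath);
+    out_fd:
+        close(fd);
+        return ret;
+    }
+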
@@ -5262,8 +5450,8 @@ of the file to try to share the physical space with a dummy file.
Cloning the extent means that the original owners cannot overwrite the
contents; any changes will be written somewhere else via copy-on-write.
Clearspace makes its own copy of the frozen extent in an area that is not being
-cleared, and uses ``FIEDEUPRANGE`` (or the :ref:`atomic extent swap
-<swapext_if_unchanged>` feature) to change the target file's data extent
+cleared, and uses ``FIDEDUPERANGE`` (or the :ref:`atomic file content exchanges
+<exchrange_if_unchanged>` feature) to change the target file's data extent
mapping away from the area being cleared.
When all other mappings have been moved, clearspace reflinks the space into the
space collector file so that it becomes unavailable.