summaryrefslogtreecommitdiff
path: root/Documentation/filesystems
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/filesystems')
-rw-r--r--Documentation/filesystems/9p.rst66
-rw-r--r--Documentation/filesystems/api-summary.rst3
-rw-r--r--Documentation/filesystems/autofs.rst6
-rw-r--r--Documentation/filesystems/bcachefs/CodingStyle.rst186
-rw-r--r--Documentation/filesystems/bcachefs/SubmittingPatches.rst105
-rw-r--r--Documentation/filesystems/bcachefs/casefolding.rst108
-rw-r--r--Documentation/filesystems/bcachefs/future/idle_work.rst78
-rw-r--r--Documentation/filesystems/bcachefs/index.rst29
-rw-r--r--Documentation/filesystems/buffer.rst12
-rw-r--r--Documentation/filesystems/caching/cachefiles.rst2
-rw-r--r--Documentation/filesystems/caching/fscache.rst8
-rw-r--r--Documentation/filesystems/ceph.rst15
-rw-r--r--Documentation/filesystems/coda.rst2
-rw-r--r--Documentation/filesystems/dax.rst1
-rw-r--r--Documentation/filesystems/debugfs.rst31
-rw-r--r--Documentation/filesystems/directory-locking.rst4
-rw-r--r--Documentation/filesystems/dlmfs.rst2
-rw-r--r--Documentation/filesystems/efivarfs.rst2
-rw-r--r--Documentation/filesystems/erofs.rst3
-rw-r--r--Documentation/filesystems/ext4/atomic_writes.rst225
-rw-r--r--Documentation/filesystems/ext4/overview.rst1
-rw-r--r--Documentation/filesystems/ext4/super.rst20
-rw-r--r--Documentation/filesystems/f2fs.rst126
-rw-r--r--Documentation/filesystems/fiemap.rst49
-rw-r--r--Documentation/filesystems/fscrypt.rst197
-rw-r--r--Documentation/filesystems/fsverity.rst45
-rw-r--r--Documentation/filesystems/fuse-io-uring.rst99
-rw-r--r--Documentation/filesystems/gfs2-glocks.rst55
-rw-r--r--Documentation/filesystems/idmappings.rst12
-rw-r--r--Documentation/filesystems/index.rst6
-rw-r--r--Documentation/filesystems/iomap/design.rst462
-rw-r--r--Documentation/filesystems/iomap/index.rst13
-rw-r--r--Documentation/filesystems/iomap/operations.rst744
-rw-r--r--Documentation/filesystems/iomap/porting.rst120
-rw-r--r--Documentation/filesystems/journalling.rst10
-rw-r--r--Documentation/filesystems/locking.rst69
-rw-r--r--Documentation/filesystems/mount_api.rst28
-rw-r--r--Documentation/filesystems/multigrain-ts.rst125
-rw-r--r--Documentation/filesystems/netfs_library.rst1019
-rw-r--r--Documentation/filesystems/nfs/exporting.rst7
-rw-r--r--Documentation/filesystems/nfs/index.rst1
-rw-r--r--Documentation/filesystems/nfs/localio.rst357
-rw-r--r--Documentation/filesystems/nfs/reexport.rst10
-rw-r--r--Documentation/filesystems/overlayfs.rst56
-rw-r--r--Documentation/filesystems/path-lookup.rst2
-rw-r--r--Documentation/filesystems/path-lookup.txt2
-rw-r--r--Documentation/filesystems/porting.rst123
-rw-r--r--Documentation/filesystems/proc.rst187
-rw-r--r--Documentation/filesystems/ramfs-rootfs-initramfs.rst2
-rw-r--r--Documentation/filesystems/relay.rst36
-rw-r--r--Documentation/filesystems/resctrl.rst1523
-rw-r--r--Documentation/filesystems/smb/ksmbd.rst26
-rw-r--r--Documentation/filesystems/squashfs.rst14
-rw-r--r--Documentation/filesystems/sysv-fs.rst264
-rw-r--r--Documentation/filesystems/tmpfs.rst24
-rw-r--r--Documentation/filesystems/vfs.rst105
-rw-r--r--Documentation/filesystems/xfs/xfs-delayed-logging-design.rst2
-rw-r--r--Documentation/filesystems/xfs/xfs-maintainer-entry-profile.rst2
-rw-r--r--Documentation/filesystems/xfs/xfs-online-fsck-design.rst636
59 files changed, 6284 insertions, 1183 deletions
diff --git a/Documentation/filesystems/9p.rst b/Documentation/filesystems/9p.rst
index 1e0e0bb6fdf9..be3504ca034a 100644
--- a/Documentation/filesystems/9p.rst
+++ b/Documentation/filesystems/9p.rst
@@ -31,7 +31,7 @@ Other applications are described in the following papers:
* PROSE I/O: Using 9p to enable Application Partitions
http://plan9.escet.urjc.es/iwp9/cready/PROSE_iwp9_2006.pdf
* VirtFS: A Virtualization Aware File System pass-through
- http://goo.gl/3WPDg
+ https://kernel.org/doc/ols/2010/ols2010-pages-109-120.pdf
Usage
=====
@@ -40,7 +40,7 @@ For remote file server::
mount -t 9p 10.10.1.2 /mnt/9
-For Plan 9 From User Space applications (http://swtch.com/plan9)::
+For Plan 9 From User Space applications (https://9fans.github.io/plan9port/)::
mount -t 9p `namespace`/acme /mnt/9 -o trans=unix,uname=$USER
@@ -48,11 +48,66 @@ For server running on QEMU host with virtio transport::
mount -t 9p -o trans=virtio <mount_tag> /mnt/9
-where mount_tag is the tag associated by the server to each of the exported
+where mount_tag is the tag generated by the server to each of the exported
mount points. Each 9P export is seen by the client as a virtio device with an
associated "mount_tag" property. Available mount tags can be
seen by reading /sys/bus/virtio/drivers/9pnet_virtio/virtio<n>/mount_tag files.
+USBG Usage
+==========
+
+To mount a 9p FS on a USB Host accessible via the gadget at runtime::
+
+ mount -t 9p -o trans=usbg,aname=/path/to/fs <device> /mnt/9
+
+To mount a 9p FS on a USB Host accessible via the gadget as root filesystem::
+
+ root=<device> rootfstype=9p rootflags=trans=usbg,cache=loose,uname=root,access=0,dfltuid=0,dfltgid=0,aname=/path/to/rootfs
+
+where <device> is the tag associated by the usb gadget transport.
+It is defined by the configfs instance name.
+
+USBG Example
+============
+
+The USB host exports a filesystem, while the gadget on the USB device
+side makes it mountable.
+
+Diod (9pfs server) and the forwarder are on the development host, where
+the root filesystem is actually stored. The gadget is initialized during
+boot (or later) on the embedded board. Then the forwarder will find it
+on the USB bus and start forwarding requests.
+
+In this case the 9p requests come from the device and are handled by the
+host. The reason is that USB device ports are normally not available on
+PCs, so a connection in the other direction would not work.
+
+When using the usbg transport, for now there is no native usb host
+service capable to handle the requests from the gadget driver. For
+this we have to use the extra python tool p9_fwd.py from tools/usb.
+
+Just start the 9pfs capable network server like diod/nfs-ganesha e.g.::
+
+ $ diod -f -n -d 0 -S -l 0.0.0.0:9999 -e $PWD
+
+Optionally scan your bus if there are more then one usbg gadgets to find their path::
+
+ $ python $kernel_dir/tools/usb/p9_fwd.py list
+
+ Bus | Addr | Manufacturer | Product | ID | Path
+ --- | ---- | ---------------- | ---------------- | --------- | ----
+ 2 | 67 | unknown | unknown | 1d6b:0109 | 2-1.1.2
+ 2 | 68 | unknown | unknown | 1d6b:0109 | 2-1.1.3
+
+Then start the python transport::
+
+ $ python $kernel_dir/tools/usb/p9_fwd.py --path 2-1.1.2 connect -p 9999
+
+After that the gadget driver can be used as described above.
+
+One use-case is to use it as an alternative to NFS root booting during
+the development of embedded Linux devices.
+
Options
=======
@@ -68,6 +123,7 @@ Options
virtio connect to the next virtio channel available
(from QEMU with trans_virtio module)
rdma connect to a specified RDMA channel
+ usbg connect to a specified usb gadget channel
======== ============================================
uname=name user name to attempt mount as on the remote server. The
@@ -109,8 +165,8 @@ Options
do not necessarily validate cached values on the server. In other
words changes on the server are not guaranteed to be reflected
on the client system. Only use this mode of operation if you
- have an exclusive mount and the server will modify the filesystem
- underneath you.
+ have an exclusive mount and the server will not modify the
+ filesystem underneath you.
debug=n specifies debug level. The debug level is a bitmask.
diff --git a/Documentation/filesystems/api-summary.rst b/Documentation/filesystems/api-summary.rst
index 98db2ea5fa12..cc5cc7f3fbd8 100644
--- a/Documentation/filesystems/api-summary.rst
+++ b/Documentation/filesystems/api-summary.rst
@@ -56,9 +56,6 @@ Other Functions
.. kernel-doc:: fs/namei.c
:export:
-.. kernel-doc:: fs/buffer.c
- :export:
-
.. kernel-doc:: block/bio.c
:export:
diff --git a/Documentation/filesystems/autofs.rst b/Documentation/filesystems/autofs.rst
index 3b6e38e646cd..5eb02394fcc3 100644
--- a/Documentation/filesystems/autofs.rst
+++ b/Documentation/filesystems/autofs.rst
@@ -18,7 +18,7 @@ key advantages:
2. The names and locations of filesystems can be stored in
a remote database and can change at any time. The content
- in that data base at the time of access will be used to provide
+ in that database at the time of access will be used to provide
a target for the access. The interpretation of names in the
filesystem can even be programmatic rather than database-backed,
allowing wildcards for example, and can vary based on the user who
@@ -423,7 +423,7 @@ The available ioctl commands are:
and objects are expired if the are not in use.
**AUTOFS_EXP_FORCED** causes the in use status to be ignored
- and objects are expired ieven if they are in use. This assumes
+ and objects are expired even if they are in use. This assumes
that the daemon has requested this because it is capable of
performing the umount.
@@ -442,7 +442,7 @@ which can be used to communicate directly with the autofs filesystem.
It requires CAP_SYS_ADMIN for access.
The 'ioctl's that can be used on this device are described in a separate
-document `autofs-mount-control.txt`, and are summarised briefly here.
+document `autofs-mount-control.rst`, and are summarised briefly here.
Each ioctl is passed a pointer to an `autofs_dev_ioctl` structure::
struct autofs_dev_ioctl {
diff --git a/Documentation/filesystems/bcachefs/CodingStyle.rst b/Documentation/filesystems/bcachefs/CodingStyle.rst
new file mode 100644
index 000000000000..b29562a6bf55
--- /dev/null
+++ b/Documentation/filesystems/bcachefs/CodingStyle.rst
@@ -0,0 +1,186 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+bcachefs coding style
+=====================
+
+Good development is like gardening, and codebases are our gardens. Tend to them
+every day; look for little things that are out of place or in need of tidying.
+A little weeding here and there goes a long way; don't wait until things have
+spiraled out of control.
+
+Things don't always have to be perfect - nitpicking often does more harm than
+good. But appreciate beauty when you see it - and let people know.
+
+The code that you are afraid to touch is the code most in need of refactoring.
+
+A little organizing here and there goes a long way.
+
+Put real thought into how you organize things.
+
+Good code is readable code, where the structure is simple and leaves nowhere
+for bugs to hide.
+
+Assertions are one of our most important tools for writing reliable code. If in
+the course of writing a patchset you encounter a condition that shouldn't
+happen (and will have unpredictable or undefined behaviour if it does), or
+you're not sure if it can happen and not sure how to handle it yet - make it a
+BUG_ON(). Don't leave undefined or unspecified behavior lurking in the codebase.
+
+By the time you finish the patchset, you should understand better which
+assertions need to be handled and turned into checks with error paths, and
+which should be logically impossible. Leave the BUG_ON()s in for the ones which
+are logically impossible. (Or, make them debug mode assertions if they're
+expensive - but don't turn everything into a debug mode assertion, so that
+we're not stuck debugging undefined behaviour should it turn out that you were
+wrong).
+
+Assertions are documentation that can't go out of date. Good assertions are
+wonderful.
+
+Good assertions drastically and dramatically reduce the amount of testing
+required to shake out bugs.
+
+Good assertions are based on state, not logic. To write good assertions, you
+have to think about what the invariants on your state are.
+
+Good invariants and assertions will hold everywhere in your codebase. This
+means that you can run them in only a few places in the checked in version, but
+should you need to debug something that caused the assertion to fail, you can
+quickly shotgun them everywhere to find the codepath that broke the invariant.
+
+A good assertion checks something that the compiler could check for us, and
+elide - if we were working in a language with embedded correctness proofs that
+the compiler could check. This is something that exists today, but it'll likely
+still be a few decades before it comes to systems programming languages. But we
+can still incorporate that kind of thinking into our code and document the
+invariants with runtime checks - much like the way people working in
+dynamically typed languages may add type annotations, gradually making their
+code statically typed.
+
+Looking for ways to make your assertions simpler - and higher level - will
+often nudge you towards making the entire system simpler and more robust.
+
+Good code is code where you can poke around and see what it's doing -
+introspection. We can't debug anything if we can't see what's going on.
+
+Whenever we're debugging, and the solution isn't immediately obvious, if the
+issue is that we don't know where the issue is because we can't see what's
+going on - fix that first.
+
+We have the tools to make anything visible at runtime, efficiently - RCU and
+percpu data structures among them. Don't let things stay hidden.
+
+The most important tool for introspection is the humble pretty printer - in
+bcachefs, this means `*_to_text()` functions, which output to printbufs.
+
+Pretty printers are wonderful, because they compose and you can use them
+everywhere. Having functions to print whatever object you're working with will
+make your error messages much easier to write (therefore they will actually
+exist) and much more informative. And they can be used from sysfs/debugfs, as
+well as tracepoints.
+
+Runtime info and debugging tools should come with clear descriptions and
+labels, and good structure - we don't want files with a list of bare integers,
+like in procfs. Part of the job of the debugging tools is to educate users and
+new developers as to how the system works.
+
+Error messages should, whenever possible, tell you everything you need to debug
+the issue. It's worth putting effort into them.
+
+Tracepoints shouldn't be the first thing you reach for. They're an important
+tool, but always look for more immediate ways to make things visible. When we
+have to rely on tracing, we have to know which tracepoints we're looking for,
+and then we have to run the troublesome workload, and then we have to sift
+through logs. This is a lot of steps to go through when a user is hitting
+something, and if it's intermittent it may not even be possible.
+
+The humble counter is an incredibly useful tool. They're cheap and simple to
+use, and many complicated internal operations with lots of things that can
+behave weirdly (anything involving memory reclaim, for example) become
+shockingly easy to debug once you have counters on every distinct codepath.
+
+Persistent counters are even better.
+
+When debugging, try to get the most out of every bug you come across; don't
+rush to fix the initial issue. Look for things that will make related bugs
+easier the next time around - introspection, new assertions, better error
+messages, new debug tools, and do those first. Look for ways to make the system
+better behaved; often one bug will uncover several other bugs through
+downstream effects.
+
+Fix all that first, and then the original bug last - even if that means keeping
+a user waiting. They'll thank you in the long run, and when they understand
+what you're doing you'll be amazed at how patient they're happy to be. Users
+like to help - otherwise they wouldn't be reporting the bug in the first place.
+
+Talk to your users. Don't isolate yourself.
+
+Users notice all sorts of interesting things, and by just talking to them and
+interacting with them you can benefit from their experience.
+
+Spend time doing support and helpdesk stuff. Don't just write code - code isn't
+finished until it's being used trouble free.
+
+This will also motivate you to make your debugging tools as good as possible,
+and perhaps even your documentation, too. Like anything else in life, the more
+time you spend at it the better you'll get, and you the developer are the
+person most able to improve the tools to make debugging quick and easy.
+
+Be wary of how you take on and commit to big projects. Don't let development
+become product-manager focused. Often time an idea is a good one but needs to
+wait for its proper time - but you won't know if it's the proper time for an
+idea until you start writing code.
+
+Expect to throw a lot of things away, or leave them half finished for later.
+Nobody writes all perfect code that all gets shipped, and you'll be much more
+productive in the long run if you notice this early and shift to something
+else. The experience gained and lessons learned will be valuable for all the
+other work you do.
+
+But don't be afraid to tackle projects that require significant rework of
+existing code. Sometimes these can be the best projects, because they can lead
+us to make existing code more general, more flexible, more multipurpose and
+perhaps more robust. Just don't hesitate to abandon the idea if it looks like
+it's going to make a mess of things.
+
+Complicated features can often be done as a series of refactorings, with the
+final change that actually implements the feature as a quite small patch at the
+end. It's wonderful when this happens, especially when those refactorings are
+things that improve the codebase in their own right. When that happens there's
+much less risk of wasted effort if the feature you were going for doesn't work
+out.
+
+Always strive to work incrementally. Always strive to turn the big projects
+into little bite sized projects that can prove their own merits.
+
+Instead of always tackling those big projects, look for little things that
+will be useful, and make the big projects easier.
+
+The question of what's likely to be useful is where junior developers most
+often go astray - doing something because it seems like it'll be useful often
+leads to overengineering. Knowing what's useful comes from many years of
+experience, or talking with people who have that experience - or from simply
+reading lots of code and looking for common patterns and issues. Don't be
+afraid to throw things away and do something simpler.
+
+Talk about your ideas with your fellow developers; often times the best things
+come from relaxed conversations where people aren't afraid to say "what if?".
+
+Don't neglect your tools.
+
+The most important tools (besides the compiler and our text editor) are the
+tools we use for testing. The shortest possible edit/test/debug cycle is
+essential for working productively. We learn, gain experience, and discover the
+errors in our thinking by running our code and seeing what happens. If your
+time is being wasted because your tools are bad or too slow - don't accept it,
+fix it.
+
+Put effort into your documentation, commit messages, and code comments - but
+don't go overboard. A good commit message is wonderful - but if the information
+was important enough to go in a commit message, ask yourself if it would be
+even better as a code comment.
+
+A good code comment is wonderful, but even better is the comment that didn't
+need to exist because the code was so straightforward as to be obvious;
+organized into small clean and tidy modules, with clear and descriptive names
+for functions and variables, where every line of code has a clear purpose.
diff --git a/Documentation/filesystems/bcachefs/SubmittingPatches.rst b/Documentation/filesystems/bcachefs/SubmittingPatches.rst
new file mode 100644
index 000000000000..18c79d548391
--- /dev/null
+++ b/Documentation/filesystems/bcachefs/SubmittingPatches.rst
@@ -0,0 +1,105 @@
+Submitting patches to bcachefs
+==============================
+
+Here are suggestions for submitting patches to bcachefs subsystem.
+
+Submission checklist
+--------------------
+
+Patches must be tested before being submitted, either with the xfstests suite
+[0]_, or the full bcachefs test suite in ktest [1]_, depending on what's being
+touched. Note that ktest wraps xfstests and will be an easier method to running
+it for most users; it includes single-command wrappers for all the mainstream
+in-kernel local filesystems.
+
+Patches will undergo more testing after being merged (including
+lockdep/kasan/preempt/etc. variants), these are not generally required to be
+run by the submitter - but do put some thought into what you're changing and
+which tests might be relevant, e.g. are you dealing with tricky memory layout
+work? kasan, are you doing locking work? then lockdep; and ktest includes
+single-command variants for the debug build types you'll most likely need.
+
+The exception to this rule is incomplete WIP/RFC patches: if you're working on
+something nontrivial, it's encouraged to send out a WIP patch to let people
+know what you're doing and make sure you're on the right track. Just make sure
+it includes a brief note as to what's done and what's incomplete, to avoid
+confusion.
+
+Rigorous checkpatch.pl adherence is not required (many of its warnings are
+considered out of date), but try not to deviate too much without reason.
+
+Focus on writing code that reads well and is organized well; code should be
+aesthetically pleasing.
+
+CI
+--
+
+Instead of running your tests locally, when running the full test suite it's
+preferable to let a server farm do it in parallel, and then have the results
+in a nice test dashboard (which can tell you which failures are new, and
+presents results in a git log view, avoiding the need for most bisecting).
+
+That exists [2]_, and community members may request an account. If you work for
+a big tech company, you'll need to help out with server costs to get access -
+but the CI is not restricted to running bcachefs tests: it runs any ktest test
+(which generally makes it easy to wrap other tests that can run in qemu).
+
+Other things to think about
+---------------------------
+
+- How will we debug this code? Is there sufficient introspection to diagnose
+ when something starts acting wonky on a user machine?
+
+ We don't necessarily need every single field of every data structure visible
+ with introspection, but having the important fields of all the core data
+ types wired up makes debugging drastically easier - a bit of thoughtful
+ foresight greatly reduces the need to have people build custom kernels with
+ debug patches.
+
+ More broadly, think about all the debug tooling that might be needed.
+
+- Does it make the codebase more or less of a mess? Can we also try to do some
+ organizing, too?
+
+- Do new tests need to be written? New assertions? How do we know and verify
+ that the code is correct, and what happens if something goes wrong?
+
+ We don't yet have automated code coverage analysis or easy fault injection -
+ but for now, pretend we did and ask what they might tell us.
+
+ Assertions are hugely important, given that we don't yet have a systems
+ language that can do ergonomic embedded correctness proofs. Hitting an assert
+ in testing is much better than wandering off into undefined behaviour la-la
+ land - use them. Use them judiciously, and not as a replacement for proper
+ error handling, but use them.
+
+- Does it need to be performance tested? Should we add new performance counters?
+
+ bcachefs has a set of persistent runtime counters which can be viewed with
+ the 'bcachefs fs top' command; this should give users a basic idea of what
+ their filesystem is currently doing. If you're doing a new feature or looking
+ at old code, think if anything should be added.
+
+- If it's a new on disk format feature - have upgrades and downgrades been
+ tested? (Automated tests exists but aren't in the CI, due to the hassle of
+ disk image management; coordinate to have them run.)
+
+Mailing list, IRC
+-----------------
+
+Patches should hit the list [3]_, but much discussion and code review happens
+on IRC as well [4]_; many people appreciate the more conversational approach
+and quicker feedback.
+
+Additionally, we have a lively user community doing excellent QA work, which
+exists primarily on IRC. Please make use of that resource; user feedback is
+important for any nontrivial feature, and documenting it in commit messages
+would be a good idea.
+
+.. rubric:: References
+
+.. [0] git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git
+.. [1] https://evilpiepirate.org/git/ktest.git/
+.. [2] https://evilpiepirate.org/~testdashboard/ci/
+.. [3] linux-bcachefs@vger.kernel.org
+.. [4] irc.oftc.net#bcache, #bcachefs-dev
diff --git a/Documentation/filesystems/bcachefs/casefolding.rst b/Documentation/filesystems/bcachefs/casefolding.rst
new file mode 100644
index 000000000000..871a38f557e8
--- /dev/null
+++ b/Documentation/filesystems/bcachefs/casefolding.rst
@@ -0,0 +1,108 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Casefolding
+===========
+
+bcachefs has support for case-insensitive file and directory
+lookups using the regular `chattr +F` (`S_CASEFOLD`, `FS_CASEFOLD_FL`)
+casefolding attributes.
+
+The main usecase for casefolding is compatibility with software written
+against other filesystems that rely on casefolded lookups
+(eg. NTFS and Wine/Proton).
+Taking advantage of file-system level casefolding can lead to great
+loading time gains in many applications and games.
+
+Casefolding support requires a kernel with the `CONFIG_UNICODE` enabled.
+Once a directory has been flagged for casefolding, a feature bit
+is enabled on the superblock which marks the filesystem as using
+casefolding.
+When the feature bit for casefolding is enabled, it is no longer possible
+to mount that filesystem on kernels without `CONFIG_UNICODE` enabled.
+
+On the lookup/query side: casefolding is implemented by allocating a new
+string of `BCH_NAME_MAX` length using the `utf8_casefold` function to
+casefold the query string.
+
+On the dirent side: casefolding is implemented by ensuring the `bkey`'s
+hash is made from the casefolded string and storing the cached casefolded
+name with the regular name in the dirent.
+
+The structure looks like this:
+
+* Regular: [dirent data][regular name][nul][nul]...
+* Casefolded: [dirent data][reg len][cf len][regular name][casefolded name][nul][nul]...
+
+(Do note, the number of NULs here is merely for illustration; their count can
+vary per-key, and they may not even be present if the key is aligned to
+`sizeof(u64)`.)
+
+This is efficient as it means that for all file lookups that require casefolding,
+it has identical performance to a regular lookup:
+a hash comparison and a `memcmp` of the name.
+
+Rationale
+---------
+
+Several designs were considered for this system:
+One was to introduce a dirent_v2, however that would be painful especially as
+the hash system only has support for a single key type. This would also need
+`BCH_NAME_MAX` to change between versions, and a new feature bit.
+
+Another option was to store without the two lengths, and just take the length of
+the regular name and casefolded name contiguously / 2 as the length. This would
+assume that the regular length == casefolded length, but that could potentially
+not be true, if the uppercase unicode glyph had a different UTF-8 encoding than
+the lowercase unicode glyph.
+It would be possible to disregard the casefold cache for those cases, but it was
+decided to simply encode the two string lengths in the key to avoid random
+performance issues if this edgecase was ever hit.
+
+The option settled on was to use a free-bit in d_type to mark a dirent as having
+a casefold cache, and then treat the first 4 bytes the name block as lengths.
+You can see this in the `d_cf_name_block` member of union in `bch_dirent`.
+
+The feature bit was used to allow casefolding support to be enabled for the majority
+of users, but some allow users who have no need for the feature to still use bcachefs as
+`CONFIG_UNICODE` can increase the kernel side a significant amount due to the tables used,
+which may be decider between using bcachefs for eg. embedded platforms.
+
+Other filesystems like ext4 and f2fs have a super-block level option for casefolding
+encoding, but bcachefs currently does not provide this. ext4 and f2fs do not expose
+any encodings than a single UTF-8 version. When future encodings are desirable,
+they will be added trivially using the opts mechanism.
+
+dentry/dcache considerations
+----------------------------
+
+Currently, in casefolded directories, bcachefs (like other filesystems) will not cache
+negative dentry's.
+
+This is because currently doing so presents a problem in the following scenario:
+
+ - Lookup file "blAH" in a casefolded directory
+ - Creation of file "BLAH" in a casefolded directory
+ - Lookup file "blAH" in a casefolded directory
+
+This would fail if negative dentry's were cached.
+
+This is slightly suboptimal, but could be fixed in future with some vfs work.
+
+
+References
+----------
+
+(from Peter Anvin, on the list)
+
+It is worth noting that Microsoft has basically declared their
+"recommended" case folding (upcase) table to be permanently frozen (for
+new filesystem instances in the case where they use an on-disk
+translation table created at format time.) As far as I know they have
+never supported anything other than 1:1 conversion of BMP code points,
+nor normalization.
+
+The exFAT specification enumerates the full recommended upcase table,
+although in a somewhat annoying format (basically a hex dump of
+compressed data):
+
+https://learn.microsoft.com/en-us/windows/win32/fileio/exfat-specification
diff --git a/Documentation/filesystems/bcachefs/future/idle_work.rst b/Documentation/filesystems/bcachefs/future/idle_work.rst
new file mode 100644
index 000000000000..59a332509dcd
--- /dev/null
+++ b/Documentation/filesystems/bcachefs/future/idle_work.rst
@@ -0,0 +1,78 @@
+Idle/background work classes design doc:
+
+Right now, our behaviour at idle isn't ideal, it was designed for servers that
+would be under sustained load, to keep pending work at a "medium" level, to
+let work build up so we can process it in more efficient batches, while also
+giving headroom for bursts in load.
+
+But for desktops or mobile - scenarios where work is less sustained and power
+usage is more important - we want to operate differently, with a "rush to
+idle" so the system can go to sleep. We don't want to be dribbling out
+background work while the system should be idle.
+
+The complicating factor is that there are a number of background tasks, which
+form a heirarchy (or a digraph, depending on how you divide it up) - one
+background task may generate work for another.
+
+Thus proper idle detection needs to model this heirarchy.
+
+- Foreground writes
+- Page cache writeback
+- Copygc, rebalance
+- Journal reclaim
+
+When we implement idle detection and rush to idle, we need to be careful not
+to disturb too much the existing behaviour that works reasonably well when the
+system is under sustained load (or perhaps improve it in the case of
+rebalance, which currently does not actively attempt to let work batch up).
+
+SUSTAINED LOAD REGIME
+---------------------
+
+When the system is under continuous load, we want these jobs to run
+continuously - this is perhaps best modelled with a P/D controller, where
+they'll be trying to keep a target value (i.e. fragmented disk space,
+available journal space) roughly in the middle of some range.
+
+The goal under sustained load is to balance our ability to handle load spikes
+without running out of x resource (free disk space, free space in the
+journal), while also letting some work accumululate to be batched (or become
+unnecessary).
+
+For example, we don't want to run copygc too aggressively, because then it
+will be evacuating buckets that would have become empty (been overwritten or
+deleted) anyways, and we don't want to wait until we're almost out of free
+space because then the system will behave unpredicably - suddenly we're doing
+a lot more work to service each write and the system becomes much slower.
+
+IDLE REGIME
+-----------
+
+When the system becomes idle, we should start flushing our pending work
+quicker so the system can go to sleep.
+
+Note that the definition of "idle" depends on where in the heirarchy a task
+is - a task should start flushing work more quickly when the task above it has
+stopped generating new work.
+
+e.g. rebalance should start flushing more quickly when page cache writeback is
+idle, and journal reclaim should only start flushing more quickly when both
+copygc and rebalance are idle.
+
+It's important to let work accumulate when more work is still incoming and we
+still have room, because flushing is always more efficient if we let it batch
+up. New writes may overwrite data before rebalance moves it, and tasks may be
+generating more updates for the btree nodes that journal reclaim needs to flush.
+
+On idle, how much work we do at each interval should be proportional to the
+length of time we have been idle for. If we're idle only for a short duration,
+we shouldn't flush everything right away; the system might wake up and start
+generating new work soon, and flushing immediately might end up doing a lot of
+work that would have been unnecessary if we'd allowed things to batch more.
+
+To summarize, we will need:
+
+ - A list of classes for background tasks that generate work, which will
+ include one "foreground" class.
+ - Tracking for each class - "Am I doing work, or have I gone to sleep?"
+ - And each class should check the class above it when deciding how much work to issue.
diff --git a/Documentation/filesystems/bcachefs/index.rst b/Documentation/filesystems/bcachefs/index.rst
index e2bd61ccd96f..e5c4c2120b93 100644
--- a/Documentation/filesystems/bcachefs/index.rst
+++ b/Documentation/filesystems/bcachefs/index.rst
@@ -4,8 +4,35 @@
bcachefs Documentation
======================
+Subsystem-specific development process notes
+--------------------------------------------
+
+Development notes specific to bcachefs. These are intended to supplement
+:doc:`general kernel development handbook </process/index>`.
+
.. toctree::
- :maxdepth: 2
+ :maxdepth: 1
:numbered:
+ CodingStyle
+ SubmittingPatches
+
+Filesystem implementation
+-------------------------
+
+Documentation for filesystem features and their implementation details.
+At this moment, only a few of these are described here.
+
+.. toctree::
+ :maxdepth: 1
+ :numbered:
+
+ casefolding
errorcodes
+
+Future design
+-------------
+.. toctree::
+ :maxdepth: 1
+
+ future/idle_work
diff --git a/Documentation/filesystems/buffer.rst b/Documentation/filesystems/buffer.rst
new file mode 100644
index 000000000000..ae24faf68efa
--- /dev/null
+++ b/Documentation/filesystems/buffer.rst
@@ -0,0 +1,12 @@
+Buffer Heads
+============
+
+Linux uses buffer heads to maintain state about individual filesystem blocks.
+Buffer heads are deprecated and new filesystems should use iomap instead.
+
+Functions
+---------
+
+.. kernel-doc:: include/linux/buffer_head.h
+.. kernel-doc:: fs/buffer.c
+ :export:
diff --git a/Documentation/filesystems/caching/cachefiles.rst b/Documentation/filesystems/caching/cachefiles.rst
index e04a27bdbe19..b3ccc782cb3b 100644
--- a/Documentation/filesystems/caching/cachefiles.rst
+++ b/Documentation/filesystems/caching/cachefiles.rst
@@ -115,7 +115,7 @@ set up cache ready for use. The following script commands are available:
This mask can also be set through sysfs, eg::
- echo 5 >/sys/modules/cachefiles/parameters/debug
+ echo 5 > /sys/module/cachefiles/parameters/debug
Starting the Cache
diff --git a/Documentation/filesystems/caching/fscache.rst b/Documentation/filesystems/caching/fscache.rst
index a74d7b052dc1..de1f32526cc1 100644
--- a/Documentation/filesystems/caching/fscache.rst
+++ b/Documentation/filesystems/caching/fscache.rst
@@ -318,10 +318,10 @@ where the columns are:
Debugging
=========
-If CONFIG_FSCACHE_DEBUG is enabled, the FS-Cache facility can have runtime
-debugging enabled by adjusting the value in::
+If CONFIG_NETFS_DEBUG is enabled, the FS-Cache facility and NETFS support can
+have runtime debugging enabled by adjusting the value in::
- /sys/module/fscache/parameters/debug
+ /sys/module/netfs/parameters/debug
This is a bitmask of debugging streams to enable:
@@ -343,6 +343,6 @@ This is a bitmask of debugging streams to enable:
The appropriate set of values should be OR'd together and the result written to
the control file. For example::
- echo $((1|8|512)) >/sys/module/fscache/parameters/debug
+ echo $((1|8|512)) >/sys/module/netfs/parameters/debug
will turn on all function entry debugging.
diff --git a/Documentation/filesystems/ceph.rst b/Documentation/filesystems/ceph.rst
index 085f309ece60..6d2276a87a5a 100644
--- a/Documentation/filesystems/ceph.rst
+++ b/Documentation/filesystems/ceph.rst
@@ -67,12 +67,15 @@ Snapshot names have two limitations:
more than 255 characters, and `<node-id>` takes 13 characters, the long
snapshot names can take as much as 255 - 1 - 1 - 13 = 240.
-Ceph also provides some recursive accounting on directories for nested
-files and bytes. That is, a 'getfattr -d foo' on any directory in the
-system will reveal the total number of nested regular files and
-subdirectories, and a summation of all nested file sizes. This makes
-the identification of large disk space consumers relatively quick, as
-no 'du' or similar recursive scan of the file system is required.
+Ceph also provides some recursive accounting on directories for nested files
+and bytes. You can run the commands::
+
+ getfattr -n ceph.dir.rfiles /some/dir
+ getfattr -n ceph.dir.rbytes /some/dir
+
+to get the total number of nested files and their combined size, respectively.
+This makes the identification of large disk space consumers relatively quick,
+as no 'du' or similar recursive scan of the file system is required.
Finally, Ceph also allows quotas to be set on any directory in the system.
The quota can restrict the number of bytes or the number of files stored
diff --git a/Documentation/filesystems/coda.rst b/Documentation/filesystems/coda.rst
index bdde7e4e010b..0db3c83a50e5 100644
--- a/Documentation/filesystems/coda.rst
+++ b/Documentation/filesystems/coda.rst
@@ -141,7 +141,7 @@ kernel support.
a process P which accessing a Coda file. It makes a system call which
traps to the OS kernel. Examples of such calls trapping to the kernel
are ``read``, ``write``, ``open``, ``close``, ``create``, ``mkdir``,
- ``rmdir``, ``chmod`` in a Unix ontext. Similar calls exist in the Win32
+ ``rmdir``, ``chmod`` in a Unix context. Similar calls exist in the Win32
environment, and are named ``CreateFile``.
Generally the operating system handles the request in a virtual
diff --git a/Documentation/filesystems/dax.rst b/Documentation/filesystems/dax.rst
index 719e90f1988e..08dd5e254cc5 100644
--- a/Documentation/filesystems/dax.rst
+++ b/Documentation/filesystems/dax.rst
@@ -207,7 +207,6 @@ implement direct_access.
These block devices may be used for inspiration:
- brd: RAM backed block device driver
-- dcssblk: s390 dcss block device driver
- pmem: NVDIMM persistent memory driver
diff --git a/Documentation/filesystems/debugfs.rst b/Documentation/filesystems/debugfs.rst
index dc35da8b8792..55f807293924 100644
--- a/Documentation/filesystems/debugfs.rst
+++ b/Documentation/filesystems/debugfs.rst
@@ -211,18 +211,16 @@ seq_file content.
There are a couple of other directory-oriented helper functions::
- struct dentry *debugfs_rename(struct dentry *old_dir,
- struct dentry *old_dentry,
- struct dentry *new_dir,
- const char *new_name);
+ struct dentry *debugfs_change_name(struct dentry *dentry,
+ const char *fmt, ...);
struct dentry *debugfs_create_symlink(const char *name,
struct dentry *parent,
const char *target);
-A call to debugfs_rename() will give a new name to an existing debugfs
-file, possibly in a different directory. The new_name must not exist prior
-to the call; the return value is old_dentry with updated information.
+A call to debugfs_change_name() will give a new name to an existing debugfs
+file, always in the same directory. The new_name must not exist prior
+to the call; the return value is 0 on success and -E... on failure.
Symbolic links can be created with debugfs_create_symlink().
There is one important thing that all debugfs users must take into account:
@@ -231,22 +229,15 @@ module is unloaded without explicitly removing debugfs entries, the result
will be a lot of stale pointers and no end of highly antisocial behavior.
So all debugfs users - at least those which can be built as modules - must
be prepared to remove all files and directories they create there. A file
-can be removed with::
+or directory can be removed with::
void debugfs_remove(struct dentry *dentry);
The dentry value can be NULL or an error value, in which case nothing will
-be removed.
-
-Once upon a time, debugfs users were required to remember the dentry
-pointer for every debugfs file they created so that all files could be
-cleaned up. We live in more civilized times now, though, and debugfs users
-can call::
-
- void debugfs_remove_recursive(struct dentry *dentry);
-
-If this function is passed a pointer for the dentry corresponding to the
-top-level directory, the entire hierarchy below that directory will be
-removed.
+be removed. Note that this function will recursively remove all files and
+directories underneath it. Previously, debugfs_remove_recursive() was used
+to perform that task, but this function is now just an alias to
+debugfs_remove(). debugfs_remove_recursive() should be considered
+deprecated.
.. [1] http://lwn.net/Articles/309298/
diff --git a/Documentation/filesystems/directory-locking.rst b/Documentation/filesystems/directory-locking.rst
index 05ea387bc9fb..6fdf0b02df43 100644
--- a/Documentation/filesystems/directory-locking.rst
+++ b/Documentation/filesystems/directory-locking.rst
@@ -44,7 +44,7 @@ For our purposes all operations fall in 6 classes:
* decide which of the source and target need to be locked.
The source needs to be locked if it's a non-directory, target - if it's
a non-directory or about to be removed.
- * take the locks that need to be taken (exlusive), in inode pointer order
+ * take the locks that need to be taken (exclusive), in inode pointer order
if need to take both (that can happen only when both source and target
are non-directories - the source because it wouldn't need to be locked
otherwise and the target because mixing directory and non-directory is
@@ -234,7 +234,7 @@ among the children, in some order. But that is also impossible, since
neither of the children is a descendent of another.
That concludes the proof, since the set of operations with the
-properties requiered for a minimal deadlock can not exist.
+properties required for a minimal deadlock can not exist.
Note that the check for having a common ancestor in cross-directory
rename is crucial - without it a deadlock would be possible. Indeed,
diff --git a/Documentation/filesystems/dlmfs.rst b/Documentation/filesystems/dlmfs.rst
index 7e2b1fd471d7..70d4e48242c3 100644
--- a/Documentation/filesystems/dlmfs.rst
+++ b/Documentation/filesystems/dlmfs.rst
@@ -36,7 +36,7 @@ None
Usage
=====
-If you're just interested in OCFS2, then please see ocfs2.txt. The
+If you're just interested in OCFS2, then please see ocfs2.rst. The
rest of this document will be geared towards those who want to use
dlmfs for easy to setup and easy to use clustered locking in
userspace.
diff --git a/Documentation/filesystems/efivarfs.rst b/Documentation/filesystems/efivarfs.rst
index 0551985821b8..f646c3f0980f 100644
--- a/Documentation/filesystems/efivarfs.rst
+++ b/Documentation/filesystems/efivarfs.rst
@@ -40,4 +40,4 @@ accidentally.
*See also:*
- Documentation/admin-guide/acpi/ssdt-overlays.rst
-- Documentation/ABI/stable/sysfs-firmware-efi-vars
+- Documentation/ABI/removed/sysfs-firmware-efi-vars
diff --git a/Documentation/filesystems/erofs.rst b/Documentation/filesystems/erofs.rst
index cc4626d6ee4f..7ddb235aee9d 100644
--- a/Documentation/filesystems/erofs.rst
+++ b/Documentation/filesystems/erofs.rst
@@ -75,7 +75,7 @@ Here are the main features of EROFS:
- Support merging tail-end data into a special inode as fragments.
- - Support large folios for uncompressed files.
+ - Support large folios to make use of THPs (Transparent Hugepages);
- Support direct I/O on uncompressed files to avoid double caching for loop
devices;
@@ -128,6 +128,7 @@ device=%s Specify a path to an extra device to be used together.
fsid=%s Specify a filesystem image ID for Fscache back-end.
domain_id=%s Specify a domain ID in fscache mode so that different images
with the same blobs under a given domain ID can share storage.
+fsoffset=%llu Specify block-aligned filesystem offset for the primary device.
=================== =========================================================
Sysfs Entries
diff --git a/Documentation/filesystems/ext4/atomic_writes.rst b/Documentation/filesystems/ext4/atomic_writes.rst
new file mode 100644
index 000000000000..f65767df3620
--- /dev/null
+++ b/Documentation/filesystems/ext4/atomic_writes.rst
@@ -0,0 +1,225 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. _atomic_writes:
+
+Atomic Block Writes
+-------------------------
+
+Introduction
+~~~~~~~~~~~~
+
+Atomic (untorn) block writes ensure that either the entire write is committed
+to disk or none of it is. This prevents "torn writes" during power loss or
+system crashes. The ext4 filesystem supports atomic writes (only with Direct
+I/O) on regular files with extents, provided the underlying storage device
+supports hardware atomic writes. This is supported in the following two ways:
+
+1. **Single-fsblock Atomic Writes**:
+ EXT4's supports atomic write operations with a single filesystem block since
+ v6.13. In this the atomic write unit minimum and maximum sizes are both set
+ to filesystem blocksize.
+ e.g. doing atomic write of 16KB with 16KB filesystem blocksize on 64KB
+ pagesize system is possible.
+
+2. **Multi-fsblock Atomic Writes with Bigalloc**:
+ EXT4 now also supports atomic writes spanning multiple filesystem blocks
+ using a feature known as bigalloc. The atomic write unit's minimum and
+ maximum sizes are determined by the filesystem block size and cluster size,
+ based on the underlying device’s supported atomic write unit limits.
+
+Requirements
+~~~~~~~~~~~~
+
+Basic requirements for atomic writes in ext4:
+
+ 1. The extents feature must be enabled (default for ext4)
+ 2. The underlying block device must support atomic writes
+ 3. For single-fsblock atomic writes:
+
+ 1. A filesystem with appropriate block size (up to the page size)
+ 4. For multi-fsblock atomic writes:
+
+ 1. The bigalloc feature must be enabled
+ 2. The cluster size must be appropriately configured
+
+NOTE: EXT4 does not support software or COW based atomic write, which means
+atomic writes on ext4 are only supported if underlying storage device supports
+it.
+
+Multi-fsblock Implementation Details
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The bigalloc feature changes ext4 to allocate in units of multiple filesystem
+blocks, also known as clusters. With bigalloc each bit within block bitmap
+represents cluster (power of 2 number of blocks) rather than individual
+filesystem blocks.
+EXT4 supports multi-fsblock atomic writes with bigalloc, subject to the
+following constraints. The minimum atomic write size is the larger of the fs
+block size and the minimum hardware atomic write unit; and the maximum atomic
+write size is smaller of the bigalloc cluster size and the maximum hardware
+atomic write unit. Bigalloc ensures that all allocations are aligned to the
+cluster size, which satisfies the LBA alignment requirements of the hardware
+device if the start of the partition/logical volume is itself aligned correctly.
+
+Here is the block allocation strategy in bigalloc for atomic writes:
+
+ * For regions with fully mapped extents, no additional work is needed
+ * For append writes, a new mapped extent is allocated
+ * For regions that are entirely holes, unwritten extent is created
+ * For large unwritten extents, the extent gets split into two unwritten
+ extents of appropriate requested size
+ * For mixed mapping regions (combinations of holes, unwritten extents, or
+ mapped extents), ext4_map_blocks() is called in a loop with
+ EXT4_GET_BLOCKS_ZERO flag to convert the region into a single contiguous
+ mapped extent by writing zeroes to it and converting any unwritten extents to
+ written, if found within the range.
+
+Note: Writing on a single contiguous underlying extent, whether mapped or
+unwritten, is not inherently problematic. However, writing to a mixed mapping
+region (i.e. one containing a combination of mapped and unwritten extents)
+must be avoided when performing atomic writes.
+
+The reason is that, atomic writes when issued via pwritev2() with the RWF_ATOMIC
+flag, requires that either all data is written or none at all. In the event of
+a system crash or unexpected power loss during the write operation, the affected
+region (when later read) must reflect either the complete old data or the
+complete new data, but never a mix of both.
+
+To enforce this guarantee, we ensure that the write target is backed by
+a single, contiguous extent before any data is written. This is critical because
+ext4 defers the conversion of unwritten extents to written extents until the I/O
+completion path (typically in ->end_io()). If a write is allowed to proceed over
+a mixed mapping region (with mapped and unwritten extents) and a failure occurs
+mid-write, the system could observe partially updated regions after reboot, i.e.
+new data over mapped areas, and stale (old) data over unwritten extents that
+were never marked written. This violates the atomicity and/or torn write
+prevention guarantee.
+
+To prevent such torn writes, ext4 proactively allocates a single contiguous
+extent for the entire requested region in ``ext4_iomap_alloc`` via
+``ext4_map_blocks_atomic()``. EXT4 also force commits the current journalling
+transaction in case if allocation is done over mixed mapping. This ensures any
+pending metadata updates (like unwritten to written extents conversion) in this
+range are in consistent state with the file data blocks, before performing the
+actual write I/O. If the commit fails, the whole I/O must be aborted to prevent
+from any possible torn writes.
+Only after this step, the actual data write operation is performed by the iomap.
+
+Handling Split Extents Across Leaf Blocks
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+There can be a special edge case where we have logically and physically
+contiguous extents stored in separate leaf nodes of the on-disk extent tree.
+This occurs because on-disk extent tree merges only happens within the leaf
+blocks except for a case where we have 2-level tree which can get merged and
+collapsed entirely into the inode.
+If such a layout exists and, in the worst case, the extent status cache entries
+are reclaimed due to memory pressure, ``ext4_map_blocks()`` may never return
+a single contiguous extent for these split leaf extents.
+
+To address this edge case, a new get block flag
+``EXT4_GET_BLOCKS_QUERY_LEAF_BLOCKS flag`` is added to enhance the
+``ext4_map_query_blocks()`` lookup behavior.
+
+This new get block flag allows ``ext4_map_blocks()`` to first check if there is
+an entry in the extent status cache for the full range.
+If not present, it consults the on-disk extent tree using
+``ext4_map_query_blocks()``.
+If the located extent is at the end of a leaf node, it probes the next logical
+block (lblk) to detect a contiguous extent in the adjacent leaf.
+
+For now only one additional leaf block is queried to maintain efficiency, as
+atomic writes are typically constrained to small sizes
+(e.g. [blocksize, clustersize]).
+
+
+Handling Journal transactions
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+To support multi-fsblock atomic writes, we ensure enough journal credits are
+reserved during:
+
+ 1. Block allocation time in ``ext4_iomap_alloc()``. We first query if there
+ could be a mixed mapping for the underlying requested range. If yes, then we
+ reserve credits of up to ``m_len``, assuming every alternate block can be
+ an unwritten extent followed by a hole.
+
+ 2. During ``->end_io()`` call, we make sure a single transaction is started for
+ doing unwritten-to-written conversion. The loop for conversion is mainly
+ only required to handle a split extent across leaf blocks.
+
+How to
+------
+
+Creating Filesystems with Atomic Write Support
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+First check the atomic write units supported by block device.
+See :ref:`atomic_write_bdev_support` for more details.
+
+For single-fsblock atomic writes with a larger block size
+(on systems with block size < page size):
+
+.. code-block:: bash
+
+ # Create an ext4 filesystem with a 16KB block size
+ # (requires page size >= 16KB)
+ mkfs.ext4 -b 16384 /dev/device
+
+For multi-fsblock atomic writes with bigalloc:
+
+.. code-block:: bash
+
+ # Create an ext4 filesystem with bigalloc and 64KB cluster size
+ mkfs.ext4 -F -O bigalloc -b 4096 -C 65536 /dev/device
+
+Where ``-b`` specifies the block size, ``-C`` specifies the cluster size in bytes,
+and ``-O bigalloc`` enables the bigalloc feature.
+
+Application Interface
+~~~~~~~~~~~~~~~~~~~~~
+
+Applications can use the ``pwritev2()`` system call with the ``RWF_ATOMIC`` flag
+to perform atomic writes:
+
+.. code-block:: c
+
+ pwritev2(fd, iov, iovcnt, offset, RWF_ATOMIC);
+
+The write must be aligned to the filesystem's block size and not exceed the
+filesystem's maximum atomic write unit size.
+See ``generic_atomic_write_valid()`` for more details.
+
+``statx()`` system call with ``STATX_WRITE_ATOMIC`` flag can provides following
+details:
+
+ * ``stx_atomic_write_unit_min``: Minimum size of an atomic write request.
+ * ``stx_atomic_write_unit_max``: Maximum size of an atomic write request.
+ * ``stx_atomic_write_segments_max``: Upper limit for segments. The number of
+ separate memory buffers that can be gathered into a write operation
+ (e.g., the iovcnt parameter for IOV_ITER). Currently, this is always set to one.
+
+The STATX_ATTR_WRITE_ATOMIC flag in ``statx->attributes`` is set if atomic
+writes are supported.
+
+.. _atomic_write_bdev_support:
+
+Hardware Support
+----------------
+
+The underlying storage device must support atomic write operations.
+Modern NVMe and SCSI devices often provide this capability.
+The Linux kernel exposes this information through sysfs:
+
+* ``/sys/block/<device>/queue/atomic_write_unit_min`` - Minimum atomic write size
+* ``/sys/block/<device>/queue/atomic_write_unit_max`` - Maximum atomic write size
+
+Nonzero values for these attributes indicate that the device supports
+atomic writes.
+
+See Also
+--------
+
+* :doc:`bigalloc` - Documentation on the bigalloc feature
+* :doc:`allocators` - Documentation on block allocation in ext4
+* Support for atomic block writes in 6.13:
+ https://lwn.net/Articles/1009298/
diff --git a/Documentation/filesystems/ext4/overview.rst b/Documentation/filesystems/ext4/overview.rst
index 0fad6eda6e15..9d4054c17ecb 100644
--- a/Documentation/filesystems/ext4/overview.rst
+++ b/Documentation/filesystems/ext4/overview.rst
@@ -25,3 +25,4 @@ order.
.. include:: inlinedata.rst
.. include:: eainode.rst
.. include:: verity.rst
+.. include:: atomic_writes.rst
diff --git a/Documentation/filesystems/ext4/super.rst b/Documentation/filesystems/ext4/super.rst
index a1eb4a11a1d0..1b240661bfa3 100644
--- a/Documentation/filesystems/ext4/super.rst
+++ b/Documentation/filesystems/ext4/super.rst
@@ -328,9 +328,13 @@ The ext4 superblock is laid out as follows in
- s_checksum_type
- Metadata checksum algorithm type. The only valid value is 1 (crc32c).
* - 0x176
- - __le16
- - s_reserved_pad
- -
+ - \_\_u8
+ - s\_encryption\_level
+ - Versioning level for encryption.
+ * - 0x177
+ - \_\_u8
+ - s\_reserved\_pad
+ - Padding to next 32bits.
* - 0x178
- __le64
- s_kbytes_written
@@ -466,9 +470,13 @@ The ext4 superblock is laid out as follows in
- s_last_error_time_hi
- Upper 8 bits of the s_last_error_time field.
* - 0x27A
- - __u8
- - s_pad[2]
- - Zero padding.
+ - \_\_u8
+ - s\_first\_error\_errcode
+ -
+ * - 0x27B
+ - \_\_u8
+ - s\_last\_error\_errcode
+ -
* - 0x27C
- __le16
- s_encoding
diff --git a/Documentation/filesystems/f2fs.rst b/Documentation/filesystems/f2fs.rst
index efc3493fd6f8..440e4ae74e44 100644
--- a/Documentation/filesystems/f2fs.rst
+++ b/Documentation/filesystems/f2fs.rst
@@ -182,31 +182,34 @@ fault_type=%d Support configuring fault injection type, should be
enabled with fault_injection option, fault type value
is shown below, it supports single or combined type.
- =========================== ===========
+ =========================== ==========
Type_Name Type_Value
- =========================== ===========
- FAULT_KMALLOC 0x000000001
- FAULT_KVMALLOC 0x000000002
- FAULT_PAGE_ALLOC 0x000000004
- FAULT_PAGE_GET 0x000000008
- FAULT_ALLOC_BIO 0x000000010 (obsolete)
- FAULT_ALLOC_NID 0x000000020
- FAULT_ORPHAN 0x000000040
- FAULT_BLOCK 0x000000080
- FAULT_DIR_DEPTH 0x000000100
- FAULT_EVICT_INODE 0x000000200
- FAULT_TRUNCATE 0x000000400
- FAULT_READ_IO 0x000000800
- FAULT_CHECKPOINT 0x000001000
- FAULT_DISCARD 0x000002000
- FAULT_WRITE_IO 0x000004000
- FAULT_SLAB_ALLOC 0x000008000
- FAULT_DQUOT_INIT 0x000010000
- FAULT_LOCK_OP 0x000020000
- FAULT_BLKADDR_VALIDITY 0x000040000
- FAULT_BLKADDR_CONSISTENCE 0x000080000
- FAULT_NO_SEGMENT 0x000100000
- =========================== ===========
+ =========================== ==========
+ FAULT_KMALLOC 0x00000001
+ FAULT_KVMALLOC 0x00000002
+ FAULT_PAGE_ALLOC 0x00000004
+ FAULT_PAGE_GET 0x00000008
+ FAULT_ALLOC_BIO 0x00000010 (obsolete)
+ FAULT_ALLOC_NID 0x00000020
+ FAULT_ORPHAN 0x00000040
+ FAULT_BLOCK 0x00000080
+ FAULT_DIR_DEPTH 0x00000100
+ FAULT_EVICT_INODE 0x00000200
+ FAULT_TRUNCATE 0x00000400
+ FAULT_READ_IO 0x00000800
+ FAULT_CHECKPOINT 0x00001000
+ FAULT_DISCARD 0x00002000
+ FAULT_WRITE_IO 0x00004000
+ FAULT_SLAB_ALLOC 0x00008000
+ FAULT_DQUOT_INIT 0x00010000
+ FAULT_LOCK_OP 0x00020000
+ FAULT_BLKADDR_VALIDITY 0x00040000
+ FAULT_BLKADDR_CONSISTENCE 0x00080000
+ FAULT_NO_SEGMENT 0x00100000
+ FAULT_INCONSISTENT_FOOTER 0x00200000
+ FAULT_TIMEOUT 0x00400000 (1000ms)
+ FAULT_VMALLOC 0x00800000
+ =========================== ==========
mode=%s Control block allocation mode which supports "adaptive"
and "lfs". In "lfs" mode, there should be no random
writes towards main area.
@@ -365,6 +368,8 @@ errors=%s Specify f2fs behavior on critical errors. This supports modes:
pending node write drop keep N/A
pending meta write keep keep N/A
====================== =============== =============== ========
+nat_bits Enable nat_bits feature to enhance full/empty nat blocks access,
+ by default it's disabled.
======================== ============================================================
Debugfs Entries
@@ -774,6 +779,35 @@ In order to identify whether the data in the victim segment are valid or not,
F2FS manages a bitmap. Each bit represents the validity of a block, and the
bitmap is composed of a bit stream covering whole blocks in main area.
+Write-hint Policy
+-----------------
+
+F2FS sets the whint all the time with the below policy.
+
+===================== ======================== ===================
+User F2FS Block
+===================== ======================== ===================
+N/A META WRITE_LIFE_NONE|REQ_META
+N/A HOT_NODE WRITE_LIFE_NONE
+N/A WARM_NODE WRITE_LIFE_MEDIUM
+N/A COLD_NODE WRITE_LIFE_LONG
+ioctl(COLD) COLD_DATA WRITE_LIFE_EXTREME
+extension list " "
+
+-- buffered io
+N/A COLD_DATA WRITE_LIFE_EXTREME
+N/A HOT_DATA WRITE_LIFE_SHORT
+N/A WARM_DATA WRITE_LIFE_NOT_SET
+
+-- direct io
+WRITE_LIFE_EXTREME COLD_DATA WRITE_LIFE_EXTREME
+WRITE_LIFE_SHORT HOT_DATA WRITE_LIFE_SHORT
+WRITE_LIFE_NOT_SET WARM_DATA WRITE_LIFE_NOT_SET
+WRITE_LIFE_NONE " WRITE_LIFE_NONE
+WRITE_LIFE_MEDIUM " WRITE_LIFE_MEDIUM
+WRITE_LIFE_LONG " WRITE_LIFE_LONG
+===================== ======================== ===================
+
Fallocate(2) Policy
-------------------
@@ -914,3 +948,47 @@ NVMe Zoned Namespace devices
can start before the zone-capacity and span across zone-capacity boundary.
Such spanning segments are also considered as usable segments. All blocks
past the zone-capacity are considered unusable in these segments.
+
+Device aliasing feature
+-----------------------
+
+f2fs can utilize a special file called a "device aliasing file." This file allows
+the entire storage device to be mapped with a single, large extent, not using
+the usual f2fs node structures. This mapped area is pinned and primarily intended
+for holding the space.
+
+Essentially, this mechanism allows a portion of the f2fs area to be temporarily
+reserved and used by another filesystem or for different purposes. Once that
+external usage is complete, the device aliasing file can be deleted, releasing
+the reserved space back to F2FS for its own use.
+
+<use-case>
+
+# ls /dev/vd*
+/dev/vdb (32GB) /dev/vdc (32GB)
+# mkfs.ext4 /dev/vdc
+# mkfs.f2fs -c /dev/vdc@vdc.file /dev/vdb
+# mount /dev/vdb /mnt/f2fs
+# ls -l /mnt/f2fs
+vdc.file
+# df -h
+/dev/vdb 64G 33G 32G 52% /mnt/f2fs
+
+# mount -o loop /dev/vdc /mnt/ext4
+# df -h
+/dev/vdb 64G 33G 32G 52% /mnt/f2fs
+/dev/loop7 32G 24K 30G 1% /mnt/ext4
+# umount /mnt/ext4
+
+# f2fs_io getflags /mnt/f2fs/vdc.file
+get a flag on /mnt/f2fs/vdc.file ret=0, flags=nocow(pinned),immutable
+# f2fs_io setflags noimmutable /mnt/f2fs/vdc.file
+get a flag on noimmutable ret=0, flags=800010
+set a flag on /mnt/f2fs/vdc.file ret=0, flags=noimmutable
+# rm /mnt/f2fs/vdc.file
+# df -h
+/dev/vdb 64G 753M 64G 2% /mnt/f2fs
+
+So, the key idea is, user can do any file operations on /dev/vdc, and
+reclaim the space after the use, while the space is counted as /data.
+That doesn't require modifying partition size and filesystem format.
diff --git a/Documentation/filesystems/fiemap.rst b/Documentation/filesystems/fiemap.rst
index 93fc96f760aa..23b3ed229e49 100644
--- a/Documentation/filesystems/fiemap.rst
+++ b/Documentation/filesystems/fiemap.rst
@@ -12,21 +12,10 @@ returns a list of extents.
Request Basics
--------------
-A fiemap request is encoded within struct fiemap::
-
- struct fiemap {
- __u64 fm_start; /* logical offset (inclusive) at
- * which to start mapping (in) */
- __u64 fm_length; /* logical length of mapping which
- * userspace cares about (in) */
- __u32 fm_flags; /* FIEMAP_FLAG_* flags for request (in/out) */
- __u32 fm_mapped_extents; /* number of extents that were
- * mapped (out) */
- __u32 fm_extent_count; /* size of fm_extents array (in) */
- __u32 fm_reserved;
- struct fiemap_extent fm_extents[0]; /* array of mapped extents (out) */
- };
+A fiemap request is encoded within struct fiemap:
+.. kernel-doc:: include/uapi/linux/fiemap.h
+ :identifiers: fiemap
fm_start, and fm_length specify the logical range within the file
which the process would like mappings for. Extents returned mirror
@@ -60,6 +49,8 @@ FIEMAP_FLAG_XATTR
If this flag is set, the extents returned will describe the inodes
extended attribute lookup tree, instead of its data tree.
+FIEMAP_FLAG_CACHE
+ This flag requests caching of the extents.
Extent Mapping
--------------
@@ -77,18 +68,10 @@ complete the requested range and will not have the FIEMAP_EXTENT_LAST
flag set (see the next section on extent flags).
Each extent is described by a single fiemap_extent structure as
-returned in fm_extents::
-
- struct fiemap_extent {
- __u64 fe_logical; /* logical offset in bytes for the start of
- * the extent */
- __u64 fe_physical; /* physical offset in bytes for the start
- * of the extent */
- __u64 fe_length; /* length in bytes for the extent */
- __u64 fe_reserved64[2];
- __u32 fe_flags; /* FIEMAP_EXTENT_* flags for this extent */
- __u32 fe_reserved[3];
- };
+returned in fm_extents:
+
+.. kernel-doc:: include/uapi/linux/fiemap.h
+ :identifiers: fiemap_extent
All offsets and lengths are in bytes and mirror those on disk. It is valid
for an extents logical offset to start before the request or its logical
@@ -175,6 +158,8 @@ FIEMAP_EXTENT_MERGED
userspace would be highly inefficient, the kernel will try to merge most
adjacent blocks into 'extents'.
+FIEMAP_EXTENT_SHARED
+ This flag is set to request that space be shared with other files.
VFS -> File System Implementation
---------------------------------
@@ -191,14 +176,10 @@ each discovered extent::
u64 len);
->fiemap is passed struct fiemap_extent_info which describes the
-fiemap request::
-
- struct fiemap_extent_info {
- unsigned int fi_flags; /* Flags as passed from user */
- unsigned int fi_extents_mapped; /* Number of mapped extents */
- unsigned int fi_extents_max; /* Size of fiemap_extent array */
- struct fiemap_extent *fi_extents_start; /* Start of fiemap_extent array */
- };
+fiemap request:
+
+.. kernel-doc:: include/linux/fiemap.h
+ :identifiers: fiemap_extent_info
It is intended that the file system should not need to access any of this
structure directly. Filesystem handlers should be tolerant to signals and return
diff --git a/Documentation/filesystems/fscrypt.rst b/Documentation/filesystems/fscrypt.rst
index 04eaab01314b..29e84d125e02 100644
--- a/Documentation/filesystems/fscrypt.rst
+++ b/Documentation/filesystems/fscrypt.rst
@@ -70,7 +70,7 @@ Online attacks
--------------
fscrypt (and storage encryption in general) can only provide limited
-protection, if any at all, against online attacks. In detail:
+protection against online attacks. In detail:
Side-channel attacks
~~~~~~~~~~~~~~~~~~~~
@@ -99,16 +99,23 @@ Therefore, any encryption-specific access control checks would merely
be enforced by kernel *code* and therefore would be largely redundant
with the wide variety of access control mechanisms already available.)
-Kernel memory compromise
-~~~~~~~~~~~~~~~~~~~~~~~~
+Read-only kernel memory compromise
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Unless `hardware-wrapped keys`_ are used, an attacker who gains the
+ability to read from arbitrary kernel memory, e.g. by mounting a
+physical attack or by exploiting a kernel security vulnerability, can
+compromise all fscrypt keys that are currently in-use. This also
+extends to cold boot attacks; if the system is suddenly powered off,
+keys the system was using may remain in memory for a short time.
-An attacker who compromises the system enough to read from arbitrary
-memory, e.g. by mounting a physical attack or by exploiting a kernel
-security vulnerability, can compromise all encryption keys that are
-currently in use.
+However, if hardware-wrapped keys are used, then the fscrypt master
+keys and file contents encryption keys (but not other types of fscrypt
+subkeys such as filenames encryption keys) are protected from
+compromises of arbitrary kernel memory.
-However, fscrypt allows encryption keys to be removed from the kernel,
-which may protect them from later compromise.
+In addition, fscrypt allows encryption keys to be removed from the
+kernel, which may protect them from later compromise.
In more detail, the FS_IOC_REMOVE_ENCRYPTION_KEY ioctl (or the
FS_IOC_REMOVE_ENCRYPTION_KEY_ALL_USERS ioctl) can wipe a master
@@ -137,14 +144,31 @@ However, these ioctls have some limitations:
- In general, decrypted contents and filenames in the kernel VFS
caches are freed but not wiped. Therefore, portions thereof may be
recoverable from freed memory, even after the corresponding key(s)
- were wiped. To partially solve this, you can set
- CONFIG_PAGE_POISONING=y in your kernel config and add page_poison=1
- to your kernel command line. However, this has a performance cost.
+ were wiped. To partially solve this, you can add init_on_free=1 to
+ your kernel command line. However, this has a performance cost.
- Secret keys might still exist in CPU registers, in crypto
accelerator hardware (if used by the crypto API to implement any of
the algorithms), or in other places not explicitly considered here.
+Full system compromise
+~~~~~~~~~~~~~~~~~~~~~~
+
+An attacker who gains "root" access and/or the ability to execute
+arbitrary kernel code can freely exfiltrate data that is protected by
+any in-use fscrypt keys. Thus, usually fscrypt provides no meaningful
+protection in this scenario. (Data that is protected by a key that is
+absent throughout the entire attack remains protected, modulo the
+limitations of key removal mentioned above in the case where the key
+was removed prior to the attack.)
+
+However, if `hardware-wrapped keys`_ are used, such attackers will be
+unable to exfiltrate the master keys or file contents keys in a form
+that will be usable after the system is powered off. This may be
+useful if the attacker is significantly time-limited and/or
+bandwidth-limited, so they can only exfiltrate some data and need to
+rely on a later offline attack to exfiltrate the rest of it.
+
Limitations of v1 policies
~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -171,6 +195,10 @@ policies on all new encrypted directories.
Key hierarchy
=============
+Note: this section assumes the use of raw keys rather than
+hardware-wrapped keys. The use of hardware-wrapped keys modifies the
+key hierarchy slightly. For details, see `Hardware-wrapped keys`_.
+
Master Keys
-----------
@@ -428,11 +456,8 @@ API, but the filenames mode still does.
- Mandatory:
- CONFIG_CRYPTO_ADIANTUM
- Recommended:
- - arm32: CONFIG_CRYPTO_CHACHA20_NEON
- arm32: CONFIG_CRYPTO_NHPOLY1305_NEON
- - arm64: CONFIG_CRYPTO_CHACHA20_NEON
- arm64: CONFIG_CRYPTO_NHPOLY1305_NEON
- - x86: CONFIG_CRYPTO_CHACHA20_X86_64
- x86: CONFIG_CRYPTO_NHPOLY1305_SSE2
- x86: CONFIG_CRYPTO_NHPOLY1305_AVX2
@@ -836,7 +861,9 @@ a pointer to struct fscrypt_add_key_arg, defined as follows::
struct fscrypt_key_specifier key_spec;
__u32 raw_size;
__u32 key_id;
- __u32 __reserved[8];
+ #define FSCRYPT_ADD_KEY_FLAG_HW_WRAPPED 0x00000001
+ __u32 flags;
+ __u32 __reserved[7];
__u8 raw[];
};
@@ -855,7 +882,7 @@ a pointer to struct fscrypt_add_key_arg, defined as follows::
struct fscrypt_provisioning_key_payload {
__u32 type;
- __u32 __reserved;
+ __u32 flags;
__u8 raw[];
};
@@ -883,24 +910,32 @@ as follows:
Alternatively, if ``key_id`` is nonzero, this field must be 0, since
in that case the size is implied by the specified Linux keyring key.
-- ``key_id`` is 0 if the raw key is given directly in the ``raw``
- field. Otherwise ``key_id`` is the ID of a Linux keyring key of
- type "fscrypt-provisioning" whose payload is
- struct fscrypt_provisioning_key_payload whose ``raw`` field contains
- the raw key and whose ``type`` field matches ``key_spec.type``.
- Since ``raw`` is variable-length, the total size of this key's
- payload must be ``sizeof(struct fscrypt_provisioning_key_payload)``
- plus the raw key size. The process must have Search permission on
- this key.
-
- Most users should leave this 0 and specify the raw key directly.
- The support for specifying a Linux keyring key is intended mainly to
+- ``key_id`` is 0 if the key is given directly in the ``raw`` field.
+ Otherwise ``key_id`` is the ID of a Linux keyring key of type
+ "fscrypt-provisioning" whose payload is struct
+ fscrypt_provisioning_key_payload whose ``raw`` field contains the
+ key, whose ``type`` field matches ``key_spec.type``, and whose
+ ``flags`` field matches ``flags``. Since ``raw`` is
+ variable-length, the total size of this key's payload must be
+ ``sizeof(struct fscrypt_provisioning_key_payload)`` plus the number
+ of key bytes. The process must have Search permission on this key.
+
+ Most users should leave this 0 and specify the key directly. The
+ support for specifying a Linux keyring key is intended mainly to
allow re-adding keys after a filesystem is unmounted and re-mounted,
- without having to store the raw keys in userspace memory.
+ without having to store the keys in userspace memory.
+
+- ``flags`` contains optional flags from ``<linux/fscrypt.h>``:
+
+ - FSCRYPT_ADD_KEY_FLAG_HW_WRAPPED: This denotes that the key is a
+ hardware-wrapped key. See `Hardware-wrapped keys`_. This flag
+ can't be used if FSCRYPT_KEY_SPEC_TYPE_DESCRIPTOR is used.
- ``raw`` is a variable-length field which must contain the actual
key, ``raw_size`` bytes long. Alternatively, if ``key_id`` is
- nonzero, then this field is unused.
+ nonzero, then this field is unused. Note that despite being named
+ ``raw``, if FSCRYPT_ADD_KEY_FLAG_HW_WRAPPED is specified then it
+ will contain a wrapped key, not a raw key.
For v2 policy keys, the kernel keeps track of which user (identified
by effective user ID) added the key, and only allows the key to be
@@ -912,8 +947,8 @@ prevent that other user from unexpectedly removing it. Therefore,
FS_IOC_ADD_ENCRYPTION_KEY may also be used to add a v2 policy key
*again*, even if it's already added by other user(s). In this case,
FS_IOC_ADD_ENCRYPTION_KEY will just install a claim to the key for the
-current user, rather than actually add the key again (but the raw key
-must still be provided, as a proof of knowledge).
+current user, rather than actually add the key again (but the key must
+still be provided, as a proof of knowledge).
FS_IOC_ADD_ENCRYPTION_KEY returns 0 if either the key or a claim to
the key was either added or already exists.
@@ -922,20 +957,23 @@ FS_IOC_ADD_ENCRYPTION_KEY can fail with the following errors:
- ``EACCES``: FSCRYPT_KEY_SPEC_TYPE_DESCRIPTOR was specified, but the
caller does not have the CAP_SYS_ADMIN capability in the initial
- user namespace; or the raw key was specified by Linux key ID but the
+ user namespace; or the key was specified by Linux key ID but the
process lacks Search permission on the key.
+- ``EBADMSG``: invalid hardware-wrapped key
- ``EDQUOT``: the key quota for this user would be exceeded by adding
the key
- ``EINVAL``: invalid key size or key specifier type, or reserved bits
were set
-- ``EKEYREJECTED``: the raw key was specified by Linux key ID, but the
- key has the wrong type
-- ``ENOKEY``: the raw key was specified by Linux key ID, but no key
- exists with that ID
+- ``EKEYREJECTED``: the key was specified by Linux key ID, but the key
+ has the wrong type
+- ``ENOKEY``: the key was specified by Linux key ID, but no key exists
+ with that ID
- ``ENOTTY``: this type of filesystem does not implement encryption
- ``EOPNOTSUPP``: the kernel was not configured with encryption
support for this filesystem, or the filesystem superblock has not
- had encryption enabled on it
+ had encryption enabled on it; or a hardware wrapped key was specified
+ but the filesystem does not support inline encryption or the hardware
+ does not support hardware-wrapped keys
Legacy method
~~~~~~~~~~~~~
@@ -998,9 +1036,8 @@ or removed by non-root users.
These ioctls don't work on keys that were added via the legacy
process-subscribed keyrings mechanism.
-Before using these ioctls, read the `Kernel memory compromise`_
-section for a discussion of the security goals and limitations of
-these ioctls.
+Before using these ioctls, read the `Online attacks`_ section for a
+discussion of the security goals and limitations of these ioctls.
FS_IOC_REMOVE_ENCRYPTION_KEY
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -1320,7 +1357,8 @@ inline encryption hardware doesn't have the needed crypto capabilities
(e.g. support for the needed encryption algorithm and data unit size)
and where blk-crypto-fallback is unusable. (For blk-crypto-fallback
to be usable, it must be enabled in the kernel configuration with
-CONFIG_BLK_INLINE_ENCRYPTION_FALLBACK=y.)
+CONFIG_BLK_INLINE_ENCRYPTION_FALLBACK=y, and the file must be
+protected by a raw key rather than a hardware-wrapped key.)
Currently fscrypt always uses the filesystem block size (which is
usually 4096 bytes) as the data unit size. Therefore, it can only use
@@ -1328,7 +1366,76 @@ inline encryption hardware that supports that data unit size.
Inline encryption doesn't affect the ciphertext or other aspects of
the on-disk format, so users may freely switch back and forth between
-using "inlinecrypt" and not using "inlinecrypt".
+using "inlinecrypt" and not using "inlinecrypt". An exception is that
+files that are protected by a hardware-wrapped key can only be
+encrypted/decrypted by the inline encryption hardware and therefore
+can only be accessed when the "inlinecrypt" mount option is used. For
+more information about hardware-wrapped keys, see below.
+
+Hardware-wrapped keys
+---------------------
+
+fscrypt supports using *hardware-wrapped keys* when the inline
+encryption hardware supports it. Such keys are only present in kernel
+memory in wrapped (encrypted) form; they can only be unwrapped
+(decrypted) by the inline encryption hardware and are temporally bound
+to the current boot. This prevents the keys from being compromised if
+kernel memory is leaked. This is done without limiting the number of
+keys that can be used and while still allowing the execution of
+cryptographic tasks that are tied to the same key but can't use inline
+encryption hardware, e.g. filenames encryption.
+
+Note that hardware-wrapped keys aren't specific to fscrypt; they are a
+block layer feature (part of *blk-crypto*). For more details about
+hardware-wrapped keys, see the block layer documentation at
+:ref:`Documentation/block/inline-encryption.rst
+<hardware_wrapped_keys>`. The rest of this section just focuses on
+the details of how fscrypt can use hardware-wrapped keys.
+
+fscrypt supports hardware-wrapped keys by allowing the fscrypt master
+keys to be hardware-wrapped keys as an alternative to raw keys. To
+add a hardware-wrapped key with `FS_IOC_ADD_ENCRYPTION_KEY`_,
+userspace must specify FSCRYPT_ADD_KEY_FLAG_HW_WRAPPED in the
+``flags`` field of struct fscrypt_add_key_arg and also in the
+``flags`` field of struct fscrypt_provisioning_key_payload when
+applicable. The key must be in ephemerally-wrapped form, not
+long-term wrapped form.
+
+Some limitations apply. First, files protected by a hardware-wrapped
+key are tied to the system's inline encryption hardware. Therefore
+they can only be accessed when the "inlinecrypt" mount option is used,
+and they can't be included in portable filesystem images. Second,
+currently the hardware-wrapped key support is only compatible with
+`IV_INO_LBLK_64 policies`_ and `IV_INO_LBLK_32 policies`_, as it
+assumes that there is just one file contents encryption key per
+fscrypt master key rather than one per file. Future work may address
+this limitation by passing per-file nonces down the storage stack to
+allow the hardware to derive per-file keys.
+
+Implementation-wise, to encrypt/decrypt the contents of files that are
+protected by a hardware-wrapped key, fscrypt uses blk-crypto,
+attaching the hardware-wrapped key to the bio crypt contexts. As is
+the case with raw keys, the block layer will program the key into a
+keyslot when it isn't already in one. However, when programming a
+hardware-wrapped key, the hardware doesn't program the given key
+directly into a keyslot but rather unwraps it (using the hardware's
+ephemeral wrapping key) and derives the inline encryption key from it.
+The inline encryption key is the key that actually gets programmed
+into a keyslot, and it is never exposed to software.
+
+However, fscrypt doesn't just do file contents encryption; it also
+uses its master keys to derive filenames encryption keys, key
+identifiers, and sometimes some more obscure types of subkeys such as
+dirhash keys. So even with file contents encryption out of the
+picture, fscrypt still needs a raw key to work with. To get such a
+key from a hardware-wrapped key, fscrypt asks the inline encryption
+hardware to derive a cryptographically isolated "software secret" from
+the hardware-wrapped key. fscrypt uses this "software secret" to key
+its KDF to derive all subkeys other than file contents keys.
+
+Note that this implies that the hardware-wrapped key feature only
+protects the file contents encryption keys. It doesn't protect other
+fscrypt subkeys such as filenames encryption keys.
Direct I/O support
==================
@@ -1413,7 +1520,7 @@ read the ciphertext into the page cache and decrypt it in-place. The
folio lock must be held until decryption has finished, to prevent the
folio from becoming visible to userspace prematurely.
-For the write path (->writepage()) of regular files, filesystems
+For the write path (->writepages()) of regular files, filesystems
cannot encrypt data in-place in the page cache, since the cached
plaintext must be preserved. Instead, filesystems must encrypt into a
temporary buffer or "bounce page", then write out the temporary
diff --git a/Documentation/filesystems/fsverity.rst b/Documentation/filesystems/fsverity.rst
index 13e4b18e5dbb..dacdbc1149e6 100644
--- a/Documentation/filesystems/fsverity.rst
+++ b/Documentation/filesystems/fsverity.rst
@@ -16,7 +16,7 @@ btrfs filesystems. Like fscrypt, not too much filesystem-specific
code is needed to support fs-verity.
fs-verity is similar to `dm-verity
-<https://www.kernel.org/doc/Documentation/device-mapper/verity.txt>`_
+<https://www.kernel.org/doc/Documentation/admin-guide/device-mapper/verity.rst>`_
but works on files rather than block devices. On regular files on
filesystems supporting fs-verity, userspace can execute an ioctl that
causes the filesystem to build a Merkle tree for the file and persist
@@ -86,6 +86,16 @@ authenticating fs-verity file hashes include:
signature in their "security.ima" extended attribute, as controlled
by the IMA policy. For more information, see the IMA documentation.
+- Integrity Policy Enforcement (IPE). IPE supports enforcing access
+ control decisions based on immutable security properties of files,
+ including those protected by fs-verity's built-in signatures.
+ "IPE policy" specifically allows for the authorization of fs-verity
+ files using properties ``fsverity_digest`` for identifying
+ files by their verity digest, and ``fsverity_signature`` to authorize
+ files with a verified fs-verity's built-in signature. For
+ details on configuring IPE policies and understanding its operational
+ modes, please refer to :doc:`IPE admin guide </admin-guide/LSM/ipe>`.
+
- Trusted userspace code in combination with `Built-in signature
verification`_. This approach should be used only with great care.
@@ -238,11 +248,17 @@ FS_IOC_READ_VERITY_METADATA
The FS_IOC_READ_VERITY_METADATA ioctl reads verity metadata from a
verity file. This ioctl is available since Linux v5.12.
-This ioctl allows writing a server program that takes a verity file
-and serves it to a client program, such that the client can do its own
-fs-verity compatible verification of the file. This only makes sense
-if the client doesn't trust the server and if the server needs to
-provide the storage for the client.
+This ioctl is useful for cases where the verity verification should be
+performed somewhere other than the currently running kernel.
+
+One example is a server program that takes a verity file and serves it
+to a client program, such that the client can do its own fs-verity
+compatible verification of the file. This only makes sense if the
+client doesn't trust the server and if the server needs to provide the
+storage for the client.
+
+Another example is copying verity metadata when creating filesystem
+images in userspace (such as with ``mkfs.ext4 -d``).
This is a fairly specialized use case, and most fs-verity users won't
need this ioctl.
@@ -457,7 +473,11 @@ Enabling this option adds the following:
On success, the ioctl persists the signature alongside the Merkle
tree. Then, any time the file is opened, the kernel verifies the
file's actual digest against this signature, using the certificates
- in the ".fs-verity" keyring.
+ in the ".fs-verity" keyring. This verification happens as long as the
+ file's signature exists, regardless of the state of the sysctl variable
+ "fs.verity.require_signatures" described in the next item. The IPE LSM
+ relies on this behavior to recognize and label fsverity files
+ that contain a verified built-in fsverity signature.
3. A new sysctl "fs.verity.require_signatures" is made available.
When set to 1, the kernel requires that all verity files have a
@@ -481,7 +501,7 @@ be carefully considered before using them:
- Builtin signature verification does *not* make the kernel enforce
that any files actually have fs-verity enabled. Thus, it is not a
- complete authentication policy. Currently, if it is used, the only
+ complete authentication policy. Currently, if it is used, one
way to complete the authentication policy is for trusted userspace
code to explicitly check whether files have fs-verity enabled with a
signature before they are accessed. (With
@@ -490,6 +510,15 @@ be carefully considered before using them:
could just store the signature alongside the file and verify it
itself using a cryptographic library, instead of using this feature.
+- Another approach is to utilize fs-verity builtin signature
+ verification in conjunction with the IPE LSM, which supports defining
+ a kernel-enforced, system-wide authentication policy that allows only
+ files with a verified fs-verity builtin signature to perform certain
+ operations, such as execution. Note that IPE doesn't require
+ fs.verity.require_signatures=1.
+ Please refer to :doc:`IPE admin guide </admin-guide/LSM/ipe>` for
+ more details.
+
- A file's builtin signature can only be set at the same time that
fs-verity is being enabled on the file. Changing or deleting the
builtin signature later requires re-creating the file.
diff --git a/Documentation/filesystems/fuse-io-uring.rst b/Documentation/filesystems/fuse-io-uring.rst
new file mode 100644
index 000000000000..d73dd0dbd238
--- /dev/null
+++ b/Documentation/filesystems/fuse-io-uring.rst
@@ -0,0 +1,99 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=======================================
+FUSE-over-io-uring design documentation
+=======================================
+
+This documentation covers basic details how the fuse
+kernel/userspace communication through io-uring is configured
+and works. For generic details about FUSE see fuse.rst.
+
+This document also covers the current interface, which is
+still in development and might change.
+
+Limitations
+===========
+As of now not all requests types are supported through io-uring, userspace
+is required to also handle requests through /dev/fuse after io-uring setup
+is complete. Specifically notifications (initiated from the daemon side)
+and interrupts.
+
+Fuse io-uring configuration
+===========================
+
+Fuse kernel requests are queued through the classical /dev/fuse
+read/write interface - until io-uring setup is complete.
+
+In order to set up fuse-over-io-uring fuse-server (user-space)
+needs to submit SQEs (opcode = IORING_OP_URING_CMD) to the /dev/fuse
+connection file descriptor. Initial submit is with the sub command
+FUSE_URING_REQ_REGISTER, which will just register entries to be
+available in the kernel.
+
+Once at least one entry per queue is submitted, kernel starts
+to enqueue to ring queues.
+Note, every CPU core has its own fuse-io-uring queue.
+Userspace handles the CQE/fuse-request and submits the result as
+subcommand FUSE_URING_REQ_COMMIT_AND_FETCH - kernel completes
+the requests and also marks the entry available again. If there are
+pending requests waiting the request will be immediately submitted
+to the daemon again.
+
+Initial SQE
+-----------::
+
+ | | FUSE filesystem daemon
+ | |
+ | | >io_uring_submit()
+ | | IORING_OP_URING_CMD /
+ | | FUSE_URING_CMD_REGISTER
+ | | [wait cqe]
+ | | >io_uring_wait_cqe() or
+ | | >io_uring_submit_and_wait()
+ | |
+ | >fuse_uring_cmd() |
+ | >fuse_uring_register() |
+
+
+Sending requests with CQEs
+--------------------------::
+
+ | | FUSE filesystem daemon
+ | | [waiting for CQEs]
+ | "rm /mnt/fuse/file" |
+ | |
+ | >sys_unlink() |
+ | >fuse_unlink() |
+ | [allocate request] |
+ | >fuse_send_one() |
+ | ... |
+ | >fuse_uring_queue_fuse_req |
+ | [queue request on fg queue] |
+ | >fuse_uring_add_req_to_ring_ent() |
+ | ... |
+ | >fuse_uring_copy_to_ring() |
+ | >io_uring_cmd_done() |
+ | >request_wait_answer() |
+ | [sleep on req->waitq] |
+ | | [receives and handles CQE]
+ | | [submit result and fetch next]
+ | | >io_uring_submit()
+ | | IORING_OP_URING_CMD/
+ | | FUSE_URING_CMD_COMMIT_AND_FETCH
+ | >fuse_uring_cmd() |
+ | >fuse_uring_commit_fetch() |
+ | >fuse_uring_commit() |
+ | >fuse_uring_copy_from_ring() |
+ | [ copy the result to the fuse req] |
+ | >fuse_uring_req_end() |
+ | >fuse_request_end() |
+ | [wake up req->waitq] |
+ | >fuse_uring_next_fuse_req |
+ | [wait or handle next req] |
+ | |
+ | [req->waitq woken up] |
+ | <fuse_unlink() |
+ | <sys_unlink() |
+
+
+
diff --git a/Documentation/filesystems/gfs2-glocks.rst b/Documentation/filesystems/gfs2-glocks.rst
index 8a5842929b60..adc0d4c4d979 100644
--- a/Documentation/filesystems/gfs2-glocks.rst
+++ b/Documentation/filesystems/gfs2-glocks.rst
@@ -40,14 +40,14 @@ shared lock mode, SH. In GFS2 the DF mode is used exclusively for direct I/O
operations. The glocks are basically a lock plus some routines which deal
with cache management. The following rules apply for the cache:
-========== ========== ============== ========== ==============
-Glock mode Cache data Cache Metadata Dirty Data Dirty Metadata
-========== ========== ============== ========== ==============
- UN No No No No
- SH Yes Yes No No
- DF No Yes No No
- EX Yes Yes Yes Yes
-========== ========== ============== ========== ==============
+========== ============== ========== ========== ==============
+Glock mode Cache Metadata Cache data Dirty Data Dirty Metadata
+========== ============== ========== ========== ==============
+ UN No No No No
+ DF Yes No No No
+ SH Yes Yes No No
+ EX Yes Yes Yes Yes
+========== ============== ========== ========== ==============
These rules are implemented using the various glock operations which
are defined for each type of glock. Not all types of glocks use
@@ -55,23 +55,22 @@ all the modes. Only inode glocks use the DF mode for example.
Table of glock operations and per type constants:
-============= =============================================================
+============== =============================================================
Field Purpose
-============= =============================================================
-go_xmote_th Called before remote state change (e.g. to sync dirty data)
+============== =============================================================
+go_sync Called before remote state change (e.g. to sync dirty data)
go_xmote_bh Called after remote state change (e.g. to refill cache)
go_inval Called if remote state change requires invalidating the cache
-go_demote_ok Returns boolean value of whether its ok to demote a glock
- (e.g. checks timeout, and that there is no cached data)
-go_lock Called for the first local holder of a lock
-go_unlock Called on the final local unlock of a lock
+go_instantiate Called when a glock has been acquired
+go_held Called every time a glock holder is acquired
go_dump Called to print content of object for debugfs file, or on
error to dump glock to the log.
-go_type The type of the glock, ``LM_TYPE_*``
go_callback Called if the DLM sends a callback to drop this lock
+go_unlocked Called when a glock is unlocked (dlm_unlock())
+go_type The type of the glock, ``LM_TYPE_*``
go_flags GLOF_ASPACE is set, if the glock has an address space
associated with it
-============= =============================================================
+============== =============================================================
The minimum hold time for each lock is the time after a remote lock
grant for which we ignore remote demote requests. This is in order to
@@ -82,26 +81,24 @@ to by multiple nodes. By delaying the demotion in response to a
remote callback, that gives the userspace program time to make
some progress before the pages are unmapped.
-There is a plan to try and remove the go_lock and go_unlock callbacks
-if possible, in order to try and speed up the fast path though the locking.
-Also, eventually we hope to make the glock "EX" mode locally shared
-such that any local locking will be done with the i_mutex as required
-rather than via the glock.
+Eventually, we hope to make the glock "EX" mode locally shared such that any
+local locking will be done with the i_mutex as required rather than via the
+glock.
Locking rules for glock operations:
-============= ====================== =============================
+============== ====================== =============================
Operation GLF_LOCK bit lock held gl_lockref.lock spinlock held
-============= ====================== =============================
-go_xmote_th Yes No
+============== ====================== =============================
+go_sync Yes No
go_xmote_bh Yes No
go_inval Yes No
-go_demote_ok Sometimes Yes
-go_lock Yes No
-go_unlock Yes No
+go_instantiate No No
+go_held No No
go_dump Sometimes Yes
go_callback Sometimes (N/A) Yes
-============= ====================== =============================
+go_unlocked Yes No
+============== ====================== =============================
.. Note::
diff --git a/Documentation/filesystems/idmappings.rst b/Documentation/filesystems/idmappings.rst
index ac0af679e61e..2a206129f828 100644
--- a/Documentation/filesystems/idmappings.rst
+++ b/Documentation/filesystems/idmappings.rst
@@ -63,8 +63,8 @@ what id ``k11000`` corresponds to in the second or third idmapping. The
straightforward algorithm to use is to apply the inverse of the first idmapping,
mapping ``k11000`` up to ``u1000``. Afterwards, we can map ``u1000`` down using
either the second idmapping mapping or third idmapping mapping. The second
-idmapping would map ``u1000`` down to ``21000``. The third idmapping would map
-``u1000`` down to ``u31000``.
+idmapping would map ``u1000`` down to ``k21000``. The third idmapping would map
+``u1000`` down to ``k31000``.
If we were given the same task for the following three idmappings::
@@ -821,7 +821,7 @@ the same idmapping to the mount. We now perform three steps:
/* Map the userspace id down into a kernel id in the filesystem's idmapping. */
make_kuid(u0:k20000:r10000, u1000) = k21000
-2. Verify that the caller's kernel ids can be mapped to userspace ids in the
+3. Verify that the caller's kernel ids can be mapped to userspace ids in the
filesystem's idmapping::
from_kuid(u0:k20000:r10000, k21000) = u1000
@@ -854,10 +854,10 @@ The same translation algorithm works with the third example.
/* Map the userspace id down into a kernel id in the filesystem's idmapping. */
make_kuid(u0:k0:r4294967295, u1000) = k1000
-2. Verify that the caller's kernel ids can be mapped to userspace ids in the
+3. Verify that the caller's kernel ids can be mapped to userspace ids in the
filesystem's idmapping::
- from_kuid(u0:k0:r4294967295, k21000) = u1000
+ from_kuid(u0:k0:r4294967295, k1000) = u1000
So the ownership that lands on disk will be ``u1000``.
@@ -994,7 +994,7 @@ from above:::
/* Map the userspace id down into a kernel id in the filesystem's idmapping. */
make_kuid(u0:k0:r4294967295, u1000) = k1000
-2. Verify that the caller's filesystem ids can be mapped to userspace ids in the
+3. Verify that the caller's filesystem ids can be mapped to userspace ids in the
filesystem's idmapping::
from_kuid(u0:k0:r4294967295, k1000) = u1000
diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst
index 1f9b4c905a6a..32618512a965 100644
--- a/Documentation/filesystems/index.rst
+++ b/Documentation/filesystems/index.rst
@@ -29,11 +29,13 @@ algorithms work.
fiemap
files
locks
+ multigrain-ts
mount_api
quota
seq_file
sharedsubtree
idmappings
+ iomap/index
automount-support
@@ -50,6 +52,7 @@ filesystem implementations.
.. toctree::
:maxdepth: 2
+ buffer
journalling
fscrypt
fsverity
@@ -95,6 +98,7 @@ Documentation for filesystem implementations.
hpfs
fuse
fuse-io
+ fuse-io-uring
inotify
isofs
nilfs2
@@ -109,12 +113,12 @@ Documentation for filesystem implementations.
qnx6
ramfs-rootfs-initramfs
relay
+ resctrl
romfs
smb/index
spufs/index
squashfs
sysfs
- sysv-fs
tmpfs
ubifs
ubifs-authentication
diff --git a/Documentation/filesystems/iomap/design.rst b/Documentation/filesystems/iomap/design.rst
new file mode 100644
index 000000000000..f2df9b6df988
--- /dev/null
+++ b/Documentation/filesystems/iomap/design.rst
@@ -0,0 +1,462 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. _iomap_design:
+
+..
+ Dumb style notes to maintain the author's sanity:
+ Please try to start sentences on separate lines so that
+ sentence changes don't bleed colors in diff.
+ Heading decorations are documented in sphinx.rst.
+
+==============
+Library Design
+==============
+
+.. contents:: Table of Contents
+ :local:
+
+Introduction
+============
+
+iomap is a filesystem library for handling common file operations.
+The library has two layers:
+
+ 1. A lower layer that provides an iterator over ranges of file offsets.
+ This layer tries to obtain mappings of each file ranges to storage
+ from the filesystem, but the storage information is not necessarily
+ required.
+
+ 2. An upper layer that acts upon the space mappings provided by the
+ lower layer iterator.
+
+The iteration can involve mappings of file's logical offset ranges to
+physical extents, but the storage layer information is not necessarily
+required, e.g. for walking cached file information.
+The library exports various APIs for implementing file operations such
+as:
+
+ * Pagecache reads and writes
+ * Folio write faults to the pagecache
+ * Writeback of dirty folios
+ * Direct I/O reads and writes
+ * fsdax I/O reads, writes, loads, and stores
+ * FIEMAP
+ * lseek ``SEEK_DATA`` and ``SEEK_HOLE``
+ * swapfile activation
+
+This origins of this library is the file I/O path that XFS once used; it
+has now been extended to cover several other operations.
+
+Who Should Read This?
+=====================
+
+The target audience for this document are filesystem, storage, and
+pagecache programmers and code reviewers.
+
+If you are working on PCI, machine architectures, or device drivers, you
+are most likely in the wrong place.
+
+How Is This Better?
+===================
+
+Unlike the classic Linux I/O model which breaks file I/O into small
+units (generally memory pages or blocks) and looks up space mappings on
+the basis of that unit, the iomap model asks the filesystem for the
+largest space mappings that it can create for a given file operation and
+initiates operations on that basis.
+This strategy improves the filesystem's visibility into the size of the
+operation being performed, which enables it to combat fragmentation with
+larger space allocations when possible.
+Larger space mappings improve runtime performance by amortizing the cost
+of mapping function calls into the filesystem across a larger amount of
+data.
+
+At a high level, an iomap operation `looks like this
+<https://lore.kernel.org/all/ZGbVaewzcCysclPt@dread.disaster.area/>`_:
+
+1. For each byte in the operation range...
+
+ 1. Obtain a space mapping via ``->iomap_begin``
+
+ 2. For each sub-unit of work...
+
+ 1. Revalidate the mapping and go back to (1) above, if necessary.
+ So far only the pagecache operations need to do this.
+
+ 2. Do the work
+
+ 3. Increment operation cursor
+
+ 4. Release the mapping via ``->iomap_end``, if necessary
+
+Each iomap operation will be covered in more detail below.
+This library was covered previously by an `LWN article
+<https://lwn.net/Articles/935934/>`_ and a `KernelNewbies page
+<https://kernelnewbies.org/KernelProjects/iomap>`_.
+
+The goal of this document is to provide a brief discussion of the
+design and capabilities of iomap, followed by a more detailed catalog
+of the interfaces presented by iomap.
+If you change iomap, please update this design document.
+
+File Range Iterator
+===================
+
+Definitions
+-----------
+
+ * **buffer head**: Shattered remnants of the old buffer cache.
+
+ * ``fsblock``: The block size of a file, also known as ``i_blocksize``.
+
+ * ``i_rwsem``: The VFS ``struct inode`` rwsemaphore.
+ Processes hold this in shared mode to read file state and contents.
+ Some filesystems may allow shared mode for writes.
+ Processes often hold this in exclusive mode to change file state and
+ contents.
+
+ * ``invalidate_lock``: The pagecache ``struct address_space``
+ rwsemaphore that protects against folio insertion and removal for
+ filesystems that support punching out folios below EOF.
+ Processes wishing to insert folios must hold this lock in shared
+ mode to prevent removal, though concurrent insertion is allowed.
+ Processes wishing to remove folios must hold this lock in exclusive
+ mode to prevent insertions.
+ Concurrent removals are not allowed.
+
+ * ``dax_read_lock``: The RCU read lock that dax takes to prevent a
+ device pre-shutdown hook from returning before other threads have
+ released resources.
+
+ * **filesystem mapping lock**: This synchronization primitive is
+ internal to the filesystem and must protect the file mapping data
+ from updates while a mapping is being sampled.
+ The filesystem author must determine how this coordination should
+ happen; it does not need to be an actual lock.
+
+ * **iomap internal operation lock**: This is a general term for
+ synchronization primitives that iomap functions take while holding a
+ mapping.
+ A specific example would be taking the folio lock while reading or
+ writing the pagecache.
+
+ * **pure overwrite**: A write operation that does not require any
+ metadata or zeroing operations to perform during either submission
+ or completion.
+ This implies that the filesystem must have already allocated space
+ on disk as ``IOMAP_MAPPED`` and the filesystem must not place any
+ constraints on IO alignment or size.
+ The only constraints on I/O alignment are device level (minimum I/O
+ size and alignment, typically sector size).
+
+``struct iomap``
+----------------
+
+The filesystem communicates to the iomap iterator the mapping of
+byte ranges of a file to byte ranges of a storage device with the
+structure below:
+
+.. code-block:: c
+
+ struct iomap {
+ u64 addr;
+ loff_t offset;
+ u64 length;
+ u16 type;
+ u16 flags;
+ struct block_device *bdev;
+ struct dax_device *dax_dev;
+ void *inline_data;
+ void *private;
+ const struct iomap_folio_ops *folio_ops;
+ u64 validity_cookie;
+ };
+
+The fields are as follows:
+
+ * ``offset`` and ``length`` describe the range of file offsets, in
+ bytes, covered by this mapping.
+ These fields must always be set by the filesystem.
+
+ * ``type`` describes the type of the space mapping:
+
+ * **IOMAP_HOLE**: No storage has been allocated.
+ This type must never be returned in response to an ``IOMAP_WRITE``
+ operation because writes must allocate and map space, and return
+ the mapping.
+ The ``addr`` field must be set to ``IOMAP_NULL_ADDR``.
+ iomap does not support writing (whether via pagecache or direct
+ I/O) to a hole.
+
+ * **IOMAP_DELALLOC**: A promise to allocate space at a later time
+ ("delayed allocation").
+ If the filesystem returns IOMAP_F_NEW here and the write fails, the
+ ``->iomap_end`` function must delete the reservation.
+ The ``addr`` field must be set to ``IOMAP_NULL_ADDR``.
+
+ * **IOMAP_MAPPED**: The file range maps to specific space on the
+ storage device.
+ The device is returned in ``bdev`` or ``dax_dev``.
+ The device address, in bytes, is returned via ``addr``.
+
+ * **IOMAP_UNWRITTEN**: The file range maps to specific space on the
+ storage device, but the space has not yet been initialized.
+ The device is returned in ``bdev`` or ``dax_dev``.
+ The device address, in bytes, is returned via ``addr``.
+ Reads from this type of mapping will return zeroes to the caller.
+ For a write or writeback operation, the ioend should update the
+ mapping to MAPPED.
+ Refer to the sections about ioends for more details.
+
+ * **IOMAP_INLINE**: The file range maps to the memory buffer
+ specified by ``inline_data``.
+ For write operation, the ``->iomap_end`` function presumably
+ handles persisting the data.
+ The ``addr`` field must be set to ``IOMAP_NULL_ADDR``.
+
+ * ``flags`` describe the status of the space mapping.
+ These flags should be set by the filesystem in ``->iomap_begin``:
+
+ * **IOMAP_F_NEW**: The space under the mapping is newly allocated.
+ Areas that will not be written to must be zeroed.
+ If a write fails and the mapping is a space reservation, the
+ reservation must be deleted.
+
+ * **IOMAP_F_DIRTY**: The inode will have uncommitted metadata needed
+ to access any data written.
+ fdatasync is required to commit these changes to persistent
+ storage.
+ This needs to take into account metadata changes that *may* be made
+ at I/O completion, such as file size updates from direct I/O.
+
+ * **IOMAP_F_SHARED**: The space under the mapping is shared.
+ Copy on write is necessary to avoid corrupting other file data.
+
+ * **IOMAP_F_BUFFER_HEAD**: This mapping requires the use of buffer
+ heads for pagecache operations.
+ Do not add more uses of this.
+
+ * **IOMAP_F_MERGED**: Multiple contiguous block mappings were
+ coalesced into this single mapping.
+ This is only useful for FIEMAP.
+
+ * **IOMAP_F_XATTR**: The mapping is for extended attribute data, not
+ regular file data.
+ This is only useful for FIEMAP.
+
+ * **IOMAP_F_BOUNDARY**: This indicates I/O and its completion must not be
+ merged with any other I/O or completion. Filesystems must use this when
+ submitting I/O to devices that cannot handle I/O crossing certain LBAs
+ (e.g. ZNS devices). This flag applies only to buffered I/O writeback; all
+ other functions ignore it.
+
+ * **IOMAP_F_PRIVATE**: This flag is reserved for filesystem private use.
+
+ * **IOMAP_F_ANON_WRITE**: Indicates that (write) I/O does not have a target
+ block assigned to it yet and the file system will do that in the bio
+ submission handler, splitting the I/O as needed.
+
+ * **IOMAP_F_ATOMIC_BIO**: This indicates write I/O must be submitted with the
+ ``REQ_ATOMIC`` flag set in the bio. Filesystems need to set this flag to
+ inform iomap that the write I/O operation requires torn-write protection
+ based on HW-offload mechanism. They must also ensure that mapping updates
+ upon the completion of the I/O must be performed in a single metadata
+ update.
+
+ These flags can be set by iomap itself during file operations.
+ The filesystem should supply an ``->iomap_end`` function if it needs
+ to observe these flags:
+
+ * **IOMAP_F_SIZE_CHANGED**: The file size has changed as a result of
+ using this mapping.
+
+ * **IOMAP_F_STALE**: The mapping was found to be stale.
+ iomap will call ``->iomap_end`` on this mapping and then
+ ``->iomap_begin`` to obtain a new mapping.
+
+ Currently, these flags are only set by pagecache operations.
+
+ * ``addr`` describes the device address, in bytes.
+
+ * ``bdev`` describes the block device for this mapping.
+ This only needs to be set for mapped or unwritten operations.
+
+ * ``dax_dev`` describes the DAX device for this mapping.
+ This only needs to be set for mapped or unwritten operations, and
+ only for a fsdax operation.
+
+ * ``inline_data`` points to a memory buffer for I/O involving
+ ``IOMAP_INLINE`` mappings.
+ This value is ignored for all other mapping types.
+
+ * ``private`` is a pointer to `filesystem-private information
+ <https://lore.kernel.org/all/20180619164137.13720-7-hch@lst.de/>`_.
+ This value will be passed unchanged to ``->iomap_end``.
+
+ * ``folio_ops`` will be covered in the section on pagecache operations.
+
+ * ``validity_cookie`` is a magic freshness value set by the filesystem
+ that should be used to detect stale mappings.
+ For pagecache operations this is critical for correct operation
+ because page faults can occur, which implies that filesystem locks
+ should not be held between ``->iomap_begin`` and ``->iomap_end``.
+ Filesystems with completely static mappings need not set this value.
+ Only pagecache operations revalidate mappings; see the section about
+ ``iomap_valid`` for details.
+
+``struct iomap_ops``
+--------------------
+
+Every iomap function requires the filesystem to pass an operations
+structure to obtain a mapping and (optionally) to release the mapping:
+
+.. code-block:: c
+
+ struct iomap_ops {
+ int (*iomap_begin)(struct inode *inode, loff_t pos, loff_t length,
+ unsigned flags, struct iomap *iomap,
+ struct iomap *srcmap);
+
+ int (*iomap_end)(struct inode *inode, loff_t pos, loff_t length,
+ ssize_t written, unsigned flags,
+ struct iomap *iomap);
+ };
+
+``->iomap_begin``
+~~~~~~~~~~~~~~~~~
+
+iomap operations call ``->iomap_begin`` to obtain one file mapping for
+the range of bytes specified by ``pos`` and ``length`` for the file
+``inode``.
+This mapping should be returned through the ``iomap`` pointer.
+The mapping must cover at least the first byte of the supplied file
+range, but it does not need to cover the entire requested range.
+
+Each iomap operation describes the requested operation through the
+``flags`` argument.
+The exact value of ``flags`` will be documented in the
+operation-specific sections below.
+These flags can, at least in principle, apply generally to iomap
+operations:
+
+ * ``IOMAP_DIRECT`` is set when the caller wishes to issue file I/O to
+ block storage.
+
+ * ``IOMAP_DAX`` is set when the caller wishes to issue file I/O to
+ memory-like storage.
+
+ * ``IOMAP_NOWAIT`` is set when the caller wishes to perform a best
+ effort attempt to avoid any operation that would result in blocking
+ the submitting task.
+ This is similar in intent to ``O_NONBLOCK`` for network APIs - it is
+ intended for asynchronous applications to keep doing other work
+ instead of waiting for the specific unavailable filesystem resource
+ to become available.
+ Filesystems implementing ``IOMAP_NOWAIT`` semantics need to use
+ trylock algorithms.
+ They need to be able to satisfy the entire I/O request range with a
+ single iomap mapping.
+ They need to avoid reading or writing metadata synchronously.
+ They need to avoid blocking memory allocations.
+ They need to avoid waiting on transaction reservations to allow
+ modifications to take place.
+ They probably should not be allocating new space.
+ And so on.
+ If there is any doubt in the filesystem developer's mind as to
+ whether any specific ``IOMAP_NOWAIT`` operation may end up blocking,
+ then they should return ``-EAGAIN`` as early as possible rather than
+ start the operation and force the submitting task to block.
+ ``IOMAP_NOWAIT`` is often set on behalf of ``IOCB_NOWAIT`` or
+ ``RWF_NOWAIT``.
+
+ * ``IOMAP_DONTCACHE`` is set when the caller wishes to perform a
+ buffered file I/O and would like the kernel to drop the pagecache
+ after the I/O completes, if it isn't already being used by another
+ thread.
+
+If it is necessary to read existing file contents from a `different
+<https://lore.kernel.org/all/20191008071527.29304-9-hch@lst.de/>`_
+device or address range on a device, the filesystem should return that
+information via ``srcmap``.
+Only pagecache and fsdax operations support reading from one mapping and
+writing to another.
+
+``->iomap_end``
+~~~~~~~~~~~~~~~
+
+After the operation completes, the ``->iomap_end`` function, if present,
+is called to signal that iomap is finished with a mapping.
+Typically, implementations will use this function to tear down any
+context that were set up in ``->iomap_begin``.
+For example, a write might wish to commit the reservations for the bytes
+that were operated upon and unreserve any space that was not operated
+upon.
+``written`` might be zero if no bytes were touched.
+``flags`` will contain the same value passed to ``->iomap_begin``.
+iomap ops for reads are not likely to need to supply this function.
+
+Both functions should return a negative errno code on error, or zero on
+success.
+
+Preparing for File Operations
+=============================
+
+iomap only handles mapping and I/O.
+Filesystems must still call out to the VFS to check input parameters
+and file state before initiating an I/O operation.
+It does not handle obtaining filesystem freeze protection, updating of
+timestamps, stripping privileges, or access control.
+
+Locking Hierarchy
+=================
+
+iomap requires that filesystems supply their own locking model.
+There are three categories of synchronization primitives, as far as
+iomap is concerned:
+
+ * The **upper** level primitive is provided by the filesystem to
+ coordinate access to different iomap operations.
+ The exact primitive is specific to the filesystem and operation,
+ but is often a VFS inode, pagecache invalidation, or folio lock.
+ For example, a filesystem might take ``i_rwsem`` before calling
+ ``iomap_file_buffered_write`` and ``iomap_file_unshare`` to prevent
+ these two file operations from clobbering each other.
+ Pagecache writeback may lock a folio to prevent other threads from
+ accessing the folio until writeback is underway.
+
+ * The **lower** level primitive is taken by the filesystem in the
+ ``->iomap_begin`` and ``->iomap_end`` functions to coordinate
+ access to the file space mapping information.
+ The fields of the iomap object should be filled out while holding
+ this primitive.
+ The upper level synchronization primitive, if any, remains held
+ while acquiring the lower level synchronization primitive.
+ For example, XFS takes ``ILOCK_EXCL`` and ext4 takes ``i_data_sem``
+ while sampling mappings.
+ Filesystems with immutable mapping information may not require
+ synchronization here.
+
+ * The **operation** primitive is taken by an iomap operation to
+ coordinate access to its own internal data structures.
+ The upper level synchronization primitive, if any, remains held
+ while acquiring this primitive.
+ The lower level primitive is not held while acquiring this
+ primitive.
+ For example, pagecache write operations will obtain a file mapping,
+ then grab and lock a folio to copy new contents.
+ It may also lock an internal folio state object to update metadata.
+
+The exact locking requirements are specific to the filesystem; for
+certain operations, some of these locks can be elided.
+All further mentions of locking are *recommendations*, not mandates.
+Each filesystem author must figure out the locking for themself.
+
+Bugs and Limitations
+====================
+
+ * No support for fscrypt.
+ * No support for compression.
+ * No support for fsverity yet.
+ * Strong assumptions that IO should work the way it does on XFS.
+ * Does iomap *actually* work for non-regular file data?
+
+Patches welcome!
diff --git a/Documentation/filesystems/iomap/index.rst b/Documentation/filesystems/iomap/index.rst
new file mode 100644
index 000000000000..3c6a52440250
--- /dev/null
+++ b/Documentation/filesystems/iomap/index.rst
@@ -0,0 +1,13 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=======================
+VFS iomap Documentation
+=======================
+
+.. toctree::
+ :maxdepth: 2
+ :numbered:
+
+ design
+ operations
+ porting
diff --git a/Documentation/filesystems/iomap/operations.rst b/Documentation/filesystems/iomap/operations.rst
new file mode 100644
index 000000000000..3b628e370d88
--- /dev/null
+++ b/Documentation/filesystems/iomap/operations.rst
@@ -0,0 +1,744 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. _iomap_operations:
+
+..
+ Dumb style notes to maintain the author's sanity:
+ Please try to start sentences on separate lines so that
+ sentence changes don't bleed colors in diff.
+ Heading decorations are documented in sphinx.rst.
+
+=========================
+Supported File Operations
+=========================
+
+.. contents:: Table of Contents
+ :local:
+
+Below are a discussion of the high level file operations that iomap
+implements.
+
+Buffered I/O
+============
+
+Buffered I/O is the default file I/O path in Linux.
+File contents are cached in memory ("pagecache") to satisfy reads and
+writes.
+Dirty cache will be written back to disk at some point that can be
+forced via ``fsync`` and variants.
+
+iomap implements nearly all the folio and pagecache management that
+filesystems have to implement themselves under the legacy I/O model.
+This means that the filesystem need not know the details of allocating,
+mapping, managing uptodate and dirty state, or writeback of pagecache
+folios.
+Under the legacy I/O model, this was managed very inefficiently with
+linked lists of buffer heads instead of the per-folio bitmaps that iomap
+uses.
+Unless the filesystem explicitly opts in to buffer heads, they will not
+be used, which makes buffered I/O much more efficient, and the pagecache
+maintainer much happier.
+
+``struct address_space_operations``
+-----------------------------------
+
+The following iomap functions can be referenced directly from the
+address space operations structure:
+
+ * ``iomap_dirty_folio``
+ * ``iomap_release_folio``
+ * ``iomap_invalidate_folio``
+ * ``iomap_is_partially_uptodate``
+
+The following address space operations can be wrapped easily:
+
+ * ``read_folio``
+ * ``readahead``
+ * ``writepages``
+ * ``bmap``
+ * ``swap_activate``
+
+``struct iomap_folio_ops``
+--------------------------
+
+The ``->iomap_begin`` function for pagecache operations may set the
+``struct iomap::folio_ops`` field to an ops structure to override
+default behaviors of iomap:
+
+.. code-block:: c
+
+ struct iomap_folio_ops {
+ struct folio *(*get_folio)(struct iomap_iter *iter, loff_t pos,
+ unsigned len);
+ void (*put_folio)(struct inode *inode, loff_t pos, unsigned copied,
+ struct folio *folio);
+ bool (*iomap_valid)(struct inode *inode, const struct iomap *iomap);
+ };
+
+iomap calls these functions:
+
+ - ``get_folio``: Called to allocate and return an active reference to
+ a locked folio prior to starting a write.
+ If this function is not provided, iomap will call
+ ``iomap_get_folio``.
+ This could be used to `set up per-folio filesystem state
+ <https://lore.kernel.org/all/20190429220934.10415-5-agruenba@redhat.com/>`_
+ for a write.
+
+ - ``put_folio``: Called to unlock and put a folio after a pagecache
+ operation completes.
+ If this function is not provided, iomap will ``folio_unlock`` and
+ ``folio_put`` on its own.
+ This could be used to `commit per-folio filesystem state
+ <https://lore.kernel.org/all/20180619164137.13720-6-hch@lst.de/>`_
+ that was set up by ``->get_folio``.
+
+ - ``iomap_valid``: The filesystem may not hold locks between
+ ``->iomap_begin`` and ``->iomap_end`` because pagecache operations
+ can take folio locks, fault on userspace pages, initiate writeback
+ for memory reclamation, or engage in other time-consuming actions.
+ If a file's space mapping data are mutable, it is possible that the
+ mapping for a particular pagecache folio can `change in the time it
+ takes
+ <https://lore.kernel.org/all/20221123055812.747923-8-david@fromorbit.com/>`_
+ to allocate, install, and lock that folio.
+
+ For the pagecache, races can happen if writeback doesn't take
+ ``i_rwsem`` or ``invalidate_lock`` and updates mapping information.
+ Races can also happen if the filesystem allows concurrent writes.
+ For such files, the mapping *must* be revalidated after the folio
+ lock has been taken so that iomap can manage the folio correctly.
+
+ fsdax does not need this revalidation because there's no writeback
+ and no support for unwritten extents.
+
+ Filesystems subject to this kind of race must provide a
+ ``->iomap_valid`` function to decide if the mapping is still valid.
+ If the mapping is not valid, the mapping will be sampled again.
+
+ To support making the validity decision, the filesystem's
+ ``->iomap_begin`` function may set ``struct iomap::validity_cookie``
+ at the same time that it populates the other iomap fields.
+ A simple validation cookie implementation is a sequence counter.
+ If the filesystem bumps the sequence counter every time it modifies
+ the inode's extent map, it can be placed in the ``struct
+ iomap::validity_cookie`` during ``->iomap_begin``.
+ If the value in the cookie is found to be different to the value
+ the filesystem holds when the mapping is passed back to
+ ``->iomap_valid``, then the iomap should considered stale and the
+ validation failed.
+
+These ``struct kiocb`` flags are significant for buffered I/O with iomap:
+
+ * ``IOCB_NOWAIT``: Turns on ``IOMAP_NOWAIT``.
+
+ * ``IOCB_DONTCACHE``: Turns on ``IOMAP_DONTCACHE``.
+
+Internal per-Folio State
+------------------------
+
+If the fsblock size matches the size of a pagecache folio, it is assumed
+that all disk I/O operations will operate on the entire folio.
+The uptodate (memory contents are at least as new as what's on disk) and
+dirty (memory contents are newer than what's on disk) status of the
+folio are all that's needed for this case.
+
+If the fsblock size is less than the size of a pagecache folio, iomap
+tracks the per-fsblock uptodate and dirty state itself.
+This enables iomap to handle both "bs < ps" `filesystems
+<https://lore.kernel.org/all/20230725122932.144426-1-ritesh.list@gmail.com/>`_
+and large folios in the pagecache.
+
+iomap internally tracks two state bits per fsblock:
+
+ * ``uptodate``: iomap will try to keep folios fully up to date.
+ If there are read(ahead) errors, those fsblocks will not be marked
+ uptodate.
+ The folio itself will be marked uptodate when all fsblocks within the
+ folio are uptodate.
+
+ * ``dirty``: iomap will set the per-block dirty state when programs
+ write to the file.
+ The folio itself will be marked dirty when any fsblock within the
+ folio is dirty.
+
+iomap also tracks the amount of read and write disk IOs that are in
+flight.
+This structure is much lighter weight than ``struct buffer_head``
+because there is only one per folio, and the per-fsblock overhead is two
+bits vs. 104 bytes.
+
+Filesystems wishing to turn on large folios in the pagecache should call
+``mapping_set_large_folios`` when initializing the incore inode.
+
+Buffered Readahead and Reads
+----------------------------
+
+The ``iomap_readahead`` function initiates readahead to the pagecache.
+The ``iomap_read_folio`` function reads one folio's worth of data into
+the pagecache.
+The ``flags`` argument to ``->iomap_begin`` will be set to zero.
+The pagecache takes whatever locks it needs before calling the
+filesystem.
+
+Buffered Writes
+---------------
+
+The ``iomap_file_buffered_write`` function writes an ``iocb`` to the
+pagecache.
+``IOMAP_WRITE`` or ``IOMAP_WRITE`` | ``IOMAP_NOWAIT`` will be passed as
+the ``flags`` argument to ``->iomap_begin``.
+Callers commonly take ``i_rwsem`` in either shared or exclusive mode
+before calling this function.
+
+mmap Write Faults
+~~~~~~~~~~~~~~~~~
+
+The ``iomap_page_mkwrite`` function handles a write fault to a folio in
+the pagecache.
+``IOMAP_WRITE | IOMAP_FAULT`` will be passed as the ``flags`` argument
+to ``->iomap_begin``.
+Callers commonly take the mmap ``invalidate_lock`` in shared or
+exclusive mode before calling this function.
+
+Buffered Write Failures
+~~~~~~~~~~~~~~~~~~~~~~~
+
+After a short write to the pagecache, the areas not written will not
+become marked dirty.
+The filesystem must arrange to `cancel
+<https://lore.kernel.org/all/20221123055812.747923-6-david@fromorbit.com/>`_
+such `reservations
+<https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@dread.disaster.area/>`_
+because writeback will not consume the reservation.
+The ``iomap_write_delalloc_release`` can be called from a
+``->iomap_end`` function to find all the clean areas of the folios
+caching a fresh (``IOMAP_F_NEW``) delalloc mapping.
+It takes the ``invalidate_lock``.
+
+The filesystem must supply a function ``punch`` to be called for
+each file range in this state.
+This function must *only* remove delayed allocation reservations, in
+case another thread racing with the current thread writes successfully
+to the same region and triggers writeback to flush the dirty data out to
+disk.
+
+Zeroing for File Operations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Filesystems can call ``iomap_zero_range`` to perform zeroing of the
+pagecache for non-truncation file operations that are not aligned to
+the fsblock size.
+``IOMAP_ZERO`` will be passed as the ``flags`` argument to
+``->iomap_begin``.
+Callers typically hold ``i_rwsem`` and ``invalidate_lock`` in exclusive
+mode before calling this function.
+
+Unsharing Reflinked File Data
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Filesystems can call ``iomap_file_unshare`` to force a file sharing
+storage with another file to preemptively copy the shared data to newly
+allocate storage.
+``IOMAP_WRITE | IOMAP_UNSHARE`` will be passed as the ``flags`` argument
+to ``->iomap_begin``.
+Callers typically hold ``i_rwsem`` and ``invalidate_lock`` in exclusive
+mode before calling this function.
+
+Truncation
+----------
+
+Filesystems can call ``iomap_truncate_page`` to zero the bytes in the
+pagecache from EOF to the end of the fsblock during a file truncation
+operation.
+``truncate_setsize`` or ``truncate_pagecache`` will take care of
+everything after the EOF block.
+``IOMAP_ZERO`` will be passed as the ``flags`` argument to
+``->iomap_begin``.
+Callers typically hold ``i_rwsem`` and ``invalidate_lock`` in exclusive
+mode before calling this function.
+
+Pagecache Writeback
+-------------------
+
+Filesystems can call ``iomap_writepages`` to respond to a request to
+write dirty pagecache folios to disk.
+The ``mapping`` and ``wbc`` parameters should be passed unchanged.
+The ``wpc`` pointer should be allocated by the filesystem and must
+be initialized to zero.
+
+The pagecache will lock each folio before trying to schedule it for
+writeback.
+It does not lock ``i_rwsem`` or ``invalidate_lock``.
+
+The dirty bit will be cleared for all folios run through the
+``->map_blocks`` machinery described below even if the writeback fails.
+This is to prevent dirty folio clots when storage devices fail; an
+``-EIO`` is recorded for userspace to collect via ``fsync``.
+
+The ``ops`` structure must be specified and is as follows:
+
+``struct iomap_writeback_ops``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+ struct iomap_writeback_ops {
+ int (*map_blocks)(struct iomap_writepage_ctx *wpc, struct inode *inode,
+ loff_t offset, unsigned len);
+ int (*submit_ioend)(struct iomap_writepage_ctx *wpc, int status);
+ void (*discard_folio)(struct folio *folio, loff_t pos);
+ };
+
+The fields are as follows:
+
+ - ``map_blocks``: Sets ``wpc->iomap`` to the space mapping of the file
+ range (in bytes) given by ``offset`` and ``len``.
+ iomap calls this function for each dirty fs block in each dirty folio,
+ though it will `reuse mappings
+ <https://lore.kernel.org/all/20231207072710.176093-15-hch@lst.de/>`_
+ for runs of contiguous dirty fsblocks within a folio.
+ Do not return ``IOMAP_INLINE`` mappings here; the ``->iomap_end``
+ function must deal with persisting written data.
+ Do not return ``IOMAP_DELALLOC`` mappings here; iomap currently
+ requires mapping to allocated space.
+ Filesystems can skip a potentially expensive mapping lookup if the
+ mappings have not changed.
+ This revalidation must be open-coded by the filesystem; it is
+ unclear if ``iomap::validity_cookie`` can be reused for this
+ purpose.
+ This function must be supplied by the filesystem.
+
+ - ``submit_ioend``: Allows the file systems to hook into writeback bio
+ submission.
+ This might include pre-write space accounting updates, or installing
+ a custom ``->bi_end_io`` function for internal purposes, such as
+ deferring the ioend completion to a workqueue to run metadata update
+ transactions from process context before submitting the bio.
+ This function is optional.
+
+ - ``discard_folio``: iomap calls this function after ``->map_blocks``
+ fails to schedule I/O for any part of a dirty folio.
+ The function should throw away any reservations that may have been
+ made for the write.
+ The folio will be marked clean and an ``-EIO`` recorded in the
+ pagecache.
+ Filesystems can use this callback to `remove
+ <https://lore.kernel.org/all/20201029163313.1766967-1-bfoster@redhat.com/>`_
+ delalloc reservations to avoid having delalloc reservations for
+ clean pagecache.
+ This function is optional.
+
+Pagecache Writeback Completion
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+To handle the bookkeeping that must happen after disk I/O for writeback
+completes, iomap creates chains of ``struct iomap_ioend`` objects that
+wrap the ``bio`` that is used to write pagecache data to disk.
+By default, iomap finishes writeback ioends by clearing the writeback
+bit on the folios attached to the ``ioend``.
+If the write failed, it will also set the error bits on the folios and
+the address space.
+This can happen in interrupt or process context, depending on the
+storage device.
+
+Filesystems that need to update internal bookkeeping (e.g. unwritten
+extent conversions) should provide a ``->submit_ioend`` function to
+set ``struct iomap_end::bio::bi_end_io`` to its own function.
+This function should call ``iomap_finish_ioends`` after finishing its
+own work (e.g. unwritten extent conversion).
+
+Some filesystems may wish to `amortize the cost of running metadata
+transactions
+<https://lore.kernel.org/all/20220120034733.221737-1-david@fromorbit.com/>`_
+for post-writeback updates by batching them.
+They may also require transactions to run from process context, which
+implies punting batches to a workqueue.
+iomap ioends contain a ``list_head`` to enable batching.
+
+Given a batch of ioends, iomap has a few helpers to assist with
+amortization:
+
+ * ``iomap_sort_ioends``: Sort all the ioends in the list by file
+ offset.
+
+ * ``iomap_ioend_try_merge``: Given an ioend that is not in any list and
+ a separate list of sorted ioends, merge as many of the ioends from
+ the head of the list into the given ioend.
+ ioends can only be merged if the file range and storage addresses are
+ contiguous; the unwritten and shared status are the same; and the
+ write I/O outcome is the same.
+ The merged ioends become their own list.
+
+ * ``iomap_finish_ioends``: Finish an ioend that possibly has other
+ ioends linked to it.
+
+Direct I/O
+==========
+
+In Linux, direct I/O is defined as file I/O that is issued directly to
+storage, bypassing the pagecache.
+The ``iomap_dio_rw`` function implements O_DIRECT (direct I/O) reads and
+writes for files.
+
+.. code-block:: c
+
+ ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
+ const struct iomap_ops *ops,
+ const struct iomap_dio_ops *dops,
+ unsigned int dio_flags, void *private,
+ size_t done_before);
+
+The filesystem can provide the ``dops`` parameter if it needs to perform
+extra work before or after the I/O is issued to storage.
+The ``done_before`` parameter tells the how much of the request has
+already been transferred.
+It is used to continue a request asynchronously when `part of the
+request
+<https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c03098d4b9ad76bca2966a8769dcfe59f7f85103>`_
+has already been completed synchronously.
+
+The ``done_before`` parameter should be set if writes for the ``iocb``
+have been initiated prior to the call.
+The direction of the I/O is determined from the ``iocb`` passed in.
+
+The ``dio_flags`` argument can be set to any combination of the
+following values:
+
+ * ``IOMAP_DIO_FORCE_WAIT``: Wait for the I/O to complete even if the
+ kiocb is not synchronous.
+
+ * ``IOMAP_DIO_OVERWRITE_ONLY``: Perform a pure overwrite for this range
+ or fail with ``-EAGAIN``.
+ This can be used by filesystems with complex unaligned I/O
+ write paths to provide an optimised fast path for unaligned writes.
+ If a pure overwrite can be performed, then serialisation against
+ other I/Os to the same filesystem block(s) is unnecessary as there is
+ no risk of stale data exposure or data loss.
+ If a pure overwrite cannot be performed, then the filesystem can
+ perform the serialisation steps needed to provide exclusive access
+ to the unaligned I/O range so that it can perform allocation and
+ sub-block zeroing safely.
+ Filesystems can use this flag to try to reduce locking contention,
+ but a lot of `detailed checking
+ <https://lore.kernel.org/linux-ext4/20230314130759.642710-1-bfoster@redhat.com/>`_
+ is required to do it `correctly
+ <https://lore.kernel.org/linux-ext4/20230810165559.946222-1-bfoster@redhat.com/>`_.
+
+ * ``IOMAP_DIO_PARTIAL``: If a page fault occurs, return whatever
+ progress has already been made.
+ The caller may deal with the page fault and retry the operation.
+ If the caller decides to retry the operation, it should pass the
+ accumulated return values of all previous calls as the
+ ``done_before`` parameter to the next call.
+
+These ``struct kiocb`` flags are significant for direct I/O with iomap:
+
+ * ``IOCB_NOWAIT``: Turns on ``IOMAP_NOWAIT``.
+
+ * ``IOCB_SYNC``: Ensure that the device has persisted data to disk
+ before completing the call.
+ In the case of pure overwrites, the I/O may be issued with FUA
+ enabled.
+
+ * ``IOCB_HIPRI``: Poll for I/O completion instead of waiting for an
+ interrupt.
+ Only meaningful for asynchronous I/O, and only if the entire I/O can
+ be issued as a single ``struct bio``.
+
+ * ``IOCB_DIO_CALLER_COMP``: Try to run I/O completion from the caller's
+ process context.
+ See ``linux/fs.h`` for more details.
+
+Filesystems should call ``iomap_dio_rw`` from ``->read_iter`` and
+``->write_iter``, and set ``FMODE_CAN_ODIRECT`` in the ``->open``
+function for the file.
+They should not set ``->direct_IO``, which is deprecated.
+
+If a filesystem wishes to perform its own work before direct I/O
+completion, it should call ``__iomap_dio_rw``.
+If its return value is not an error pointer or a NULL pointer, the
+filesystem should pass the return value to ``iomap_dio_complete`` after
+finishing its internal work.
+
+Return Values
+-------------
+
+``iomap_dio_rw`` can return one of the following:
+
+ * A non-negative number of bytes transferred.
+
+ * ``-ENOTBLK``: Fall back to buffered I/O.
+ iomap itself will return this value if it cannot invalidate the page
+ cache before issuing the I/O to storage.
+ The ``->iomap_begin`` or ``->iomap_end`` functions may also return
+ this value.
+
+ * ``-EIOCBQUEUED``: The asynchronous direct I/O request has been
+ queued and will be completed separately.
+
+ * Any of the other negative error codes.
+
+Direct Reads
+------------
+
+A direct I/O read initiates a read I/O from the storage device to the
+caller's buffer.
+Dirty parts of the pagecache are flushed to storage before initiating
+the read io.
+The ``flags`` value for ``->iomap_begin`` will be ``IOMAP_DIRECT`` with
+any combination of the following enhancements:
+
+ * ``IOMAP_NOWAIT``, as defined previously.
+
+Callers commonly hold ``i_rwsem`` in shared mode before calling this
+function.
+
+Direct Writes
+-------------
+
+A direct I/O write initiates a write I/O to the storage device from the
+caller's buffer.
+Dirty parts of the pagecache are flushed to storage before initiating
+the write io.
+The pagecache is invalidated both before and after the write io.
+The ``flags`` value for ``->iomap_begin`` will be ``IOMAP_DIRECT |
+IOMAP_WRITE`` with any combination of the following enhancements:
+
+ * ``IOMAP_NOWAIT``, as defined previously.
+
+ * ``IOMAP_OVERWRITE_ONLY``: Allocating blocks and zeroing partial
+ blocks is not allowed.
+ The entire file range must map to a single written or unwritten
+ extent.
+ The file I/O range must be aligned to the filesystem block size
+ if the mapping is unwritten and the filesystem cannot handle zeroing
+ the unaligned regions without exposing stale contents.
+
+ * ``IOMAP_ATOMIC``: This write is being issued with torn-write
+ protection.
+ Torn-write protection may be provided based on HW-offload or by a
+ software mechanism provided by the filesystem.
+
+ For HW-offload based support, only a single bio can be created for the
+ write, and the write must not be split into multiple I/O requests, i.e.
+ flag REQ_ATOMIC must be set.
+ The file range to write must be aligned to satisfy the requirements
+ of both the filesystem and the underlying block device's atomic
+ commit capabilities.
+ If filesystem metadata updates are required (e.g. unwritten extent
+ conversion or copy-on-write), all updates for the entire file range
+ must be committed atomically as well.
+ Untorn-writes may be longer than a single file block. In all cases,
+ the mapping start disk block must have at least the same alignment as
+ the write offset.
+ The filesystems must set IOMAP_F_ATOMIC_BIO to inform iomap core of an
+ untorn-write based on HW-offload.
+
+ For untorn-writes based on a software mechanism provided by the
+ filesystem, all the disk block alignment and single bio restrictions
+ which apply for HW-offload based untorn-writes do not apply.
+ The mechanism would typically be used as a fallback for when
+ HW-offload based untorn-writes may not be issued, e.g. the range of the
+ write covers multiple extents, meaning that it is not possible to issue
+ a single bio.
+ All filesystem metadata updates for the entire file range must be
+ committed atomically as well.
+
+Callers commonly hold ``i_rwsem`` in shared or exclusive mode before
+calling this function.
+
+``struct iomap_dio_ops:``
+-------------------------
+.. code-block:: c
+
+ struct iomap_dio_ops {
+ void (*submit_io)(const struct iomap_iter *iter, struct bio *bio,
+ loff_t file_offset);
+ int (*end_io)(struct kiocb *iocb, ssize_t size, int error,
+ unsigned flags);
+ struct bio_set *bio_set;
+ };
+
+The fields of this structure are as follows:
+
+ - ``submit_io``: iomap calls this function when it has constructed a
+ ``struct bio`` object for the I/O requested, and wishes to submit it
+ to the block device.
+ If no function is provided, ``submit_bio`` will be called directly.
+ Filesystems that would like to perform additional work before (e.g.
+ data replication for btrfs) should implement this function.
+
+ - ``end_io``: This is called after the ``struct bio`` completes.
+ This function should perform post-write conversions of unwritten
+ extent mappings, handle write failures, etc.
+ The ``flags`` argument may be set to a combination of the following:
+
+ * ``IOMAP_DIO_UNWRITTEN``: The mapping was unwritten, so the ioend
+ should mark the extent as written.
+
+ * ``IOMAP_DIO_COW``: Writing to the space in the mapping required a
+ copy on write operation, so the ioend should switch mappings.
+
+ - ``bio_set``: This allows the filesystem to provide a custom bio_set
+ for allocating direct I/O bios.
+ This enables filesystems to `stash additional per-bio information
+ <https://lore.kernel.org/all/20220505201115.937837-3-hch@lst.de/>`_
+ for private use.
+ If this field is NULL, generic ``struct bio`` objects will be used.
+
+Filesystems that want to perform extra work after an I/O completion
+should set a custom ``->bi_end_io`` function via ``->submit_io``.
+Afterwards, the custom endio function must call
+``iomap_dio_bio_end_io`` to finish the direct I/O.
+
+DAX I/O
+=======
+
+Some storage devices can be directly mapped as memory.
+These devices support a new access mode known as "fsdax" that allows
+loads and stores through the CPU and memory controller.
+
+fsdax Reads
+-----------
+
+A fsdax read performs a memcpy from storage device to the caller's
+buffer.
+The ``flags`` value for ``->iomap_begin`` will be ``IOMAP_DAX`` with any
+combination of the following enhancements:
+
+ * ``IOMAP_NOWAIT``, as defined previously.
+
+Callers commonly hold ``i_rwsem`` in shared mode before calling this
+function.
+
+fsdax Writes
+------------
+
+A fsdax write initiates a memcpy to the storage device from the caller's
+buffer.
+The ``flags`` value for ``->iomap_begin`` will be ``IOMAP_DAX |
+IOMAP_WRITE`` with any combination of the following enhancements:
+
+ * ``IOMAP_NOWAIT``, as defined previously.
+
+ * ``IOMAP_OVERWRITE_ONLY``: The caller requires a pure overwrite to be
+ performed from this mapping.
+ This requires the filesystem extent mapping to already exist as an
+ ``IOMAP_MAPPED`` type and span the entire range of the write I/O
+ request.
+ If the filesystem cannot map this request in a way that allows the
+ iomap infrastructure to perform a pure overwrite, it must fail the
+ mapping operation with ``-EAGAIN``.
+
+Callers commonly hold ``i_rwsem`` in exclusive mode before calling this
+function.
+
+fsdax mmap Faults
+~~~~~~~~~~~~~~~~~
+
+The ``dax_iomap_fault`` function handles read and write faults to fsdax
+storage.
+For a read fault, ``IOMAP_DAX | IOMAP_FAULT`` will be passed as the
+``flags`` argument to ``->iomap_begin``.
+For a write fault, ``IOMAP_DAX | IOMAP_FAULT | IOMAP_WRITE`` will be
+passed as the ``flags`` argument to ``->iomap_begin``.
+
+Callers commonly hold the same locks as they do to call their iomap
+pagecache counterparts.
+
+fsdax Truncation, fallocate, and Unsharing
+------------------------------------------
+
+For fsdax files, the following functions are provided to replace their
+iomap pagecache I/O counterparts.
+The ``flags`` argument to ``->iomap_begin`` are the same as the
+pagecache counterparts, with ``IOMAP_DAX`` added.
+
+ * ``dax_file_unshare``
+ * ``dax_zero_range``
+ * ``dax_truncate_page``
+
+Callers commonly hold the same locks as they do to call their iomap
+pagecache counterparts.
+
+fsdax Deduplication
+-------------------
+
+Filesystems implementing the ``FIDEDUPERANGE`` ioctl must call the
+``dax_remap_file_range_prep`` function with their own iomap read ops.
+
+Seeking Files
+=============
+
+iomap implements the two iterating whence modes of the ``llseek`` system
+call.
+
+SEEK_DATA
+---------
+
+The ``iomap_seek_data`` function implements the SEEK_DATA "whence" value
+for llseek.
+``IOMAP_REPORT`` will be passed as the ``flags`` argument to
+``->iomap_begin``.
+
+For unwritten mappings, the pagecache will be searched.
+Regions of the pagecache with a folio mapped and uptodate fsblocks
+within those folios will be reported as data areas.
+
+Callers commonly hold ``i_rwsem`` in shared mode before calling this
+function.
+
+SEEK_HOLE
+---------
+
+The ``iomap_seek_hole`` function implements the SEEK_HOLE "whence" value
+for llseek.
+``IOMAP_REPORT`` will be passed as the ``flags`` argument to
+``->iomap_begin``.
+
+For unwritten mappings, the pagecache will be searched.
+Regions of the pagecache with no folio mapped, or a !uptodate fsblock
+within a folio will be reported as sparse hole areas.
+
+Callers commonly hold ``i_rwsem`` in shared mode before calling this
+function.
+
+Swap File Activation
+====================
+
+The ``iomap_swapfile_activate`` function finds all the base-page aligned
+regions in a file and sets them up as swap space.
+The file will be ``fsync()``'d before activation.
+``IOMAP_REPORT`` will be passed as the ``flags`` argument to
+``->iomap_begin``.
+All mappings must be mapped or unwritten; cannot be dirty or shared, and
+cannot span multiple block devices.
+Callers must hold ``i_rwsem`` in exclusive mode; this is already
+provided by ``swapon``.
+
+File Space Mapping Reporting
+============================
+
+iomap implements two of the file space mapping system calls.
+
+FS_IOC_FIEMAP
+-------------
+
+The ``iomap_fiemap`` function exports file extent mappings to userspace
+in the format specified by the ``FS_IOC_FIEMAP`` ioctl.
+``IOMAP_REPORT`` will be passed as the ``flags`` argument to
+``->iomap_begin``.
+Callers commonly hold ``i_rwsem`` in shared mode before calling this
+function.
+
+FIBMAP (deprecated)
+-------------------
+
+``iomap_bmap`` implements FIBMAP.
+The calling conventions are the same as for FIEMAP.
+This function is only provided to maintain compatibility for filesystems
+that implemented FIBMAP prior to conversion.
+This ioctl is deprecated; do **not** add a FIBMAP implementation to
+filesystems that do not have it.
+Callers should probably hold ``i_rwsem`` in shared mode before calling
+this function, but this is unclear.
diff --git a/Documentation/filesystems/iomap/porting.rst b/Documentation/filesystems/iomap/porting.rst
new file mode 100644
index 000000000000..3d49a32c0fff
--- /dev/null
+++ b/Documentation/filesystems/iomap/porting.rst
@@ -0,0 +1,120 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. _iomap_porting:
+
+..
+ Dumb style notes to maintain the author's sanity:
+ Please try to start sentences on separate lines so that
+ sentence changes don't bleed colors in diff.
+ Heading decorations are documented in sphinx.rst.
+
+=======================
+Porting Your Filesystem
+=======================
+
+.. contents:: Table of Contents
+ :local:
+
+Why Convert?
+============
+
+There are several reasons to convert a filesystem to iomap:
+
+ 1. The classic Linux I/O path is not terribly efficient.
+ Pagecache operations lock a single base page at a time and then call
+ into the filesystem to return a mapping for only that page.
+ Direct I/O operations build I/O requests a single file block at a
+ time.
+ This worked well enough for direct/indirect-mapped filesystems such
+ as ext2, but is very inefficient for extent-based filesystems such
+ as XFS.
+
+ 2. Large folios are only supported via iomap; there are no plans to
+ convert the old buffer_head path to use them.
+
+ 3. Direct access to storage on memory-like devices (fsdax) is only
+ supported via iomap.
+
+ 4. Lower maintenance overhead for individual filesystem maintainers.
+ iomap handles common pagecache related operations itself, such as
+ allocating, instantiating, locking, and unlocking of folios.
+ No ->write_begin(), ->write_end() or direct_IO
+ address_space_operations are required to be implemented by
+ filesystem using iomap.
+
+How Do I Convert a Filesystem?
+==============================
+
+First, add ``#include <linux/iomap.h>`` from your source code and add
+``select FS_IOMAP`` to your filesystem's Kconfig option.
+Build the kernel, run fstests with the ``-g all`` option across a wide
+variety of your filesystem's supported configurations to build a
+baseline of which tests pass and which ones fail.
+
+The recommended approach is first to implement ``->iomap_begin`` (and
+``->iomap_end`` if necessary) to allow iomap to obtain a read-only
+mapping of a file range.
+In most cases, this is a relatively trivial conversion of the existing
+``get_block()`` function for read-only mappings.
+``FS_IOC_FIEMAP`` is a good first target because it is trivial to
+implement support for it and then to determine that the extent map
+iteration is correct from userspace.
+If FIEMAP is returning the correct information, it's a good sign that
+other read-only mapping operations will do the right thing.
+
+Next, modify the filesystem's ``get_block(create = false)``
+implementation to use the new ``->iomap_begin`` implementation to map
+file space for selected read operations.
+Hide behind a debugging knob the ability to switch on the iomap mapping
+functions for selected call paths.
+It is necessary to write some code to fill out the bufferhead-based
+mapping information from the ``iomap`` structure, but the new functions
+can be tested without needing to implement any iomap APIs.
+
+Once the read-only functions are working like this, convert each high
+level file operation one by one to use iomap native APIs instead of
+going through ``get_block()``.
+Done one at a time, regressions should be self evident.
+You *do* have a regression test baseline for fstests, right?
+It is suggested to convert swap file activation, ``SEEK_DATA``, and
+``SEEK_HOLE`` before tackling the I/O paths.
+A likely complexity at this point will be converting the buffered read
+I/O path because of bufferheads.
+The buffered read I/O paths doesn't need to be converted yet, though the
+direct I/O read path should be converted in this phase.
+
+At this point, you should look over your ``->iomap_begin`` function.
+If it switches between large blocks of code based on dispatching of the
+``flags`` argument, you should consider breaking it up into
+per-operation iomap ops with smaller, more cohesive functions.
+XFS is a good example of this.
+
+The next thing to do is implement ``get_blocks(create == true)``
+functionality in the ``->iomap_begin``/``->iomap_end`` methods.
+It is strongly recommended to create separate mapping functions and
+iomap ops for write operations.
+Then convert the direct I/O write path to iomap, and start running fsx
+w/ DIO enabled in earnest on filesystem.
+This will flush out lots of data integrity corner case bugs that the new
+write mapping implementation introduces.
+
+Now, convert any remaining file operations to call the iomap functions.
+This will get the entire filesystem using the new mapping functions, and
+they should largely be debugged and working correctly after this step.
+
+Most likely at this point, the buffered read and write paths will still
+need to be converted.
+The mapping functions should all work correctly, so all that needs to be
+done is rewriting all the code that interfaces with bufferheads to
+interface with iomap and folios.
+It is much easier first to get regular file I/O (without any fancy
+features like fscrypt, fsverity, compression, or data=journaling)
+converted to use iomap.
+Some of those fancy features (fscrypt and compression) aren't
+implemented yet in iomap.
+For unjournalled filesystems that use the pagecache for symbolic links
+and directories, you might also try converting their handling to iomap.
+
+The rest is left as an exercise for the reader, as it will be different
+for every filesystem.
+If you encounter problems, email the people and lists in
+``get_maintainers.pl`` for help.
diff --git a/Documentation/filesystems/journalling.rst b/Documentation/filesystems/journalling.rst
index e18f90ffc6fd..863e93e623f7 100644
--- a/Documentation/filesystems/journalling.rst
+++ b/Documentation/filesystems/journalling.rst
@@ -111,9 +111,7 @@ a callback function when the transaction is finally committed to disk,
so that you can do some of your own management. You ask the journalling
layer for calling the callback by simply setting
``journal->j_commit_callback`` function pointer and that function is
-called after each transaction commit. You can also use
-``transaction->t_private_list`` for attaching entries to a transaction
-that need processing when the transaction commits.
+called after each transaction commit.
JBD2 also provides a way to block all transaction updates via
jbd2_journal_lock_updates() /
@@ -137,7 +135,7 @@ Fast commits
JBD2 to also allows you to perform file-system specific delta commits known as
fast commits. In order to use fast commits, you will need to set following
-callbacks that perform correspodning work:
+callbacks that perform corresponding work:
`journal->j_fc_cleanup_cb`: Cleanup function called after every full commit and
fast commit.
@@ -149,7 +147,7 @@ File system is free to perform fast commits as and when it wants as long as it
gets permission from JBD2 to do so by calling the function
:c:func:`jbd2_fc_begin_commit()`. Once a fast commit is done, the client
file system should tell JBD2 about it by calling
-:c:func:`jbd2_fc_end_commit()`. If file system wants JBD2 to perform a full
+:c:func:`jbd2_fc_end_commit()`. If the file system wants JBD2 to perform a full
commit immediately after stopping the fast commit it can do so by calling
:c:func:`jbd2_fc_end_commit_fallback()`. This is useful if fast commit operation
fails for some reason and the only way to guarantee consistency is for JBD2 to
@@ -199,7 +197,7 @@ Journal Level
.. kernel-doc:: fs/jbd2/recovery.c
:internal:
-Transasction Level
+Transaction Level
~~~~~~~~~~~~~~~~~~
.. kernel-doc:: fs/jbd2/transaction.c
diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst
index e664061ed55d..2e567e341c3b 100644
--- a/Documentation/filesystems/locking.rst
+++ b/Documentation/filesystems/locking.rst
@@ -17,7 +17,8 @@ dentry_operations
prototypes::
- int (*d_revalidate)(struct dentry *, unsigned int);
+ int (*d_revalidate)(struct inode *, const struct qstr *,
+ struct dentry *, unsigned int);
int (*d_weak_revalidate)(struct dentry *, unsigned int);
int (*d_hash)(const struct dentry *, struct qstr *);
int (*d_compare)(const struct dentry *,
@@ -30,6 +31,8 @@ prototypes::
struct vfsmount *(*d_automount)(struct path *path);
int (*d_manage)(const struct path *, bool);
struct dentry *(*d_real)(struct dentry *, enum d_real_type type);
+ bool (*d_unalias_trylock)(const struct dentry *);
+ void (*d_unalias_unlock)(const struct dentry *);
locking rules:
@@ -49,6 +52,8 @@ d_dname: no no no no
d_automount: no no yes no
d_manage: no no yes (ref-walk) maybe
d_real no no yes no
+d_unalias_trylock yes no no no
+d_unalias_unlock yes no no no
================== =========== ======== ============== ========
inode_operations
@@ -61,7 +66,7 @@ prototypes::
int (*link) (struct dentry *,struct inode *,struct dentry *);
int (*unlink) (struct inode *,struct dentry *);
int (*symlink) (struct mnt_idmap *, struct inode *,struct dentry *,const char *);
- int (*mkdir) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t);
+ struct dentry *(*mkdir) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t);
int (*rmdir) (struct inode *,struct dentry *);
int (*mknod) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t,dev_t);
int (*rename) (struct mnt_idmap *, struct inode *, struct dentry *,
@@ -244,17 +249,16 @@ address_space_operations
========================
prototypes::
- int (*writepage)(struct page *page, struct writeback_control *wbc);
int (*read_folio)(struct file *, struct folio *);
int (*writepages)(struct address_space *, struct writeback_control *);
bool (*dirty_folio)(struct address_space *, struct folio *folio);
void (*readahead)(struct readahead_control *);
int (*write_begin)(struct file *, struct address_space *mapping,
loff_t pos, unsigned len,
- struct page **pagep, void **fsdata);
+ struct folio **foliop, void **fsdata);
int (*write_end)(struct file *, struct address_space *mapping,
loff_t pos, unsigned len, unsigned copied,
- struct page *page, void *fsdata);
+ struct folio *folio, void *fsdata);
sector_t (*bmap)(struct address_space *, sector_t);
void (*invalidate_folio) (struct folio *, size_t start, size_t len);
bool (*release_folio)(struct folio *, gfp_t);
@@ -275,12 +279,11 @@ locking rules:
====================== ======================== ========= ===============
ops folio locked i_rwsem invalidate_lock
====================== ======================== ========= ===============
-writepage: yes, unlocks (see below)
read_folio: yes, unlocks shared
writepages:
dirty_folio: maybe
readahead: yes, unlocks shared
-write_begin: locks the page exclusive
+write_begin: locks the folio exclusive
write_end: yes, unlocks exclusive
bmap:
invalidate_folio: yes exclusive
@@ -304,54 +307,6 @@ completion.
->readahead() unlocks the folios that I/O is attempted on like ->read_folio().
-->writepage() is used for two purposes: for "memory cleansing" and for
-"sync". These are quite different operations and the behaviour may differ
-depending upon the mode.
-
-If writepage is called for sync (wbc->sync_mode != WBC_SYNC_NONE) then
-it *must* start I/O against the page, even if that would involve
-blocking on in-progress I/O.
-
-If writepage is called for memory cleansing (sync_mode ==
-WBC_SYNC_NONE) then its role is to get as much writeout underway as
-possible. So writepage should try to avoid blocking against
-currently-in-progress I/O.
-
-If the filesystem is not called for "sync" and it determines that it
-would need to block against in-progress I/O to be able to start new I/O
-against the page the filesystem should redirty the page with
-redirty_page_for_writepage(), then unlock the page and return zero.
-This may also be done to avoid internal deadlocks, but rarely.
-
-If the filesystem is called for sync then it must wait on any
-in-progress I/O and then start new I/O.
-
-The filesystem should unlock the page synchronously, before returning to the
-caller, unless ->writepage() returns special WRITEPAGE_ACTIVATE
-value. WRITEPAGE_ACTIVATE means that page cannot really be written out
-currently, and VM should stop calling ->writepage() on this page for some
-time. VM does this by moving page to the head of the active list, hence the
-name.
-
-Unless the filesystem is going to redirty_page_for_writepage(), unlock the page
-and return zero, writepage *must* run set_page_writeback() against the page,
-followed by unlocking it. Once set_page_writeback() has been run against the
-page, write I/O can be submitted and the write I/O completion handler must run
-end_page_writeback() once the I/O is complete. If no I/O is submitted, the
-filesystem must run end_page_writeback() against the page before returning from
-writepage.
-
-That is: after 2.5.12, pages which are under writeout are *not* locked. Note,
-if the filesystem needs the page to be locked during writeout, that is ok, too,
-the page is allowed to be unlocked at any point in time between the calls to
-set_page_writeback() and end_page_writeback().
-
-Note, failure to run either redirty_page_for_writepage() or the combination of
-set_page_writeback()/end_page_writeback() on a page submitted to writepage
-will leave the page itself marked clean but it will be tagged as dirty in the
-radix tree. This incoherency can lead to all sorts of hard-to-debug problems
-in the filesystem like having dirty inodes at umount and losing written data.
-
->writepages() is used for periodic writeback and for syscall-initiated
sync operations. The address_space should start I/O against at least
``*nr_to_write`` pages. ``*nr_to_write`` must be decremented for each page
@@ -359,8 +314,8 @@ which is written. The address_space implementation may write more (or less)
pages than ``*nr_to_write`` asks for, but it should try to be reasonably close.
If nr_to_write is NULL, all dirty pages must be written.
-writepages should _only_ write pages which are present on
-mapping->io_pages.
+writepages should _only_ write pages which are present in
+mapping->i_pages.
->dirty_folio() is called from various places in the kernel when
the target folio is marked as needing writeback. The folio cannot be
diff --git a/Documentation/filesystems/mount_api.rst b/Documentation/filesystems/mount_api.rst
index 9aaf6ef75eb5..e149b89118c8 100644
--- a/Documentation/filesystems/mount_api.rst
+++ b/Documentation/filesystems/mount_api.rst
@@ -645,6 +645,8 @@ The members are as follows:
fs_param_is_blockdev Blockdev path * Needs lookup
fs_param_is_path Path * Needs lookup
fs_param_is_fd File descriptor result->int_32
+ fs_param_is_uid User ID (u32) result->uid
+ fs_param_is_gid Group ID (u32) result->gid
======================= ======================= =====================
Note that if the value is of fs_param_is_bool type, fs_parse() will try
@@ -669,7 +671,6 @@ The members are as follows:
fsparam_bool() fs_param_is_bool
fsparam_u32() fs_param_is_u32
fsparam_u32oct() fs_param_is_u32_octal
- fsparam_u32hex() fs_param_is_u32_hex
fsparam_s32() fs_param_is_s32
fsparam_u64() fs_param_is_u64
fsparam_enum() fs_param_is_enum
@@ -678,6 +679,8 @@ The members are as follows:
fsparam_bdev() fs_param_is_blockdev
fsparam_path() fs_param_is_path
fsparam_fd() fs_param_is_fd
+ fsparam_uid() fs_param_is_uid
+ fsparam_gid() fs_param_is_gid
======================= ===============================================
all of which take two arguments, name string and option number - for
@@ -751,22 +754,8 @@ process the parameters it is given.
* ::
- bool validate_constant_table(const struct constant_table *tbl,
- size_t tbl_size,
- int low, int high, int special);
-
- Validate a constant table. Checks that all the elements are appropriately
- ordered, that there are no duplicates and that the values are between low
- and high inclusive, though provision is made for one allowable special
- value outside of that range. If no special value is required, special
- should just be set to lie inside the low-to-high range.
-
- If all is good, true is returned. If the table is invalid, errors are
- logged to the kernel log buffer and false is returned.
-
- * ::
-
- bool fs_validate_description(const struct fs_parameter_description *desc);
+ bool fs_validate_description(const char *name,
+ const struct fs_parameter_description *desc);
This performs some validation checks on a parameter description. It
returns true if the description is good and false if it is not. It will
@@ -784,8 +773,9 @@ process the parameters it is given.
option number (which it returns).
If successful, and if the parameter type indicates the result is a
- boolean, integer or enum type, the value is converted by this function and
- the result stored in result->{boolean,int_32,uint_32,uint_64}.
+ boolean, integer, enum, uid, or gid type, the value is converted by this
+ function and the result stored in
+ result->{boolean,int_32,uint_32,uint_64,uid,gid}.
If a match isn't initially made, the key is prefixed with "no" and no
value is present then an attempt will be made to look up the key with the
diff --git a/Documentation/filesystems/multigrain-ts.rst b/Documentation/filesystems/multigrain-ts.rst
new file mode 100644
index 000000000000..c779e47284e8
--- /dev/null
+++ b/Documentation/filesystems/multigrain-ts.rst
@@ -0,0 +1,125 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+Multigrain Timestamps
+=====================
+
+Introduction
+============
+Historically, the kernel has always used coarse time values to stamp inodes.
+This value is updated every jiffy, so any change that happens within that jiffy
+will end up with the same timestamp.
+
+When the kernel goes to stamp an inode (due to a read or write), it first gets
+the current time and then compares it to the existing timestamp(s) to see
+whether anything will change. If nothing changed, then it can avoid updating
+the inode's metadata.
+
+Coarse timestamps are therefore good from a performance standpoint, since they
+reduce the need for metadata updates, but bad from the standpoint of
+determining whether anything has changed, since a lot of things can happen in a
+jiffy.
+
+They are particularly troublesome with NFSv3, where unchanging timestamps can
+make it difficult to tell whether to invalidate caches. NFSv4 provides a
+dedicated change attribute that should always show a visible change, but not
+all filesystems implement this properly, causing the NFS server to substitute
+the ctime in many cases.
+
+Multigrain timestamps aim to remedy this by selectively using fine-grained
+timestamps when a file has had its timestamps queried recently, and the current
+coarse-grained time does not cause a change.
+
+Inode Timestamps
+================
+There are currently 3 timestamps in the inode that are updated to the current
+wallclock time on different activity:
+
+ctime:
+ The inode change time. This is stamped with the current time whenever
+ the inode's metadata is changed. Note that this value is not settable
+ from userland.
+
+mtime:
+ The inode modification time. This is stamped with the current time
+ any time a file's contents change.
+
+atime:
+ The inode access time. This is stamped whenever an inode's contents are
+ read. Widely considered to be a terrible mistake. Usually avoided with
+ options like noatime or relatime.
+
+Updating the mtime always implies a change to the ctime, but updating the
+atime due to a read request does not.
+
+Multigrain timestamps are only tracked for the ctime and the mtime. atimes are
+not affected and always use the coarse-grained value (subject to the floor).
+
+Inode Timestamp Ordering
+========================
+
+In addition to just providing info about changes to individual files, file
+timestamps also serve an important purpose in applications like "make". These
+programs measure timestamps in order to determine whether source files might be
+newer than cached objects.
+
+Userland applications like make can only determine ordering based on
+operational boundaries. For a syscall those are the syscall entry and exit
+points. For io_uring or nfsd operations, that's the request submission and
+response. In the case of concurrent operations, userland can make no
+determination about the order in which things will occur.
+
+For instance, if a single thread modifies one file, and then another file in
+sequence, the second file must show an equal or later mtime than the first. The
+same is true if two threads are issuing similar operations that do not overlap
+in time.
+
+If however, two threads have racing syscalls that overlap in time, then there
+is no such guarantee, and the second file may appear to have been modified
+before, after or at the same time as the first, regardless of which one was
+submitted first.
+
+Note that the above assumes that the system doesn't experience a backward jump
+of the realtime clock. If that occurs at an inopportune time, then timestamps
+can appear to go backward, even on a properly functioning system.
+
+Multigrain Timestamp Implementation
+===================================
+Multigrain timestamps are aimed at ensuring that changes to a single file are
+always recognizable, without violating the ordering guarantees when multiple
+different files are modified. This affects the mtime and the ctime, but the
+atime will always use coarse-grained timestamps.
+
+It uses an unused bit in the i_ctime_nsec field to indicate whether the mtime
+or ctime has been queried. If either or both have, then the kernel takes
+special care to ensure the next timestamp update will display a visible change.
+This ensures tight cache coherency for use-cases like NFS, without sacrificing
+the benefits of reduced metadata updates when files aren't being watched.
+
+The Ctime Floor Value
+=====================
+It's not sufficient to simply use fine or coarse-grained timestamps based on
+whether the mtime or ctime has been queried. A file could get a fine grained
+timestamp, and then a second file modified later could get a coarse-grained one
+that appears earlier than the first, which would break the kernel's timestamp
+ordering guarantees.
+
+To mitigate this problem, maintain a global floor value that ensures that
+this can't happen. The two files in the above example may appear to have been
+modified at the same time in such a case, but they will never show the reverse
+order. To avoid problems with realtime clock jumps, the floor is managed as a
+monotonic ktime_t, and the values are converted to realtime clock values as
+needed.
+
+Implementation Notes
+====================
+Multigrain timestamps are intended for use by local filesystems that get
+ctime values from the local clock. This is in contrast to network filesystems
+and the like that just mirror timestamp values from a server.
+
+For most filesystems, it's sufficient to just set the FS_MGTIME flag in the
+fstype->fs_flags in order to opt-in, providing the ctime is only ever set via
+inode_set_ctime_current(). If the filesystem has a ->getattr routine that
+doesn't call generic_fillattr, then it should call fill_mg_cmtime() to
+fill those values. For setattr, it should use setattr_copy() to update the
+timestamps, or otherwise mimic its behavior.
diff --git a/Documentation/filesystems/netfs_library.rst b/Documentation/filesystems/netfs_library.rst
index 4cc657d743f7..939b4b624fad 100644
--- a/Documentation/filesystems/netfs_library.rst
+++ b/Documentation/filesystems/netfs_library.rst
@@ -1,33 +1,187 @@
.. SPDX-License-Identifier: GPL-2.0
-=================================
-Network Filesystem Helper Library
-=================================
+===================================
+Network Filesystem Services Library
+===================================
.. Contents:
- Overview.
+ - Requests and streams.
+ - Subrequests.
+ - Result collection and retry.
+ - Local caching.
+ - Content encryption (fscrypt).
- Per-inode context.
- Inode context helper functions.
- - Buffered read helpers.
- - Read helper functions.
- - Read helper structures.
- - Read helper operations.
- - Read helper procedure.
- - Read helper cache API.
+ - Inode locking.
+ - Inode writeback.
+ - High-level VFS API.
+ - Unlocked read/write iter.
+ - Pre-locked read/write iter.
+ - Monolithic files API.
+ - Memory-mapped I/O API.
+ - High-level VM API.
+ - Deprecated PG_private2 API.
+ - I/O request API.
+ - Request structure.
+ - Stream structure.
+ - Subrequest structure.
+ - Filesystem methods.
+ - Terminating a subrequest.
+ - Local cache API.
+ - API function reference.
Overview
========
-The network filesystem helper library is a set of functions designed to aid a
-network filesystem in implementing VM/VFS operations. For the moment, that
-just includes turning various VM buffered read operations into requests to read
-from the server. The helper library, however, can also interpose other
-services, such as local caching or local data encryption.
+The network filesystem services library, netfslib, is a set of functions
+designed to aid a network filesystem in implementing VM/VFS API operations. It
+takes over the normal buffered read, readahead, write and writeback and also
+handles unbuffered and direct I/O.
-Note that the library module doesn't link against local caching directly, so
-access must be provided by the netfs.
+The library provides support for (re-)negotiation of I/O sizes and retrying
+failed I/O as well as local caching and will, in the future, provide content
+encryption.
+
+It insulates the filesystem from VM interface changes as much as possible and
+handles VM features such as large multipage folios. The filesystem basically
+just has to provide a way to perform read and write RPC calls.
+
+The way I/O is organised inside netfslib consists of a number of objects:
+
+ * A *request*. A request is used to track the progress of the I/O overall and
+ to hold on to resources. The collection of results is done at the request
+ level. The I/O within a request is divided into a number of parallel
+ streams of subrequests.
+
+ * A *stream*. A non-overlapping series of subrequests. The subrequests
+ within a stream do not have to be contiguous.
+
+ * A *subrequest*. This is the basic unit of I/O. It represents a single RPC
+ call or a single cache I/O operation. The library passes these to the
+ filesystem and the cache to perform.
+
+Requests and Streams
+--------------------
+
+When actually performing I/O (as opposed to just copying into the pagecache),
+netfslib will create one or more requests to track the progress of the I/O and
+to hold resources.
+
+A read operation will have a single stream and the subrequests within that
+stream may be of mixed origins, for instance mixing RPC subrequests and cache
+subrequests.
+
+On the other hand, a write operation may have multiple streams, where each
+stream targets a different destination. For instance, there may be one stream
+writing to the local cache and one to the server. Currently, only two streams
+are allowed, but this could be increased if parallel writes to multiple servers
+is desired.
+
+The subrequests within a write stream do not need to match alignment or size
+with the subrequests in another write stream and netfslib performs the tiling
+of subrequests in each stream over the source buffer independently. Further,
+each stream may contain holes that don't correspond to holes in the other
+stream.
+
+In addition, the subrequests do not need to correspond to the boundaries of the
+folios or vectors in the source/destination buffer. The library handles the
+collection of results and the wrangling of folio flags and references.
+
+Subrequests
+-----------
+
+Subrequests are at the heart of the interaction between netfslib and the
+filesystem using it. Each subrequest is expected to correspond to a single
+read or write RPC or cache operation. The library will stitch together the
+results from a set of subrequests to provide a higher level operation.
+
+Netfslib has two interactions with the filesystem or the cache when setting up
+a subrequest. First, there's an optional preparatory step that allows the
+filesystem to negotiate the limits on the subrequest, both in terms of maximum
+number of bytes and maximum number of vectors (e.g. for RDMA). This may
+involve negotiating with the server (e.g. cifs needing to acquire credits).
+
+And, secondly, there's the issuing step in which the subrequest is handed off
+to the filesystem to perform.
+
+Note that these two steps are done slightly differently between read and write:
+
+ * For reads, the VM/VFS tells us how much is being requested up front, so the
+ library can preset maximum values that the cache and then the filesystem can
+ then reduce. The cache also gets consulted first on whether it wants to do
+ a read before the filesystem is consulted.
+
+ * For writeback, it is unknown how much there will be to write until the
+ pagecache is walked, so no limit is set by the library.
+
+Once a subrequest is completed, the filesystem or cache informs the library of
+the completion and then collection is invoked. Depending on whether the
+request is synchronous or asynchronous, the collection of results will be done
+in either the application thread or in a work queue.
+
+Result Collection and Retry
+---------------------------
+
+As subrequests complete, the results are collected and collated by the library
+and folio unlocking is performed progressively (if appropriate). Once the
+request is complete, async completion will be invoked (again, if appropriate).
+It is possible for the filesystem to provide interim progress reports to the
+library to cause folio unlocking to happen earlier if possible.
+
+If any subrequests fail, netfslib can retry them. It will wait until all
+subrequests are completed, offer the filesystem the opportunity to fiddle with
+the resources/state held by the request and poke at the subrequests before
+re-preparing and re-issuing the subrequests.
+
+This allows the tiling of contiguous sets of failed subrequest within a stream
+to be changed, adding more subrequests or ditching excess as necessary (for
+instance, if the network sizes change or the server decides it wants smaller
+chunks).
+
+Further, if one or more contiguous cache-read subrequests fail, the library
+will pass them to the filesystem to perform instead, renegotiating and retiling
+them as necessary to fit with the filesystem's parameters rather than those of
+the cache.
+
+Local Caching
+-------------
+
+One of the services netfslib provides, via ``fscache``, is the option to cache
+on local disk a copy of the data obtained from/written to a network filesystem.
+The library will manage the storing, retrieval and some invalidation of data
+automatically on behalf of the filesystem if a cookie is attached to the
+``netfs_inode``.
+
+Note that local caching used to use the PG_private_2 (aliased as PG_fscache) to
+keep track of a page that was being written to the cache, but this is now
+deprecated as PG_private_2 will be removed.
+
+Instead, folios that are read from the server for which there was no data in
+the cache will be marked as dirty and will have ``folio->private`` set to a
+special value (``NETFS_FOLIO_COPY_TO_CACHE``) and left to writeback to write.
+If the folio is modified before that happened, the special value will be
+cleared and the write will become normally dirty.
+
+When writeback occurs, folios that are so marked will only be written to the
+cache and not to the server. Writeback handles mixed cache-only writes and
+server-and-cache writes by using two streams, sending one to the cache and one
+to the server. The server stream will have gaps in it corresponding to those
+folios.
+
+Content Encryption (fscrypt)
+----------------------------
+
+Though it does not do so yet, at some point netfslib will acquire the ability
+to do client-side content encryption on behalf of the network filesystem (Ceph,
+for example). fscrypt can be used for this if appropriate (it may not be -
+cifs, for example).
+
+The data will be stored encrypted in the local cache using the same manner of
+encryption as the data written to the server and the library will impose bounce
+buffering and RMW cycles as necessary.
Per-Inode Context
@@ -40,10 +194,13 @@ structure is defined::
struct netfs_inode {
struct inode inode;
const struct netfs_request_ops *ops;
- struct fscache_cookie *cache;
+ struct fscache_cookie * cache;
+ loff_t remote_i_size;
+ unsigned long flags;
+ ...
};
-A network filesystem that wants to use netfs lib must place one of these in its
+A network filesystem that wants to use netfslib must place one of these in its
inode wrapper struct instead of the VFS ``struct inode``. This can be done in
a way similar to the following::
@@ -56,7 +213,8 @@ This allows netfslib to find its state by using ``container_of()`` from the
inode pointer, thereby allowing the netfslib helper functions to be pointed to
directly by the VFS/VM operation tables.
-The structure contains the following fields:
+The structure contains the following fields that are of interest to the
+filesystem:
* ``inode``
@@ -71,6 +229,37 @@ The structure contains the following fields:
Local caching cookie, or NULL if no caching is enabled. This field does not
exist if fscache is disabled.
+ * ``remote_i_size``
+
+ The size of the file on the server. This differs from inode->i_size if
+ local modifications have been made but not yet written back.
+
+ * ``flags``
+
+ A set of flags, some of which the filesystem might be interested in:
+
+ * ``NETFS_ICTX_MODIFIED_ATTR``
+
+ Set if netfslib modifies mtime/ctime. The filesystem is free to ignore
+ this or clear it.
+
+ * ``NETFS_ICTX_UNBUFFERED``
+
+ Do unbuffered I/O upon the file. Like direct I/O but without the
+ alignment limitations. RMW will be performed if necessary. The pagecache
+ will not be used unless mmap() is also used.
+
+ * ``NETFS_ICTX_WRITETHROUGH``
+
+ Do writethrough caching upon the file. I/O will be set up and dispatched
+ as buffered writes are made to the page cache. mmap() does the normal
+ writeback thing.
+
+ * ``NETFS_ICTX_SINGLE_NO_UPLOAD``
+
+ Set if the file has a monolithic content that must be read entirely in a
+ single go and must not be written back to the server, though it can be
+ cached (e.g. AFS directories).
Inode Context Helper Functions
------------------------------
@@ -84,117 +273,250 @@ set the operations table pointer::
then a function to cast from the VFS inode structure to the netfs context::
- struct netfs_inode *netfs_node(struct inode *inode);
+ struct netfs_inode *netfs_inode(struct inode *inode);
and finally, a function to get the cache cookie pointer from the context
attached to an inode (or NULL if fscache is disabled)::
struct fscache_cookie *netfs_i_cookie(struct netfs_inode *ctx);
+Inode Locking
+-------------
+
+A number of functions are provided to manage the locking of i_rwsem for I/O and
+to effectively extend it to provide more separate classes of exclusion::
+
+ int netfs_start_io_read(struct inode *inode);
+ void netfs_end_io_read(struct inode *inode);
+ int netfs_start_io_write(struct inode *inode);
+ void netfs_end_io_write(struct inode *inode);
+ int netfs_start_io_direct(struct inode *inode);
+ void netfs_end_io_direct(struct inode *inode);
+
+The exclusion breaks down into four separate classes:
+
+ 1) Buffered reads and writes.
+
+ Buffered reads can run concurrently each other and with buffered writes,
+ but buffered writes cannot run concurrently with each other.
+
+ 2) Direct reads and writes.
+
+ Direct (and unbuffered) reads and writes can run concurrently since they do
+ not share local buffering (i.e. the pagecache) and, in a network
+ filesystem, are expected to have exclusion managed on the server (though
+ this may not be the case for, say, Ceph).
+
+ 3) Other major inode modifying operations (e.g. truncate, fallocate).
+
+ These should just access i_rwsem directly.
+
+ 4) mmap().
+
+ mmap'd accesses might operate concurrently with any of the other classes.
+ They might form the buffer for an intra-file loopback DIO read/write. They
+ might be permitted on unbuffered files.
+
+Inode Writeback
+---------------
+
+Netfslib will pin resources on an inode for future writeback (such as pinning
+use of an fscache cookie) when an inode is dirtied. However, this pinning
+needs careful management. To manage the pinning, the following sequence
+occurs:
+
+ 1) An inode state flag ``I_PINNING_NETFS_WB`` is set by netfslib when the
+ pinning begins (when a folio is dirtied, for example) if the cache is
+ active to stop the cache structures from being discarded and the cache
+ space from being culled. This also prevents re-getting of cache resources
+ if the flag is already set.
+
+ 2) This flag then cleared inside the inode lock during inode writeback in the
+ VM - and the fact that it was set is transferred to ``->unpinned_netfs_wb``
+ in ``struct writeback_control``.
+
+ 3) If ``->unpinned_netfs_wb`` is now set, the write_inode procedure is forced.
+
+ 4) The filesystem's ``->write_inode()`` function is invoked to do the cleanup.
+
+ 5) The filesystem invokes netfs to do its cleanup.
+
+To do the cleanup, netfslib provides a function to do the resource unpinning::
+
+ int netfs_unpin_writeback(struct inode *inode, struct writeback_control *wbc);
+
+If the filesystem doesn't need to do anything else, this may be set as a its
+``.write_inode`` method.
+
+Further, if an inode is deleted, the filesystem's write_inode method may not
+get called, so::
+
+ void netfs_clear_inode_writeback(struct inode *inode, const void *aux);
-Buffered Read Helpers
-=====================
+must be called from ``->evict_inode()`` *before* ``clear_inode()`` is called.
-The library provides a set of read helpers that handle the ->read_folio(),
-->readahead() and much of the ->write_begin() VM operations and translate them
-into a common call framework.
-The following services are provided:
+High-Level VFS API
+==================
- * Handle folios that span multiple pages.
+Netfslib provides a number of sets of API calls for the filesystem to delegate
+VFS operations to. Netfslib, in turn, will call out to the filesystem and the
+cache to negotiate I/O sizes, issue RPCs and provide places for it to intervene
+at various times.
- * Insulate the netfs from VM interface changes.
+Unlocked Read/Write Iter
+------------------------
- * Allow the netfs to arbitrarily split reads up into pieces, even ones that
- don't match folio sizes or folio alignments and that may cross folios.
+The first API set is for the delegation of operations to netfslib when the
+filesystem is called through the standard VFS read/write_iter methods::
- * Allow the netfs to expand a readahead request in both directions to meet its
- needs.
+ ssize_t netfs_file_read_iter(struct kiocb *iocb, struct iov_iter *iter);
+ ssize_t netfs_file_write_iter(struct kiocb *iocb, struct iov_iter *from);
+ ssize_t netfs_buffered_read_iter(struct kiocb *iocb, struct iov_iter *iter);
+ ssize_t netfs_unbuffered_read_iter(struct kiocb *iocb, struct iov_iter *iter);
+ ssize_t netfs_unbuffered_write_iter(struct kiocb *iocb, struct iov_iter *from);
- * Allow the netfs to partially fulfil a read, which will then be resubmitted.
+They can be assigned directly to ``.read_iter`` and ``.write_iter``. They
+perform the inode locking themselves and the first two will switch between
+buffered I/O and DIO as appropriate.
- * Handle local caching, allowing cached data and server-read data to be
- interleaved for a single request.
+Pre-Locked Read/Write Iter
+--------------------------
- * Handle clearing of bufferage that aren't on the server.
+The second API set is for the delegation of operations to netfslib when the
+filesystem is called through the standard VFS methods, but needs to do some
+other stuff before or after calling netfslib whilst still inside locked section
+(e.g. Ceph negotiating caps). The unbuffered read function is::
- * Handle retrying of reads that failed, switching reads from the cache to the
- server as necessary.
+ ssize_t netfs_unbuffered_read_iter_locked(struct kiocb *iocb, struct iov_iter *iter);
- * In the future, this is a place that other services can be performed, such as
- local encryption of data to be stored remotely or in the cache.
+This must not be assigned directly to ``.read_iter`` and the filesystem is
+responsible for performing the inode locking before calling it. In the case of
+buffered read, the filesystem should use ``filemap_read()``.
-From the network filesystem, the helpers require a table of operations. This
-includes a mandatory method to issue a read operation along with a number of
-optional methods.
+There are three functions for writes::
+ ssize_t netfs_buffered_write_iter_locked(struct kiocb *iocb, struct iov_iter *from,
+ struct netfs_group *netfs_group);
+ ssize_t netfs_perform_write(struct kiocb *iocb, struct iov_iter *iter,
+ struct netfs_group *netfs_group);
+ ssize_t netfs_unbuffered_write_iter_locked(struct kiocb *iocb, struct iov_iter *iter,
+ struct netfs_group *netfs_group);
-Read Helper Functions
+These must not be assigned directly to ``.write_iter`` and the filesystem is
+responsible for performing the inode locking before calling them.
+
+The first two functions are for buffered writes; the first just adds some
+standard write checks and jumps to the second, but if the filesystem wants to
+do the checks itself, it can use the second directly. The third function is
+for unbuffered or DIO writes.
+
+On all three write functions, there is a writeback group pointer (which should
+be NULL if the filesystem doesn't use this). Writeback groups are set on
+folios when they're modified. If a folio to-be-modified is already marked with
+a different group, it is flushed first. The writeback API allows writing back
+of a specific group.
+
+Memory-Mapped I/O API
---------------------
-Three read helpers are provided::
+An API for support of mmap()'d I/O is provided::
+
+ vm_fault_t netfs_page_mkwrite(struct vm_fault *vmf, struct netfs_group *netfs_group);
+
+This allows the filesystem to delegate ``.page_mkwrite`` to netfslib. The
+filesystem should not take the inode lock before calling it, but, as with the
+locked write functions above, this does take a writeback group pointer. If the
+page to be made writable is in a different group, it will be flushed first.
+
+Monolithic Files API
+--------------------
+
+There is also a special API set for files for which the content must be read in
+a single RPC (and not written back) and is maintained as a monolithic blob
+(e.g. an AFS directory), though it can be stored and updated in the local cache::
+
+ ssize_t netfs_read_single(struct inode *inode, struct file *file, struct iov_iter *iter);
+ void netfs_single_mark_inode_dirty(struct inode *inode);
+ int netfs_writeback_single(struct address_space *mapping,
+ struct writeback_control *wbc,
+ struct iov_iter *iter);
+
+The first function reads from a file into the given buffer, reading from the
+cache in preference if the data is cached there; the second function allows the
+inode to be marked dirty, causing a later writeback; and the third function can
+be called from the writeback code to write the data to the cache, if there is
+one.
- void netfs_readahead(struct readahead_control *ractl);
- int netfs_read_folio(struct file *file,
- struct folio *folio);
- int netfs_write_begin(struct netfs_inode *ctx,
- struct file *file,
- struct address_space *mapping,
- loff_t pos,
- unsigned int len,
- struct folio **_folio,
- void **_fsdata);
+The inode should be marked ``NETFS_ICTX_SINGLE_NO_UPLOAD`` if this API is to be
+used. The writeback function requires the buffer to be of ITER_FOLIOQ type.
-Each corresponds to a VM address space operation. These operations use the
-state in the per-inode context.
+High-Level VM API
+==================
-For ->readahead() and ->read_folio(), the network filesystem just point directly
-at the corresponding read helper; whereas for ->write_begin(), it may be a
-little more complicated as the network filesystem might want to flush
-conflicting writes or track dirty data and needs to put the acquired folio if
-an error occurs after calling the helper.
+Netfslib also provides a number of sets of API calls for the filesystem to
+delegate VM operations to. Again, netfslib, in turn, will call out to the
+filesystem and the cache to negotiate I/O sizes, issue RPCs and provide places
+for it to intervene at various times::
-The helpers manage the read request, calling back into the network filesystem
-through the supplied table of operations. Waits will be performed as
-necessary before returning for helpers that are meant to be synchronous.
+ void netfs_readahead(struct readahead_control *);
+ int netfs_read_folio(struct file *, struct folio *);
+ int netfs_writepages(struct address_space *mapping,
+ struct writeback_control *wbc);
+ bool netfs_dirty_folio(struct address_space *mapping, struct folio *folio);
+ void netfs_invalidate_folio(struct folio *folio, size_t offset, size_t length);
+ bool netfs_release_folio(struct folio *folio, gfp_t gfp);
-If an error occurs, the ->free_request() will be called to clean up the
-netfs_io_request struct allocated. If some parts of the request are in
-progress when an error occurs, the request will get partially completed if
-sufficient data is read.
+These are ``address_space_operations`` methods and can be set directly in the
+operations table.
-Additionally, there is::
+Deprecated PG_private_2 API
+---------------------------
- * void netfs_subreq_terminated(struct netfs_io_subrequest *subreq,
- ssize_t transferred_or_error,
- bool was_async);
+There is also a deprecated function for filesystems that still use the
+``->write_begin`` method::
-which should be called to complete a read subrequest. This is given the number
-of bytes transferred or a negative error code, plus a flag indicating whether
-the operation was asynchronous (ie. whether the follow-on processing can be
-done in the current context, given this may involve sleeping).
+ int netfs_write_begin(struct netfs_inode *inode, struct file *file,
+ struct address_space *mapping, loff_t pos, unsigned int len,
+ struct folio **_folio, void **_fsdata);
+It uses the deprecated PG_private_2 flag and so should not be used.
-Read Helper Structures
-----------------------
-The read helpers make use of a couple of structures to maintain the state of
-the read. The first is a structure that manages a read request as a whole::
+I/O Request API
+===============
+
+The I/O request API comprises a number of structures and a number of functions
+that the filesystem may need to use.
+
+Request Structure
+-----------------
+
+The request structure manages the request as a whole, holding some resources
+and state on behalf of the filesystem and tracking the collection of results::
struct netfs_io_request {
+ enum netfs_io_origin origin;
struct inode *inode;
struct address_space *mapping;
- struct netfs_cache_resources cache_resources;
+ struct netfs_group *group;
+ struct netfs_io_stream io_streams[];
void *netfs_priv;
- loff_t start;
- size_t len;
- loff_t i_size;
- const struct netfs_request_ops *netfs_ops;
+ void *netfs_priv2;
+ unsigned long long start;
+ unsigned long long len;
+ unsigned long long i_size;
unsigned int debug_id;
+ unsigned long flags;
...
};
-The above fields are the ones the netfs can use. They are:
+Many of the fields are for internal use, but the fields shown here are of
+interest to the filesystem:
+
+ * ``origin``
+
+ The origin of the request (readahead, read_folio, DIO read, writeback, ...).
* ``inode``
* ``mapping``
@@ -202,11 +524,19 @@ The above fields are the ones the netfs can use. They are:
The inode and the address space of the file being read from. The mapping
may or may not point to inode->i_data.
- * ``cache_resources``
+ * ``group``
+
+ The writeback group this request is dealing with or NULL. This holds a ref
+ on the group.
+
+ * ``io_streams``
- Resources for the local cache to use, if present.
+ The parallel streams of subrequests available to the request. Currently two
+ are available, but this may be made extensible in future. ``NR_IO_STREAMS``
+ indicates the size of the array.
* ``netfs_priv``
+ * ``netfs_priv2``
The network filesystem's private data. The value for this can be passed in
to the helper functions or set during the request.
@@ -221,37 +551,121 @@ The above fields are the ones the netfs can use. They are:
The size of the file at the start of the request.
- * ``netfs_ops``
-
- A pointer to the operation table. The value for this is passed into the
- helper functions.
-
* ``debug_id``
A number allocated to this operation that can be displayed in trace lines
for reference.
+ * ``flags``
+
+ Flags for managing and controlling the operation of the request. Some of
+ these may be of interest to the filesystem:
+
+ * ``NETFS_RREQ_RETRYING``
+
+ Netfslib sets this when generating retries.
+
+ * ``NETFS_RREQ_PAUSE``
+
+ The filesystem can set this to request to pause the library's subrequest
+ issuing loop - but care needs to be taken as netfslib may also set it.
+
+ * ``NETFS_RREQ_NONBLOCK``
+ * ``NETFS_RREQ_BLOCKED``
+
+ Netfslib sets the first to indicate that non-blocking mode was set by the
+ caller and the filesystem can set the second to indicate that it would
+ have had to block.
+
+ * ``NETFS_RREQ_USE_PGPRIV2``
+
+ The filesystem can set this if it wants to use PG_private_2 to track
+ whether a folio is being written to the cache. This is deprecated as
+ PG_private_2 is going to go away.
+
+If the filesystem wants more private data than is afforded by this structure,
+then it should wrap it and provide its own allocator.
+
+Stream Structure
+----------------
+
+A request is comprised of one or more parallel streams and each stream may be
+aimed at a different target.
+
+For read requests, only stream 0 is used. This can contain a mixture of
+subrequests aimed at different sources. For write requests, stream 0 is used
+for the server and stream 1 is used for the cache. For buffered writeback,
+stream 0 is not enabled unless a normal dirty folio is encountered, at which
+point ->begin_writeback() will be invoked and the filesystem can mark the
+stream available.
+
+The stream struct looks like::
+
+ struct netfs_io_stream {
+ unsigned char stream_nr;
+ bool avail;
+ size_t sreq_max_len;
+ unsigned int sreq_max_segs;
+ unsigned int submit_extendable_to;
+ ...
+ };
+
+A number of members are available for access/use by the filesystem:
+
+ * ``stream_nr``
+
+ The number of the stream within the request.
+
+ * ``avail``
+
+ True if the stream is available for use. The filesystem should set this on
+ stream zero if in ->begin_writeback().
+
+ * ``sreq_max_len``
+ * ``sreq_max_segs``
+
+ These are set by the filesystem or the cache in ->prepare_read() or
+ ->prepare_write() for each subrequest to indicate the maximum number of
+ bytes and, optionally, the maximum number of segments (if not 0) that that
+ subrequest can support.
+
+ * ``submit_extendable_to``
-The second structure is used to manage individual slices of the overall read
-request::
+ The size that a subrequest can be rounded up to beyond the EOF, given the
+ available buffer. This allows the cache to work out if it can do a DIO read
+ or write that straddles the EOF marker.
+
+Subrequest Structure
+--------------------
+
+Individual units of I/O are managed by the subrequest structure. These
+represent slices of the overall request and run independently::
struct netfs_io_subrequest {
struct netfs_io_request *rreq;
- loff_t start;
+ struct iov_iter io_iter;
+ unsigned long long start;
size_t len;
size_t transferred;
unsigned long flags;
+ short error;
unsigned short debug_index;
+ unsigned char stream_nr;
...
};
-Each subrequest is expected to access a single source, though the helpers will
+Each subrequest is expected to access a single source, though the library will
handle falling back from one source type to another. The members are:
* ``rreq``
A pointer to the read request.
+ * ``io_iter``
+
+ An I/O iterator representing a slice of the buffer to be read into or
+ written from.
+
* ``start``
* ``len``
@@ -260,241 +674,300 @@ handle falling back from one source type to another. The members are:
* ``transferred``
- The amount of data transferred so far of the length of this slice. The
- network filesystem or cache should start the operation this far into the
- slice. If a short read occurs, the helpers will call again, having updated
- this to reflect the amount read so far.
+ The amount of data transferred so far for this subrequest. This should be
+ added to with the length of the transfer made by this issuance of the
+ subrequest. If this is less than ``len`` then the subrequest may be
+ reissued to continue.
* ``flags``
- Flags pertaining to the read. There are two of interest to the filesystem
- or cache:
+ Flags for managing the subrequest. There are a number of interest to the
+ filesystem or cache:
+
+ * ``NETFS_SREQ_MADE_PROGRESS``
+
+ Set by the filesystem to indicates that at least one byte of data was read
+ or written.
+
+ * ``NETFS_SREQ_HIT_EOF``
+
+ The filesystem should set this if a read hit the EOF on the file (in which
+ case ``transferred`` should stop at the EOF). Netfslib may expand the
+ subrequest out to the size of the folio containing the EOF on the off
+ chance that a third party change happened or a DIO read may have asked for
+ more than is available. The library will clear any excess pagecache.
* ``NETFS_SREQ_CLEAR_TAIL``
- This can be set to indicate that the remainder of the slice, from
- transferred to len, should be cleared.
+ The filesystem can set this to indicate that the remainder of the slice,
+ from transferred to len, should be cleared. Do not set if HIT_EOF is set.
+
+ * ``NETFS_SREQ_NEED_RETRY``
+
+ The filesystem can set this to tell netfslib to retry the subrequest.
+
+ * ``NETFS_SREQ_BOUNDARY``
+
+ This can be set by the filesystem on a subrequest to indicate that it ends
+ at a boundary with the filesystem structure (e.g. at the end of a Ceph
+ object). It tells netfslib not to retile subrequests across it.
* ``NETFS_SREQ_SEEK_DATA_READ``
- This is a hint to the cache that it might want to try skipping ahead to
- the next data (ie. using SEEK_DATA).
+ This is a hint from netfslib to the cache that it might want to try
+ skipping ahead to the next data (ie. using SEEK_DATA).
+
+ * ``error``
+
+ This is for the filesystem to store result of the subrequest. It should be
+ set to 0 if successful and a negative error code otherwise.
* ``debug_index``
+ * ``stream_nr``
A number allocated to this slice that can be displayed in trace lines for
- reference.
+ reference and the number of the request stream that it belongs to.
+If necessary, the filesystem can get and put extra refs on the subrequest it is
+given::
-Read Helper Operations
-----------------------
+ void netfs_get_subrequest(struct netfs_io_subrequest *subreq,
+ enum netfs_sreq_ref_trace what);
+ void netfs_put_subrequest(struct netfs_io_subrequest *subreq,
+ enum netfs_sreq_ref_trace what);
-The network filesystem must provide the read helpers with a table of operations
-through which it can issue requests and negotiate::
+using netfs trace codes to indicate the reason. Care must be taken, however,
+as once control of the subrequest is returned to netfslib, the same subrequest
+can be reissued/retried.
+
+Filesystem Methods
+------------------
+
+The filesystem sets a table of operations in ``netfs_inode`` for netfslib to
+use::
struct netfs_request_ops {
- void (*init_request)(struct netfs_io_request *rreq, struct file *file);
+ mempool_t *request_pool;
+ mempool_t *subrequest_pool;
+ int (*init_request)(struct netfs_io_request *rreq, struct file *file);
void (*free_request)(struct netfs_io_request *rreq);
+ void (*free_subrequest)(struct netfs_io_subrequest *rreq);
void (*expand_readahead)(struct netfs_io_request *rreq);
- bool (*clamp_length)(struct netfs_io_subrequest *subreq);
+ int (*prepare_read)(struct netfs_io_subrequest *subreq);
void (*issue_read)(struct netfs_io_subrequest *subreq);
- bool (*is_still_valid)(struct netfs_io_request *rreq);
- int (*check_write_begin)(struct file *file, loff_t pos, unsigned len,
- struct folio **foliop, void **_fsdata);
void (*done)(struct netfs_io_request *rreq);
+ void (*update_i_size)(struct inode *inode, loff_t i_size);
+ void (*post_modify)(struct inode *inode);
+ void (*begin_writeback)(struct netfs_io_request *wreq);
+ void (*prepare_write)(struct netfs_io_subrequest *subreq);
+ void (*issue_write)(struct netfs_io_subrequest *subreq);
+ void (*retry_request)(struct netfs_io_request *wreq,
+ struct netfs_io_stream *stream);
+ void (*invalidate_cache)(struct netfs_io_request *wreq);
};
-The operations are as follows:
-
- * ``init_request()``
+The table starts with a pair of optional pointers to memory pools from which
+requests and subrequests can be allocated. If these are not given, netfslib
+has default pools that it will use instead. If the filesystem wraps the netfs
+structs in its own larger structs, then it will need to use its own pools.
+Netfslib will allocate directly from the pools.
- [Optional] This is called to initialise the request structure. It is given
- the file for reference.
+The methods defined in the table are:
+ * ``init_request()``
* ``free_request()``
+ * ``free_subrequest()``
- [Optional] This is called as the request is being deallocated so that the
- filesystem can clean up any state it has attached there.
+ [Optional] A filesystem may implement these to initialise or clean up any
+ resources that it attaches to the request or subrequest.
* ``expand_readahead()``
[Optional] This is called to allow the filesystem to expand the size of a
- readahead read request. The filesystem gets to expand the request in both
- directions, though it's not permitted to reduce it as the numbers may
- represent an allocation already made. If local caching is enabled, it gets
- to expand the request first.
+ readahead request. The filesystem gets to expand the request in both
+ directions, though it must retain the initial region as that may represent
+ an allocation already made. If local caching is enabled, it gets to expand
+ the request first.
Expansion is communicated by changing ->start and ->len in the request
structure. Note that if any change is made, ->len must be increased by at
least as much as ->start is reduced.
- * ``clamp_length()``
-
- [Optional] This is called to allow the filesystem to reduce the size of a
- subrequest. The filesystem can use this, for example, to chop up a request
- that has to be split across multiple servers or to put multiple reads in
- flight.
-
- This should return 0 on success and an error code on error.
-
- * ``issue_read()``
+ * ``prepare_read()``
- [Required] The helpers use this to dispatch a subrequest to the server for
- reading. In the subrequest, ->start, ->len and ->transferred indicate what
- data should be read from the server.
+ [Optional] This is called to allow the filesystem to limit the size of a
+ subrequest. It may also limit the number of individual regions in iterator,
+ such as required by RDMA. This information should be set on stream zero in::
- There is no return value; the netfs_subreq_terminated() function should be
- called to indicate whether or not the operation succeeded and how much data
- it transferred. The filesystem also should not deal with setting folios
- uptodate, unlocking them or dropping their refs - the helpers need to deal
- with this as they have to coordinate with copying to the local cache.
+ rreq->io_streams[0].sreq_max_len
+ rreq->io_streams[0].sreq_max_segs
- Note that the helpers have the folios locked, but not pinned. It is
- possible to use the ITER_XARRAY iov iterator to refer to the range of the
- inode that is being operated upon without the need to allocate large bvec
- tables.
+ The filesystem can use this, for example, to chop up a request that has to
+ be split across multiple servers or to put multiple reads in flight.
- * ``is_still_valid()``
+ Zero should be returned on success and an error code otherwise.
- [Optional] This is called to find out if the data just read from the local
- cache is still valid. It should return true if it is still valid and false
- if not. If it's not still valid, it will be reread from the server.
+ * ``issue_read()``
- * ``check_write_begin()``
+ [Required] Netfslib calls this to dispatch a subrequest to the server for
+ reading. In the subrequest, ->start, ->len and ->transferred indicate what
+ data should be read from the server and ->io_iter indicates the buffer to be
+ used.
- [Optional] This is called from the netfs_write_begin() helper once it has
- allocated/grabbed the folio to be modified to allow the filesystem to flush
- conflicting state before allowing it to be modified.
+ There is no return value; the ``netfs_read_subreq_terminated()`` function
+ should be called to indicate that the subrequest completed either way.
+ ->error, ->transferred and ->flags should be updated before completing. The
+ termination can be done asynchronously.
- It may unlock and discard the folio it was given and set the caller's folio
- pointer to NULL. It should return 0 if everything is now fine (``*foliop``
- left set) or the op should be retried (``*foliop`` cleared) and any other
- error code to abort the operation.
+ Note: the filesystem must not deal with setting folios uptodate, unlocking
+ them or dropping their refs - the library deals with this as it may have to
+ stitch together the results of multiple subrequests that variously overlap
+ the set of folios.
- * ``done``
+ * ``done()``
- [Optional] This is called after the folios in the request have all been
+ [Optional] This is called after the folios in a read request have all been
unlocked (and marked uptodate if applicable).
+ * ``update_i_size()``
+
+ [Optional] This is invoked by netfslib at various points during the write
+ paths to ask the filesystem to update its idea of the file size. If not
+ given, netfslib will set i_size and i_blocks and update the local cache
+ cookie.
+
+ * ``post_modify()``
+
+ [Optional] This is called after netfslib writes to the pagecache or when it
+ allows an mmap'd page to be marked as writable.
+
+ * ``begin_writeback()``
+
+ [Optional] Netfslib calls this when processing a writeback request if it
+ finds a dirty page that isn't simply marked NETFS_FOLIO_COPY_TO_CACHE,
+ indicating it must be written to the server. This allows the filesystem to
+ only set up writeback resources when it knows it's going to have to perform
+ a write.
+
+ * ``prepare_write()``
+ [Optional] This is called to allow the filesystem to limit the size of a
+ subrequest. It may also limit the number of individual regions in iterator,
+ such as required by RDMA. This information should be set on stream to which
+ the subrequest belongs::
-Read Helper Procedure
----------------------
-
-The read helpers work by the following general procedure:
-
- * Set up the request.
-
- * For readahead, allow the local cache and then the network filesystem to
- propose expansions to the read request. This is then proposed to the VM.
- If the VM cannot fully perform the expansion, a partially expanded read will
- be performed, though this may not get written to the cache in its entirety.
-
- * Loop around slicing chunks off of the request to form subrequests:
-
- * If a local cache is present, it gets to do the slicing, otherwise the
- helpers just try to generate maximal slices.
-
- * The network filesystem gets to clamp the size of each slice if it is to be
- the source. This allows rsize and chunking to be implemented.
+ rreq->io_streams[subreq->stream_nr].sreq_max_len
+ rreq->io_streams[subreq->stream_nr].sreq_max_segs
- * The helpers issue a read from the cache or a read from the server or just
- clears the slice as appropriate.
+ The filesystem can use this, for example, to chop up a request that has to
+ be split across multiple servers or to put multiple writes in flight.
- * The next slice begins at the end of the last one.
+ This is not permitted to return an error. Instead, in the event of failure,
+ ``netfs_prepare_write_failed()`` must be called.
- * As slices finish being read, they terminate.
+ * ``issue_write()``
- * When all the subrequests have terminated, the subrequests are assessed and
- any that are short or have failed are reissued:
+ [Required] This is used to dispatch a subrequest to the server for writing.
+ In the subrequest, ->start, ->len and ->transferred indicate what data
+ should be written to the server and ->io_iter indicates the buffer to be
+ used.
- * Failed cache requests are issued against the server instead.
+ There is no return value; the ``netfs_write_subreq_terminated()`` function
+ should be called to indicate that the subrequest completed either way.
+ ->error, ->transferred and ->flags should be updated before completing. The
+ termination can be done asynchronously.
- * Failed server requests just fail.
+ Note: the filesystem must not deal with removing the dirty or writeback
+ marks on folios involved in the operation and should not take refs or pins
+ on them, but should leave retention to netfslib.
- * Short reads against either source will be reissued against that source
- provided they have transferred some more data:
+ * ``retry_request()``
- * The cache may need to skip holes that it can't do DIO from.
+ [Optional] Netfslib calls this at the beginning of a retry cycle. This
+ allows the filesystem to examine the state of the request, the subrequests
+ in the indicated stream and of its own data and make adjustments or
+ renegotiate resources.
+
+ * ``invalidate_cache()``
- * If NETFS_SREQ_CLEAR_TAIL was set, a short read will be cleared to the
- end of the slice instead of reissuing.
+ [Optional] This is called by netfslib to invalidate data stored in the local
+ cache in the event that writing to the local cache fails, providing updated
+ coherency data that netfs can't provide.
- * Once the data is read, the folios that have been fully read/cleared:
+Terminating a subrequest
+------------------------
- * Will be marked uptodate.
+When a subrequest completes, there are a number of functions that the cache or
+subrequest can call to inform netfslib of the status change. One function is
+provided to terminate a write subrequest at the preparation stage and acts
+synchronously:
- * If a cache is present, will be marked with PG_fscache.
+ * ``void netfs_prepare_write_failed(struct netfs_io_subrequest *subreq);``
- * Unlocked
+ Indicate that the ->prepare_write() call failed. The ``error`` field should
+ have been updated.
- * Any folios that need writing to the cache will then have DIO writes issued.
+Note that ->prepare_read() can return an error as a read can simply be aborted.
+Dealing with writeback failure is trickier.
- * Synchronous operations will wait for reading to be complete.
+The other functions are used for subrequests that got as far as being issued:
- * Writes to the cache will proceed asynchronously and the folios will have the
- PG_fscache mark removed when that completes.
+ * ``void netfs_read_subreq_terminated(struct netfs_io_subrequest *subreq);``
- * The request structures will be cleaned up when everything has completed.
+ Tell netfslib that a read subrequest has terminated. The ``error``,
+ ``flags`` and ``transferred`` fields should have been updated.
+ * ``void netfs_write_subrequest_terminated(void *_op, ssize_t transferred_or_error);``
-Read Helper Cache API
----------------------
+ Tell netfslib that a write subrequest has terminated. Either the amount of
+ data processed or the negative error code can be passed in. This is
+ can be used as a kiocb completion function.
-When implementing a local cache to be used by the read helpers, two things are
-required: some way for the network filesystem to initialise the caching for a
-read request and a table of operations for the helpers to call.
+ * ``void netfs_read_subreq_progress(struct netfs_io_subrequest *subreq);``
-To begin a cache operation on an fscache object, the following function is
-called::
+ This is provided to optionally update netfslib on the incremental progress
+ of a read, allowing some folios to be unlocked early and does not actually
+ terminate the subrequest. The ``transferred`` field should have been
+ updated.
- int fscache_begin_read_operation(struct netfs_io_request *rreq,
- struct fscache_cookie *cookie);
+Local Cache API
+---------------
-passing in the request pointer and the cookie corresponding to the file. This
-fills in the cache resources mentioned below.
+Netfslib provides a separate API for a local cache to implement, though it
+provides some somewhat similar routines to the filesystem request API.
-The netfs_io_request object contains a place for the cache to hang its
+Firstly, the netfs_io_request object contains a place for the cache to hang its
state::
struct netfs_cache_resources {
const struct netfs_cache_ops *ops;
void *cache_priv;
void *cache_priv2;
+ unsigned int debug_id;
+ unsigned int inval_counter;
};
-This contains an operations table pointer and two private pointers. The
-operation table looks like the following::
+This contains an operations table pointer and two private pointers plus the
+debug ID of the fscache cookie for tracing purposes and an invalidation counter
+that is cranked by calls to ``fscache_invalidate()`` allowing cache subrequests
+to be invalidated after completion.
+
+The cache operation table looks like the following::
struct netfs_cache_ops {
void (*end_operation)(struct netfs_cache_resources *cres);
-
void (*expand_readahead)(struct netfs_cache_resources *cres,
loff_t *_start, size_t *_len, loff_t i_size);
-
enum netfs_io_source (*prepare_read)(struct netfs_io_subrequest *subreq,
- loff_t i_size);
-
+ loff_t i_size);
int (*read)(struct netfs_cache_resources *cres,
loff_t start_pos,
struct iov_iter *iter,
bool seek_data,
netfs_io_terminated_t term_func,
void *term_func_priv);
-
- int (*prepare_write)(struct netfs_cache_resources *cres,
- loff_t *_start, size_t *_len, loff_t i_size,
- bool no_space_allocated_yet);
-
- int (*write)(struct netfs_cache_resources *cres,
- loff_t start_pos,
- struct iov_iter *iter,
- netfs_io_terminated_t term_func,
- void *term_func_priv);
-
- int (*query_occupancy)(struct netfs_cache_resources *cres,
- loff_t start, size_t len, size_t granularity,
- loff_t *_data_start, size_t *_data_len);
+ void (*prepare_write_subreq)(struct netfs_io_subrequest *subreq);
+ void (*issue_write)(struct netfs_io_subrequest *subreq);
};
With a termination handler function pointer::
@@ -511,11 +984,17 @@ The methods defined in the table are:
* ``expand_readahead()``
- [Optional] Called at the beginning of a netfs_readahead() operation to allow
- the cache to expand a request in either direction. This allows the cache to
+ [Optional] Called at the beginning of a readahead operation to allow the
+ cache to expand a request in either direction. This allows the cache to
size the request appropriately for the cache granularity.
- The function is passed poiners to the start and length in its parameters,
+ * ``prepare_read()``
+
+ [Required] Called to configure the next slice of a request. ->start and
+ ->len in the subrequest indicate where and how big the next slice can be;
+ the cache gets to reduce the length to match its granularity requirements.
+
+ The function is passed pointers to the start and length in its parameters,
plus the size of the file for reference, and adjusts the start and length
appropriately. It should return one of:
@@ -528,12 +1007,6 @@ The methods defined in the table are:
downloaded from the server or read from the cache - or whether slicing
should be given up at the current point.
- * ``prepare_read()``
-
- [Required] Called to configure the next slice of a request. ->start and
- ->len in the subrequest indicate where and how big the next slice can be;
- the cache gets to reduce the length to match its granularity requirements.
-
* ``read()``
[Required] Called to read from the cache. The start file offset is given
@@ -547,44 +1020,33 @@ The methods defined in the table are:
indicating whether the termination is definitely happening in the caller's
context.
- * ``prepare_write()``
+ * ``prepare_write_subreq()``
- [Required] Called to prepare a write to the cache to take place. This
- involves checking to see whether the cache has sufficient space to honour
- the write. ``*_start`` and ``*_len`` indicate the region to be written; the
- region can be shrunk or it can be expanded to a page boundary either way as
- necessary to align for direct I/O. i_size holds the size of the object and
- is provided for reference. no_space_allocated_yet is set to true if the
- caller is certain that no data has been written to that region - for example
- if it tried to do a read from there already.
+ [Required] This is called to allow the cache to limit the size of a
+ subrequest. It may also limit the number of individual regions in iterator,
+ such as required by DIO/DMA. This information should be set on stream to
+ which the subrequest belongs::
- * ``write()``
+ rreq->io_streams[subreq->stream_nr].sreq_max_len
+ rreq->io_streams[subreq->stream_nr].sreq_max_segs
- [Required] Called to write to the cache. The start file offset is given
- along with an iterator to write from, which gives the length also.
-
- Also provided is a pointer to a termination handler function and private
- data to pass to that function. The termination function should be called
- with the number of bytes transferred or an error code, plus a flag
- indicating whether the termination is definitely happening in the caller's
- context.
+ The filesystem can use this, for example, to chop up a request that has to
+ be split across multiple servers or to put multiple writes in flight.
- * ``query_occupancy()``
+ This is not permitted to return an error. In the event of failure,
+ ``netfs_prepare_write_failed()`` must be called.
- [Required] Called to find out where the next piece of data is within a
- particular region of the cache. The start and length of the region to be
- queried are passed in, along with the granularity to which the answer needs
- to be aligned. The function passes back the start and length of the data,
- if any, available within that region. Note that there may be a hole at the
- front.
+ * ``issue_write()``
- It returns 0 if some data was found, -ENODATA if there was no usable data
- within the region or -ENOBUFS if there is no caching on this file.
+ [Required] This is used to dispatch a subrequest to the cache for writing.
+ In the subrequest, ->start, ->len and ->transferred indicate what data
+ should be written to the cache and ->io_iter indicates the buffer to be
+ used.
-Note that these methods are passed a pointer to the cache resource structure,
-not the read request structure as they could be used in other situations where
-there isn't a read request structure as well, such as writing dirty data to the
-cache.
+ There is no return value; the ``netfs_write_subreq_terminated()`` function
+ should be called to indicate that the subrequest completed either way.
+ ->error, ->transferred and ->flags should be updated before completing. The
+ termination can be done asynchronously.
API Function Reference
@@ -592,4 +1054,3 @@ API Function Reference
.. kernel-doc:: include/linux/netfs.h
.. kernel-doc:: fs/netfs/buffered_read.c
-.. kernel-doc:: fs/netfs/io.c
diff --git a/Documentation/filesystems/nfs/exporting.rst b/Documentation/filesystems/nfs/exporting.rst
index f04ce1215a03..de64d2d002a2 100644
--- a/Documentation/filesystems/nfs/exporting.rst
+++ b/Documentation/filesystems/nfs/exporting.rst
@@ -238,10 +238,3 @@ following flags are defined:
all of an inode's dirty data on last close. Exports that behave this
way should set EXPORT_OP_FLUSH_ON_CLOSE so that NFSD knows to skip
waiting for writeback when closing such files.
-
- EXPORT_OP_ASYNC_LOCK - Indicates a capable filesystem to do async lock
- requests from lockd. Only set EXPORT_OP_ASYNC_LOCK if the filesystem has
- it's own ->lock() functionality as core posix_lock_file() implementation
- has no async lock request handling yet. For more information about how to
- indicate an async lock request from a ->lock() file_operations struct, see
- fs/locks.c and comment for the function vfs_lock_file().
diff --git a/Documentation/filesystems/nfs/index.rst b/Documentation/filesystems/nfs/index.rst
index 8536134f31fd..95c2c009874c 100644
--- a/Documentation/filesystems/nfs/index.rst
+++ b/Documentation/filesystems/nfs/index.rst
@@ -8,6 +8,7 @@ NFS
client-identifier
exporting
+ localio
pnfs
rpc-cache
rpc-server-gss
diff --git a/Documentation/filesystems/nfs/localio.rst b/Documentation/filesystems/nfs/localio.rst
new file mode 100644
index 000000000000..79808b37d745
--- /dev/null
+++ b/Documentation/filesystems/nfs/localio.rst
@@ -0,0 +1,357 @@
+===========
+NFS LOCALIO
+===========
+
+Overview
+========
+
+The LOCALIO auxiliary RPC protocol allows the Linux NFS client and
+server to reliably handshake to determine if they are on the same
+host. Select "NFS client and server support for LOCALIO auxiliary
+protocol" in menuconfig to enable CONFIG_NFS_LOCALIO in the kernel
+config (both CONFIG_NFS_FS and CONFIG_NFSD must also be enabled).
+
+Once an NFS client and server handshake as "local", the client will
+bypass the network RPC protocol for read, write and commit operations.
+Due to this XDR and RPC bypass, these operations will operate faster.
+
+The LOCALIO auxiliary protocol's implementation, which uses the same
+connection as NFS traffic, follows the pattern established by the NFS
+ACL protocol extension.
+
+The LOCALIO auxiliary protocol is needed to allow robust discovery of
+clients local to their servers. In a private implementation that
+preceded use of this LOCALIO protocol, a fragile sockaddr network
+address based match against all local network interfaces was attempted.
+But unlike the LOCALIO protocol, the sockaddr-based matching didn't
+handle use of iptables or containers.
+
+The robust handshake between local client and server is just the
+beginning, the ultimate use case this locality makes possible is the
+client is able to open files and issue reads, writes and commits
+directly to the server without having to go over the network. The
+requirement is to perform these loopback NFS operations as efficiently
+as possible, this is particularly useful for container use cases
+(e.g. kubernetes) where it is possible to run an IO job local to the
+server.
+
+The performance advantage realized from LOCALIO's ability to bypass
+using XDR and RPC for reads, writes and commits can be extreme, e.g.:
+
+fio for 20 secs with directio, qd of 8, 16 libaio threads:
+ - With LOCALIO:
+ 4K read: IOPS=979k, BW=3825MiB/s (4011MB/s)(74.7GiB/20002msec)
+ 4K write: IOPS=165k, BW=646MiB/s (678MB/s)(12.6GiB/20002msec)
+ 128K read: IOPS=402k, BW=49.1GiB/s (52.7GB/s)(982GiB/20002msec)
+ 128K write: IOPS=11.5k, BW=1433MiB/s (1503MB/s)(28.0GiB/20004msec)
+
+ - Without LOCALIO:
+ 4K read: IOPS=79.2k, BW=309MiB/s (324MB/s)(6188MiB/20003msec)
+ 4K write: IOPS=59.8k, BW=234MiB/s (245MB/s)(4671MiB/20002msec)
+ 128K read: IOPS=33.9k, BW=4234MiB/s (4440MB/s)(82.7GiB/20004msec)
+ 128K write: IOPS=11.5k, BW=1434MiB/s (1504MB/s)(28.0GiB/20011msec)
+
+fio for 20 secs with directio, qd of 8, 1 libaio thread:
+ - With LOCALIO:
+ 4K read: IOPS=230k, BW=898MiB/s (941MB/s)(17.5GiB/20001msec)
+ 4K write: IOPS=22.6k, BW=88.3MiB/s (92.6MB/s)(1766MiB/20001msec)
+ 128K read: IOPS=38.8k, BW=4855MiB/s (5091MB/s)(94.8GiB/20001msec)
+ 128K write: IOPS=11.4k, BW=1428MiB/s (1497MB/s)(27.9GiB/20001msec)
+
+ - Without LOCALIO:
+ 4K read: IOPS=77.1k, BW=301MiB/s (316MB/s)(6022MiB/20001msec)
+ 4K write: IOPS=32.8k, BW=128MiB/s (135MB/s)(2566MiB/20001msec)
+ 128K read: IOPS=24.4k, BW=3050MiB/s (3198MB/s)(59.6GiB/20001msec)
+ 128K write: IOPS=11.4k, BW=1430MiB/s (1500MB/s)(27.9GiB/20001msec)
+
+FAQ
+===
+
+1. What are the use cases for LOCALIO?
+
+ a. Workloads where the NFS client and server are on the same host
+ realize improved IO performance. In particular, it is common when
+ running containerised workloads for jobs to find themselves
+ running on the same host as the knfsd server being used for
+ storage.
+
+2. What are the requirements for LOCALIO?
+
+ a. Bypass use of the network RPC protocol as much as possible. This
+ includes bypassing XDR and RPC for open, read, write and commit
+ operations.
+ b. Allow client and server to autonomously discover if they are
+ running local to each other without making any assumptions about
+ the local network topology.
+ c. Support the use of containers by being compatible with relevant
+ namespaces (e.g. network, user, mount).
+ d. Support all versions of NFS. NFSv3 is of particular importance
+ because it has wide enterprise usage and pNFS flexfiles makes use
+ of it for the data path.
+
+3. Why doesn’t LOCALIO just compare IP addresses or hostnames when
+ deciding if the NFS client and server are co-located on the same
+ host?
+
+ Since one of the main use cases is containerised workloads, we cannot
+ assume that IP addresses will be shared between the client and
+ server. This sets up a requirement for a handshake protocol that
+ needs to go over the same connection as the NFS traffic in order to
+ identify that the client and the server really are running on the
+ same host. The handshake uses a secret that is sent over the wire,
+ and can be verified by both parties by comparing with a value stored
+ in shared kernel memory if they are truly co-located.
+
+4. Does LOCALIO improve pNFS flexfiles?
+
+ Yes, LOCALIO complements pNFS flexfiles by allowing it to take
+ advantage of NFS client and server locality. Policy that initiates
+ client IO as closely to the server where the data is stored naturally
+ benefits from the data path optimization LOCALIO provides.
+
+5. Why not develop a new pNFS layout to enable LOCALIO?
+
+ A new pNFS layout could be developed, but doing so would put the
+ onus on the server to somehow discover that the client is co-located
+ when deciding to hand out the layout.
+ There is value in a simpler approach (as provided by LOCALIO) that
+ allows the NFS client to negotiate and leverage locality without
+ requiring more elaborate modeling and discovery of such locality in a
+ more centralized manner.
+
+6. Why is having the client perform a server-side file OPEN, without
+ using RPC, beneficial? Is the benefit pNFS specific?
+
+ Avoiding the use of XDR and RPC for file opens is beneficial to
+ performance regardless of whether pNFS is used. Especially when
+ dealing with small files its best to avoid going over the wire
+ whenever possible, otherwise it could reduce or even negate the
+ benefits of avoiding the wire for doing the small file I/O itself.
+ Given LOCALIO's requirements the current approach of having the
+ client perform a server-side file open, without using RPC, is ideal.
+ If in the future requirements change then we can adapt accordingly.
+
+7. Why is LOCALIO only supported with UNIX Authentication (AUTH_UNIX)?
+
+ Strong authentication is usually tied to the connection itself. It
+ works by establishing a context that is cached by the server, and
+ that acts as the key for discovering the authorisation token, which
+ can then be passed to rpc.mountd to complete the authentication
+ process. On the other hand, in the case of AUTH_UNIX, the credential
+ that was passed over the wire is used directly as the key in the
+ upcall to rpc.mountd. This simplifies the authentication process, and
+ so makes AUTH_UNIX easier to support.
+
+8. How do export options that translate RPC user IDs behave for LOCALIO
+ operations (eg. root_squash, all_squash)?
+
+ Export options that translate user IDs are managed by nfsd_setuser()
+ which is called by nfsd_setuser_and_check_port() which is called by
+ __fh_verify(). So they get handled exactly the same way for LOCALIO
+ as they do for non-LOCALIO.
+
+9. How does LOCALIO make certain that object lifetimes are managed
+ properly given NFSD and NFS operate in different contexts?
+
+ See the detailed "NFS Client and Server Interlock" section below.
+
+RPC
+===
+
+The LOCALIO auxiliary RPC protocol consists of a single "UUID_IS_LOCAL"
+RPC method that allows the Linux NFS client to verify the local Linux
+NFS server can see the nonce (single-use UUID) the client generated and
+made available in nfs_common. This protocol isn't part of an IETF
+standard, nor does it need to be considering it is Linux-to-Linux
+auxiliary RPC protocol that amounts to an implementation detail.
+
+The UUID_IS_LOCAL method encodes the client generated uuid_t in terms of
+the fixed UUID_SIZE (16 bytes). The fixed size opaque encode and decode
+XDR methods are used instead of the less efficient variable sized
+methods.
+
+The RPC program number for the NFS_LOCALIO_PROGRAM is 400122 (as assigned
+by IANA, see https://www.iana.org/assignments/rpc-program-numbers/ ):
+Linux Kernel Organization 400122 nfslocalio
+
+The LOCALIO protocol spec in rpcgen syntax is::
+
+ /* raw RFC 9562 UUID */
+ #define UUID_SIZE 16
+ typedef u8 uuid_t<UUID_SIZE>;
+
+ program NFS_LOCALIO_PROGRAM {
+ version LOCALIO_V1 {
+ void
+ NULL(void) = 0;
+
+ void
+ UUID_IS_LOCAL(uuid_t) = 1;
+ } = 1;
+ } = 400122;
+
+LOCALIO uses the same transport connection as NFS traffic. As such,
+LOCALIO is not registered with rpcbind.
+
+NFS Common and Client/Server Handshake
+======================================
+
+fs/nfs_common/nfslocalio.c provides interfaces that enable an NFS client
+to generate a nonce (single-use UUID) and associated short-lived
+nfs_uuid_t struct, register it with nfs_common for subsequent lookup and
+verification by the NFS server and if matched the NFS server populates
+members in the nfs_uuid_t struct. The NFS client then uses nfs_common to
+transfer the nfs_uuid_t from its nfs_uuids to the nn->nfsd_serv
+clients_list from the nfs_common's uuids_list. See:
+fs/nfs/localio.c:nfs_local_probe()
+
+nfs_common's nfs_uuids list is the basis for LOCALIO enablement, as such
+it has members that point to nfsd memory for direct use by the client
+(e.g. 'net' is the server's network namespace, through it the client can
+access nn->nfsd_serv with proper rcu read access). It is this client
+and server synchronization that enables advanced usage and lifetime of
+objects to span from the host kernel's nfsd to per-container knfsd
+instances that are connected to nfs client's running on the same local
+host.
+
+NFS Client and Server Interlock
+===============================
+
+LOCALIO provides the nfs_uuid_t object and associated interfaces to
+allow proper network namespace (net-ns) and NFSD object refcounting.
+
+LOCALIO required the introduction and use of NFSD's percpu nfsd_net_ref
+to interlock nfsd_shutdown_net() and nfsd_open_local_fh(), to ensure
+each net-ns is not destroyed while in use by nfsd_open_local_fh(), and
+warrants a more detailed explanation:
+
+ nfsd_open_local_fh() uses nfsd_net_try_get() before opening its
+ nfsd_file handle and then the caller (NFS client) must drop the
+ reference for the nfsd_file and associated net-ns using
+ nfsd_file_put_local() once it has completed its IO.
+
+ This interlock working relies heavily on nfsd_open_local_fh() being
+ afforded the ability to safely deal with the possibility that the
+ NFSD's net-ns (and nfsd_net by association) may have been destroyed
+ by nfsd_destroy_serv() via nfsd_shutdown_net().
+
+This interlock of the NFS client and server has been verified to fix an
+easy to hit crash that would occur if an NFSD instance running in a
+container, with a LOCALIO client mounted, is shutdown. Upon restart of
+the container and associated NFSD, the client would go on to crash due
+to NULL pointer dereference that occurred due to the LOCALIO client's
+attempting to nfsd_open_local_fh() without having a proper reference on
+NFSD's net-ns.
+
+NFS Client issues IO instead of Server
+======================================
+
+Because LOCALIO is focused on protocol bypass to achieve improved IO
+performance, alternatives to the traditional NFS wire protocol (SUNRPC
+with XDR) must be provided to access the backing filesystem.
+
+See fs/nfs/localio.c:nfs_local_open_fh() and
+fs/nfsd/localio.c:nfsd_open_local_fh() for the interface that makes
+focused use of select nfs server objects to allow a client local to a
+server to open a file pointer without needing to go over the network.
+
+The client's fs/nfs/localio.c:nfs_local_open_fh() will call into the
+server's fs/nfsd/localio.c:nfsd_open_local_fh() and carefully access
+both the associated nfsd network namespace and nn->nfsd_serv in terms of
+RCU. If nfsd_open_local_fh() finds that the client no longer sees valid
+nfsd objects (be it struct net or nn->nfsd_serv) it returns -ENXIO
+to nfs_local_open_fh() and the client will try to reestablish the
+LOCALIO resources needed by calling nfs_local_probe() again. This
+recovery is needed if/when an nfsd instance running in a container were
+to reboot while a LOCALIO client is connected to it.
+
+Once the client has an open nfsd_file pointer it will issue reads,
+writes and commits directly to the underlying local filesystem (normally
+done by the nfs server). As such, for these operations, the NFS client
+is issuing IO to the underlying local filesystem that it is sharing with
+the NFS server. See: fs/nfs/localio.c:nfs_local_doio() and
+fs/nfs/localio.c:nfs_local_commit().
+
+With normal NFS that makes use of RPC to issue IO to the server, if an
+application uses O_DIRECT the NFS client will bypass the pagecache but
+the NFS server will not. The NFS server's use of buffered IO affords
+applications to be less precise with their alignment when issuing IO to
+the NFS client. But if all applications properly align their IO, LOCALIO
+can be configured to use end-to-end O_DIRECT semantics from the NFS
+client to the underlying local filesystem, that it is sharing with
+the NFS server, by setting the 'localio_O_DIRECT_semantics' nfs module
+parameter to Y, e.g.:
+
+ echo Y > /sys/module/nfs/parameters/localio_O_DIRECT_semantics
+
+Once enabled, it will cause LOCALIO to use end-to-end O_DIRECT semantics
+(but again, this may cause IO to fail if applications do not properly
+align their IO).
+
+Security
+========
+
+LOCALIO is only supported when UNIX-style authentication (AUTH_UNIX, aka
+AUTH_SYS) is used.
+
+Care is taken to ensure the same NFS security mechanisms are used
+(authentication, etc) regardless of whether LOCALIO or regular NFS
+access is used. The auth_domain established as part of the traditional
+NFS client access to the NFS server is also used for LOCALIO.
+
+Relative to containers, LOCALIO gives the client access to the network
+namespace the server has. This is required to allow the client to access
+the server's per-namespace nfsd_net struct. With traditional NFS, the
+client is afforded this same level of access (albeit in terms of the NFS
+protocol via SUNRPC). No other namespaces (user, mount, etc) have been
+altered or purposely extended from the server to the client.
+
+Module Parameters
+=================
+
+/sys/module/nfs/parameters/localio_enabled (bool)
+controls if LOCALIO is enabled, defaults to Y. If client and server are
+local but 'localio_enabled' is set to N then LOCALIO will not be used.
+
+/sys/module/nfs/parameters/localio_O_DIRECT_semantics (bool)
+controls if O_DIRECT extends down to the underlying filesystem, defaults
+to N. Application IO must be logical blocksize aligned, otherwise
+O_DIRECT will fail.
+
+/sys/module/nfsv3/parameters/nfs3_localio_probe_throttle (uint)
+controls if NFSv3 read and write IOs will trigger (re)enabling of
+LOCALIO every N (nfs3_localio_probe_throttle) IOs, defaults to 0
+(disabled). Must be power-of-2, admin keeps all the pieces if they
+misconfigure (too low a value or non-power-of-2).
+
+Testing
+=======
+
+The LOCALIO auxiliary protocol and associated NFS LOCALIO read, write
+and commit access have proven stable against various test scenarios:
+
+- Client and server both on the same host.
+
+- All permutations of client and server support enablement for both
+ local and remote client and server.
+
+- Testing against NFS storage products that don't support the LOCALIO
+ protocol was also performed.
+
+- Client on host, server within a container (for both v3 and v4.2).
+ The container testing was in terms of podman managed containers and
+ includes successful container stop/restart scenario.
+
+- Formalizing these test scenarios in terms of existing test
+ infrastructure is on-going. Initial regular coverage is provided in
+ terms of ktest running xfstests against a LOCALIO-enabled NFS loopback
+ mount configuration, and includes lockdep and KASAN coverage, see:
+ https://evilpiepirate.org/~testdashboard/ci?user=snitzer&branch=snitm-nfs-next
+ https://github.com/koverstreet/ktest
+
+- Various kdevops testing (in terms of "Chuck's BuildBot") has been
+ performed to regularly verify the LOCALIO changes haven't caused any
+ regressions to non-LOCALIO NFS use cases.
+
+- All of Hammerspace's various sanity tests pass with LOCALIO enabled
+ (this includes numerous pNFS and flexfiles tests).
diff --git a/Documentation/filesystems/nfs/reexport.rst b/Documentation/filesystems/nfs/reexport.rst
index ff9ae4a46530..044be965d75e 100644
--- a/Documentation/filesystems/nfs/reexport.rst
+++ b/Documentation/filesystems/nfs/reexport.rst
@@ -26,9 +26,13 @@ Reboot recovery
---------------
The NFS protocol's normal reboot recovery mechanisms don't work for the
-case when the reexport server reboots. Clients will lose any locks
-they held before the reboot, and further IO will result in errors.
-Closing and reopening files should clear the errors.
+case when the reexport server reboots because the source server has not
+rebooted, and so it is not in grace. Since the source server is not in
+grace, it cannot offer any guarantees that the file won't have been
+changed between the locks getting lost and any attempt to recover them.
+The same applies to delegations and any associated locks. Clients are
+not allowed to get file locks or delegations from a reexport server, any
+attempts will fail with operation not supported.
Filehandle limits
-----------------
diff --git a/Documentation/filesystems/overlayfs.rst b/Documentation/filesystems/overlayfs.rst
index 165514401441..2db379b4b31e 100644
--- a/Documentation/filesystems/overlayfs.rst
+++ b/Documentation/filesystems/overlayfs.rst
@@ -156,7 +156,7 @@ A directory is made opaque by setting the xattr "trusted.overlay.opaque"
to "y". Where the upper filesystem contains an opaque directory, any
directory in the lower filesystem with the same name is ignored.
-An opaque directory should not conntain any whiteouts, because they do not
+An opaque directory should not contain any whiteouts, because they do not
serve any purpose. A merge directory containing regular files with the xattr
"trusted.overlay.whiteout", should be additionally marked by setting the xattr
"trusted.overlay.opaque" to "x" on the merge directory itself.
@@ -266,7 +266,7 @@ Non-directories
Objects that are not directories (files, symlinks, device-special
files etc.) are presented either from the upper or lower filesystem as
appropriate. When a file in the lower filesystem is accessed in a way
-the requires write-access, such as opening for write access, changing
+that requires write-access, such as opening for write access, changing
some metadata etc., the file is first copied from the lower filesystem
to the upper filesystem (copy_up). Note that creating a hard-link
also requires copy_up, though of course creation of a symlink does
@@ -292,13 +292,27 @@ rename or unlink will of course be noticed and handled).
Permission model
----------------
+An overlay filesystem stashes credentials that will be used when
+accessing lower or upper filesystems.
+
+In the old mount api the credentials of the task calling mount(2) are
+stashed. In the new mount api the credentials of the task creating the
+superblock through FSCONFIG_CMD_CREATE command of fsconfig(2) are
+stashed.
+
+Starting with kernel v6.15 it is possible to use the "override_creds"
+mount option which will cause the credentials of the calling task to be
+recorded. Note that "override_creds" is only meaningful when used with
+the new mount api as the old mount api combines setting options and
+superblock creation in a single mount(2) syscall.
+
Permission checking in the overlay filesystem follows these principles:
1) permission check SHOULD return the same result before and after copy up
2) task creating the overlay mount MUST NOT gain additional privileges
- 3) non-mounting task MAY gain additional privileges through the overlay,
+ 3) task[*] MAY gain additional privileges through the overlay,
compared to direct access on underlying lower or upper filesystems
This is achieved by performing two permission checks on each access:
@@ -306,7 +320,7 @@ This is achieved by performing two permission checks on each access:
a) check if current task is allowed access based on local DAC (owner,
group, mode and posix acl), as well as MAC checks
- b) check if mounting task would be allowed real operation on lower or
+ b) check if stashed credentials would be allowed real operation on lower or
upper layer based on underlying filesystem permissions, again including
MAC checks
@@ -315,10 +329,10 @@ are copied up. On the other hand it can result in server enforced
permissions (used by NFS, for example) being ignored (3).
Check (b) ensures that no task gains permissions to underlying layers that
-the mounting task does not have (2). This also means that it is possible
+the stashed credentials do not have (2). This also means that it is possible
to create setups where the consistency rule (1) does not hold; normally,
-however, the mounting task will have sufficient privileges to perform all
-operations.
+however, the stashed credentials will have sufficient privileges to
+perform all operations.
Another way to demonstrate this model is drawing parallels between::
@@ -367,8 +381,11 @@ Metadata only copy up
When the "metacopy" feature is enabled, overlayfs will only copy
up metadata (as opposed to whole file), when a metadata specific operation
-like chown/chmod is performed. Full file will be copied up later when
-file is opened for WRITE operation.
+like chown/chmod is performed. An upper file in this state is marked with
+"trusted.overlayfs.metacopy" xattr which indicates that the upper file
+contains no data. The data will be copied up later when file is opened for
+WRITE operation. After the lower file's data is copied up,
+the "trusted.overlayfs.metacopy" xattr is removed from the upper file.
In other words, this is delayed data copy up operation and data is copied
up when there is a need to actually modify data.
@@ -437,6 +454,23 @@ For example::
fsconfig(fs_fd, FSCONFIG_SET_STRING, "datadir+", "/do2", 0);
+Specifying layers via file descriptors
+--------------------------------------
+
+Since kernel v6.13, overlayfs supports specifying layers via file descriptors in
+addition to specifying them as paths. This feature is available for the
+"datadir+", "lowerdir+", "upperdir", and "workdir+" mount options with the
+fsconfig syscall from the new mount api::
+
+ fsconfig(fs_fd, FSCONFIG_SET_FD, "lowerdir+", NULL, fd_lower1);
+ fsconfig(fs_fd, FSCONFIG_SET_FD, "lowerdir+", NULL, fd_lower2);
+ fsconfig(fs_fd, FSCONFIG_SET_FD, "lowerdir+", NULL, fd_lower3);
+ fsconfig(fs_fd, FSCONFIG_SET_FD, "datadir+", NULL, fd_data1);
+ fsconfig(fs_fd, FSCONFIG_SET_FD, "datadir+", NULL, fd_data2);
+ fsconfig(fs_fd, FSCONFIG_SET_FD, "workdir", NULL, fd_work);
+ fsconfig(fs_fd, FSCONFIG_SET_FD, "upperdir", NULL, fd_upper);
+
+
fs-verity support
-----------------
@@ -529,8 +563,8 @@ Nesting overlayfs mounts
It is possible to use a lower directory that is stored on an overlayfs
mount. For regular files this does not need any special care. However, files
-that have overlayfs attributes, such as whiteouts or "overlay.*" xattrs will be
-interpreted by the underlying overlayfs mount and stripped out. In order to
+that have overlayfs attributes, such as whiteouts or "overlay.*" xattrs, will
+be interpreted by the underlying overlayfs mount and stripped out. In order to
allow the second overlayfs mount to see the attributes they must be escaped.
Overlayfs specific xattrs are escaped by using a special prefix of
diff --git a/Documentation/filesystems/path-lookup.rst b/Documentation/filesystems/path-lookup.rst
index 2b2df6aa5432..9ced1135608e 100644
--- a/Documentation/filesystems/path-lookup.rst
+++ b/Documentation/filesystems/path-lookup.rst
@@ -531,7 +531,7 @@ this retry process in the next article.
Automount points are locations in the filesystem where an attempt to
lookup a name can trigger changes to how that lookup should be
handled, in particular by mounting a filesystem there. These are
-covered in greater detail in autofs.txt in the Linux documentation
+covered in greater detail in autofs.rst in the Linux documentation
tree, but a few notes specifically related to path lookup are in order
here.
diff --git a/Documentation/filesystems/path-lookup.txt b/Documentation/filesystems/path-lookup.txt
index 1aa7ce099f6f..d2cf2852e1f8 100644
--- a/Documentation/filesystems/path-lookup.txt
+++ b/Documentation/filesystems/path-lookup.txt
@@ -379,4 +379,4 @@ Papers and other documentation on dcache locking
2. http://lse.sourceforge.net/locking/dcache/dcache.html
-3. path-lookup.md in this directory.
+3. path-lookup.rst in this directory.
diff --git a/Documentation/filesystems/porting.rst b/Documentation/filesystems/porting.rst
index 1be76ef117b3..3616d7161dab 100644
--- a/Documentation/filesystems/porting.rst
+++ b/Documentation/filesystems/porting.rst
@@ -177,7 +177,7 @@ settles down a bit.
**mandatory**
s_export_op is now required for exporting a filesystem.
-isofs, ext2, ext3, reiserfs, fat
+isofs, ext2, ext3, fat
can be used as examples of very different filesystems.
---
@@ -313,7 +313,7 @@ done.
**mandatory**
-block truncatation on error exit from ->write_begin, and ->direct_IO
+block truncation on error exit from ->write_begin, and ->direct_IO
moved from generic methods (block_write_begin, cont_write_begin,
nobh_write_begin, blockdev_direct_IO*) to callers. Take a look at
ext2_write_failed and callers for an example.
@@ -858,7 +858,7 @@ be misspelled d_alloc_anon().
**mandatory**
-[should've been added in 2016] stale comment in finish_open() nonwithstanding,
+[should've been added in 2016] stale comment in finish_open() notwithstanding,
failure exits in ->atomic_open() instances should *NOT* fput() the file,
no matter what. Everything is handled by the caller.
@@ -989,7 +989,7 @@ This mechanism would only work for a single device so the block layer couldn't
find the owning superblock of any additional devices.
In the old mechanism reusing or creating a superblock for a racing mount(2) and
-umount(2) relied on the file_system_type as the holder. This was severly
+umount(2) relied on the file_system_type as the holder. This was severely
underdocumented however:
(1) Any concurrent mounter that managed to grab an active reference on an
@@ -1134,3 +1134,118 @@ superblock of the main block device, i.e., the one stored in sb->s_bdev. Block
device freezing now works for any block device owned by a given superblock, not
just the main block device. The get_active_super() helper and bd_fsfreeze_sb
pointer are gone.
+
+---
+
+**mandatory**
+
+set_blocksize() takes opened struct file instead of struct block_device now
+and it *must* be opened exclusive.
+
+---
+
+**mandatory**
+
+->d_revalidate() gets two extra arguments - inode of parent directory and
+name our dentry is expected to have. Both are stable (dir is pinned in
+non-RCU case and will stay around during the call in RCU case, and name
+is guaranteed to stay unchanging). Your instance doesn't have to use
+either, but it often helps to avoid a lot of painful boilerplate.
+Note that while name->name is stable and NUL-terminated, it may (and
+often will) have name->name[name->len] equal to '/' rather than '\0' -
+in normal case it points into the pathname being looked up.
+NOTE: if you need something like full path from the root of filesystem,
+you are still on your own - this assists with simple cases, but it's not
+magic.
+
+---
+
+**recommended**
+
+kern_path_locked() and user_path_locked() no longer return a negative
+dentry so this doesn't need to be checked. If the name cannot be found,
+ERR_PTR(-ENOENT) is returned.
+
+---
+
+**recommended**
+
+lookup_one_qstr_excl() is changed to return errors in more cases, so
+these conditions don't require explicit checks:
+
+ - if LOOKUP_CREATE is NOT given, then the dentry won't be negative,
+ ERR_PTR(-ENOENT) is returned instead
+ - if LOOKUP_EXCL IS given, then the dentry won't be positive,
+ ERR_PTR(-EEXIST) is rreturned instread
+
+LOOKUP_EXCL now means "target must not exist". It can be combined with
+LOOK_CREATE or LOOKUP_RENAME_TARGET.
+
+---
+
+**mandatory**
+invalidate_inodes() is gone use evict_inodes() instead.
+
+---
+
+**mandatory**
+
+->mkdir() now returns a dentry. If the created inode is found to
+already be in cache and have a dentry (often IS_ROOT()), it will need to
+be spliced into the given name in place of the given dentry. That dentry
+now needs to be returned. If the original dentry is used, NULL should
+be returned. Any error should be returned with ERR_PTR().
+
+In general, filesystems which use d_instantiate_new() to install the new
+inode can safely return NULL. Filesystems which may not have an I_NEW inode
+should use d_drop();d_splice_alias() and return the result of the latter.
+
+If a positive dentry cannot be returned for some reason, in-kernel
+clients such as cachefiles, nfsd, smb/server may not perform ideally but
+will fail-safe.
+
+---
+
+** mandatory**
+
+lookup_one(), lookup_one_unlocked(), lookup_one_positive_unlocked() now
+take a qstr instead of a name and len. These, not the "one_len"
+versions, should be used whenever accessing a filesystem from outside
+that filesysmtem, through a mount point - which will have a mnt_idmap.
+
+---
+
+** mandatory**
+
+Functions try_lookup_one_len(), lookup_one_len(),
+lookup_one_len_unlocked() and lookup_positive_unlocked() have been
+renamed to try_lookup_noperm(), lookup_noperm(),
+lookup_noperm_unlocked(), lookup_noperm_positive_unlocked(). They now
+take a qstr instead of separate name and length. QSTR() can be used
+when strlen() is needed for the length.
+
+For try_lookup_noperm() a reference to the qstr is passed in case the
+hash might subsequently be needed.
+
+These function no longer do any permission checking - they previously
+checked that the caller has 'X' permission on the parent. They must
+ONLY be used internally by a filesystem on itself when it knows that
+permissions are irrelevant or in a context where permission checks have
+already been performed such as after vfs_path_parent_lookup()
+
+---
+
+** mandatory**
+
+d_hash_and_lookup() is no longer exported or available outside the VFS.
+Use try_lookup_noperm() instead. This adds name validation and takes
+arguments in the opposite order but is otherwise identical.
+
+Using try_lookup_noperm() will require linux/namei.h to be included.
+
+---
+
+**mandatory**
+
+Calling conventions for ->d_automount() have changed; we should *not* grab
+an extra reference to new mount - it should be returned with refcount 1.
diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index c6a6b9df2104..2a17865dfe39 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -48,6 +48,7 @@ fixes/update part 1.1 Stefani Seibold <stefani@seibold.net> June 9 2009
3.11 /proc/<pid>/patch_state - Livepatch patch operation state
3.12 /proc/<pid>/arch_status - Task architecture specific information
3.13 /proc/<pid>/fd - List of symlinks to open files
+ 3.14 /proc/<pid/ksm_stat - Information about the process's ksm status.
4 Configuring procfs
4.1 Mount options
@@ -127,6 +128,16 @@ process running on the system, which is named after the process ID (PID).
The link 'self' points to the process reading the file system. Each process
subdirectory has the entries listed in Table 1-1.
+A process can read its own information from /proc/PID/* with no extra
+permissions. When reading /proc/PID/* information for other processes, reading
+process is required to have either CAP_SYS_PTRACE capability with
+PTRACE_MODE_READ access permissions, or, alternatively, CAP_PERFMON
+capability. This applies to all read-only information like `maps`, `environ`,
+`pagemap`, etc. The only exception is `mem` file due to its read-write nature,
+which requires CAP_SYS_PTRACE capabilities with more elevated
+PTRACE_MODE_ATTACH permissions; CAP_PERFMON capability does not grant access
+to /proc/PID/mem for other processes.
+
Note that an open file descriptor to /proc/<pid> or to any of its
contained files or subdirectories does not prevent <pid> being reused
for some other process in the event that <pid> exits. Operations on
@@ -443,6 +454,15 @@ is not associated with a file:
or if empty, the mapping is anonymous.
+Starting with 6.11 kernel, /proc/PID/maps provides an alternative
+ioctl()-based API that gives ability to flexibly and efficiently query and
+filter individual VMAs. This interface is binary and is meant for more
+efficient and easy programmatic use. `struct procmap_query`, defined in
+linux/fs.h UAPI header, serves as an input/output argument to the
+`PROCMAP_QUERY` ioctl() command. See comments in linus/fs.h UAPI header for
+details on query semantics, supported flags, data returned, and general API
+usage information.
+
The /proc/PID/smaps is an extension based on maps, showing the memory
consumption for each of the process's mappings. For each mapping (aka Virtual
Memory Area, or VMA) there is a series of lines such as the following::
@@ -475,14 +495,15 @@ Memory Area, or VMA) there is a series of lines such as the following::
THPeligible: 0
VmFlags: rd ex mr mw me dw
-The first of these lines shows the same information as is displayed for the
-mapping in /proc/PID/maps. Following lines show the size of the mapping
-(size); the size of each page allocated when backing a VMA (KernelPageSize),
-which is usually the same as the size in the page table entries; the page size
-used by the MMU when backing a VMA (in most cases, the same as KernelPageSize);
-the amount of the mapping that is currently resident in RAM (RSS); the
-process' proportional share of this mapping (PSS); and the number of clean and
-dirty shared and private pages in the mapping.
+The first of these lines shows the same information as is displayed for
+the mapping in /proc/PID/maps. Following lines show the size of the
+mapping (size); the size of each page allocated when backing a VMA
+(KernelPageSize), which is usually the same as the size in the page table
+entries; the page size used by the MMU when backing a VMA (in most cases,
+the same as KernelPageSize); the amount of the mapping that is currently
+resident in RAM (RSS); the process's proportional share of this mapping
+(PSS); and the number of clean and dirty shared and private pages in the
+mapping.
The "proportional set size" (PSS) of a process is the count of pages it has
in memory, where each page is divided by the number of processes sharing it.
@@ -491,9 +512,25 @@ process, its PSS will be 1500. "Pss_Dirty" is the portion of PSS which
consists of dirty pages. ("Pss_Clean" is not included, but it can be
calculated by subtracting "Pss_Dirty" from "Pss".)
-Note that even a page which is part of a MAP_SHARED mapping, but has only
-a single pte mapped, i.e. is currently used by only one process, is accounted
-as private and not as shared.
+Traditionally, a page is accounted as "private" if it is mapped exactly once,
+and a page is accounted as "shared" when mapped multiple times, even when
+mapped in the same process multiple times. Note that this accounting is
+independent of MAP_SHARED.
+
+In some kernel configurations, the semantics of pages part of a larger
+allocation (e.g., THP) can differ: a page is accounted as "private" if all
+pages part of the corresponding large allocation are *certainly* mapped in the
+same process, even if the page is mapped multiple times in that process. A
+page is accounted as "shared" if any page page of the larger allocation
+is *maybe* mapped in a different process. In some cases, a large allocation
+might be treated as "maybe mapped by multiple processes" even though this
+is no longer the case.
+
+Some kernel configurations do not track the precise number of times a page part
+of a larger allocation is mapped. In this case, when calculating the PSS, the
+average number of mappings per page in this larger allocation might be used
+as an approximation for the number of mappings of a page. The PSS calculation
+will be imprecise in this case.
"Referenced" indicates the amount of memory currently marked as referenced or
accessed.
@@ -570,7 +607,8 @@ encoded manner. The codes are the following:
mt arm64 MTE allocation tags are enabled
um userfaultfd missing tracking
uw userfaultfd wr-protect tracking
- ss shadow stack page
+ ss shadow/guarded control stack page
+ sl sealed
== =======================================
Note that there is no guarantee that every flag and associated mnemonic will
@@ -674,6 +712,11 @@ Where:
node locality page counters (N0 == node0, N1 == node1, ...) and the kernel page
size, in KB, that is backing the mapping up.
+Note that some kernel configurations do not track the precise number of times
+a page part of a larger allocation (e.g., THP) is mapped. In these
+configurations, "mapmax" might corresponds to the average number of mappings
+per page in such a larger allocation instead.
+
1.2 Kernel data
---------------
@@ -688,6 +731,7 @@ files are there, and which are missing.
============ ===============================================================
File Content
============ ===============================================================
+ allocinfo Memory allocations profiling information
apm Advanced power management info
bootconfig Kernel command line obtained from boot config,
and, if there were kernel parameters from the
@@ -953,6 +997,35 @@ also be allocatable although a lot of filesystem metadata may have to be
reclaimed to achieve this.
+allocinfo
+~~~~~~~~~
+
+Provides information about memory allocations at all locations in the code
+base. Each allocation in the code is identified by its source file, line
+number, module (if originates from a loadable module) and the function calling
+the allocation. The number of bytes allocated and number of calls at each
+location are reported. The first line indicates the version of the file, the
+second line is the header listing fields in the file.
+
+Example output.
+
+::
+
+ > tail -n +3 /proc/allocinfo | sort -rn
+ 127664128 31168 mm/page_ext.c:270 func:alloc_page_ext
+ 56373248 4737 mm/slub.c:2259 func:alloc_slab_page
+ 14880768 3633 mm/readahead.c:247 func:page_cache_ra_unbounded
+ 14417920 3520 mm/mm_init.c:2530 func:alloc_large_system_hash
+ 13377536 234 block/blk-mq.c:3421 func:blk_mq_alloc_rqs
+ 11718656 2861 mm/filemap.c:1919 func:__filemap_get_folio
+ 9192960 2800 kernel/fork.c:307 func:alloc_thread_stack_node
+ 4206592 4 net/netfilter/nf_conntrack_core.c:2567 func:nf_ct_alloc_hashtable
+ 4136960 1010 drivers/staging/ctagmod/ctagmod.c:20 [ctagmod] func:ctagmod_start
+ 3940352 962 mm/memory.c:4214 func:alloc_anon_folio
+ 2894464 22613 fs/kernfs/dir.c:615 func:__kernfs_new_node
+ ...
+
+
meminfo
~~~~~~~
@@ -1018,6 +1091,8 @@ Example output. You may not have all of these fields.
FilePmdMapped: 0 kB
CmaTotal: 0 kB
CmaFree: 0 kB
+ Unaccepted: 0 kB
+ Balloon: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
@@ -1090,9 +1165,15 @@ Dirty
Writeback
Memory which is actively being written back to the disk
AnonPages
- Non-file backed pages mapped into userspace page tables
+ Non-file backed pages mapped into userspace page tables. Note that
+ some kernel configurations might consider all pages part of a
+ larger allocation (e.g., THP) as "mapped", as soon as a single
+ page is mapped.
Mapped
- files which have been mmapped, such as libraries
+ files which have been mmapped, such as libraries. Note that some
+ kernel configurations might consider all pages part of a larger
+ allocation (e.g., THP) as "mapped", as soon as a single page is
+ mapped.
Shmem
Total memory used by shared memory (shmem) and tmpfs
KReclaimable
@@ -1110,8 +1191,8 @@ KernelStack
PageTables
Memory consumed by userspace page tables
SecPageTables
- Memory consumed by secondary page tables, this currently
- currently includes KVM mmu allocations on x86 and arm64.
+ Memory consumed by secondary page tables, this currently includes
+ KVM mmu and IOMMU allocations on x86 and arm64.
NFS_Unstable
Always zero. Previous counted pages which had been written to
the server, but has not been committed to stable storage.
@@ -1186,6 +1267,10 @@ CmaTotal
Memory reserved for the Contiguous Memory Allocator (CMA)
CmaFree
Free remaining memory in the CMA reserves
+Unaccepted
+ Memory that has not been accepted by the guest
+Balloon
+ Memory returned to Host by VM Balloon Drivers
HugePages_Total, HugePages_Free, HugePages_Rsvd, HugePages_Surp, Hugepagesize, Hugetlb
See Documentation/admin-guide/mm/hugetlbpage.rst.
DirectMap4k, DirectMap2M, DirectMap1G
@@ -2192,6 +2277,74 @@ The number of open files for the process is stored in 'size' member
of stat() output for /proc/<pid>/fd for fast access.
-------------------------------------------------------
+3.14 /proc/<pid/ksm_stat - Information about the process's ksm status
+---------------------------------------------------------------------
+When CONFIG_KSM is enabled, each process has this file which displays
+the information of ksm merging status.
+
+Example
+~~~~~~~
+
+::
+
+ / # cat /proc/self/ksm_stat
+ ksm_rmap_items 0
+ ksm_zero_pages 0
+ ksm_merging_pages 0
+ ksm_process_profit 0
+ ksm_merge_any: no
+ ksm_mergeable: no
+
+Description
+~~~~~~~~~~~
+
+ksm_rmap_items
+^^^^^^^^^^^^^^
+
+The number of ksm_rmap_item structures in use. The structure
+ksm_rmap_item stores the reverse mapping information for virtual
+addresses. KSM will generate a ksm_rmap_item for each ksm-scanned page of
+the process.
+
+ksm_zero_pages
+^^^^^^^^^^^^^^
+
+When /sys/kernel/mm/ksm/use_zero_pages is enabled, it represent how many
+empty pages are merged with kernel zero pages by KSM.
+
+ksm_merging_pages
+^^^^^^^^^^^^^^^^^
+
+It represents how many pages of this process are involved in KSM merging
+(not including ksm_zero_pages). It is the same with what
+/proc/<pid>/ksm_merging_pages shows.
+
+ksm_process_profit
+^^^^^^^^^^^^^^^^^^
+
+The profit that KSM brings (Saved bytes). KSM can save memory by merging
+identical pages, but also can consume additional memory, because it needs
+to generate a number of rmap_items to save each scanned page's brief rmap
+information. Some of these pages may be merged, but some may not be abled
+to be merged after being checked several times, which are unprofitable
+memory consumed.
+
+ksm_merge_any
+^^^^^^^^^^^^^
+
+It specifies whether the process's 'mm is added by prctl() into the
+candidate list of KSM or not, and if KSM scanning is fully enabled at
+process level.
+
+ksm_mergeable
+^^^^^^^^^^^^^
+
+It specifies whether any VMAs of the process''s mms are currently
+applicable to KSM.
+
+More information about KSM can be found in
+Documentation/admin-guide/mm/ksm.rst.
+
Chapter 4: Configuring procfs
=============================
@@ -2221,7 +2374,7 @@ arguments are now protected against local eavesdroppers.
hidepid=invisible or hidepid=2 means hidepid=1 plus all /proc/<pid>/ will be
fully invisible to other users. It doesn't mean that it hides a fact whether a
process with a specific pid value exists (it can be learned by other means, e.g.
-by "kill -0 $PID"), but it hides process' uid and gid, which may be learned by
+by "kill -0 $PID"), but it hides process's uid and gid, which may be learned by
stat()'ing /proc/<pid>/ otherwise. It greatly complicates an intruder's task of
gathering information about running processes, whether some daemon runs with
elevated privileges, whether other user runs some sensitive program, whether
diff --git a/Documentation/filesystems/ramfs-rootfs-initramfs.rst b/Documentation/filesystems/ramfs-rootfs-initramfs.rst
index 447f767c6462..fa4f81099cb4 100644
--- a/Documentation/filesystems/ramfs-rootfs-initramfs.rst
+++ b/Documentation/filesystems/ramfs-rootfs-initramfs.rst
@@ -315,7 +315,7 @@ the above threads) is:
2) The cpio archive format chosen by the kernel is simpler and cleaner (and
thus easier to create and parse) than any of the (literally dozens of)
various tar archive formats. The complete initramfs archive format is
- explained in buffer-format.txt, created in usr/gen_init_cpio.c, and
+ explained in buffer-format.rst, created in usr/gen_init_cpio.c, and
extracted in init/initramfs.c. All three together come to less than 26k
total of human-readable text.
diff --git a/Documentation/filesystems/relay.rst b/Documentation/filesystems/relay.rst
index 04ad083cfe62..301ff4c6e6c6 100644
--- a/Documentation/filesystems/relay.rst
+++ b/Documentation/filesystems/relay.rst
@@ -32,7 +32,7 @@ functions in the relay interface code - please see that for details.
Semantics
=========
-Each relay channel has one buffer per CPU, each buffer has one or more
+Each relay channel has one buffer per CPU; each buffer has one or more
sub-buffers. Messages are written to the first sub-buffer until it is
too full to contain a new message, in which case it is written to
the next (if available). Messages are never split across sub-buffers.
@@ -40,7 +40,7 @@ At this point, userspace can be notified so it empties the first
sub-buffer, while the kernel continues writing to the next.
When notified that a sub-buffer is full, the kernel knows how many
-bytes of it are padding i.e. unused space occurring because a complete
+bytes of it are padding, i.e., unused space occurring because a complete
message couldn't fit into a sub-buffer. Userspace can use this
knowledge to copy only valid data.
@@ -71,7 +71,7 @@ klog and relay-apps example code
================================
The relay interface itself is ready to use, but to make things easier,
-a couple simple utility functions and a set of examples are provided.
+a couple of simple utility functions and a set of examples are provided.
The relay-apps example tarball, available on the relay sourceforge
site, contains a set of self-contained examples, each consisting of a
@@ -91,7 +91,7 @@ registered will data actually be logged (see the klog and kleak
examples for details).
It is of course possible to use the relay interface from scratch,
-i.e. without using any of the relay-apps example code or klog, but
+i.e., without using any of the relay-apps example code or klog, but
you'll have to implement communication between userspace and kernel,
allowing both to convey the state of buffers (full, empty, amount of
padding). The read() interface both removes padding and internally
@@ -119,7 +119,7 @@ mmap() results in channel buffer being mapped into the caller's
must map the entire file, which is NRBUF * SUBBUFSIZE.
read() read the contents of a channel buffer. The bytes read are
- 'consumed' by the reader, i.e. they won't be available
+ 'consumed' by the reader, i.e., they won't be available
again to subsequent reads. If the channel is being used
in no-overwrite mode (the default), it can be read at any
time even if there's an active kernel writer. If the
@@ -138,7 +138,7 @@ poll() POLLIN/POLLRDNORM/POLLERR supported. User applications are
notified when sub-buffer boundaries are crossed.
close() decrements the channel buffer's refcount. When the refcount
- reaches 0, i.e. when no process or kernel client has the
+ reaches 0, i.e., when no process or kernel client has the
buffer open, the channel buffer is freed.
=========== ============================================================
@@ -149,7 +149,7 @@ host filesystem must be mounted. For example::
.. Note::
- the host filesystem doesn't need to be mounted for kernel
+ The host filesystem doesn't need to be mounted for kernel
clients to create or use channels - it only needs to be
mounted when user space applications need access to the buffer
data.
@@ -301,16 +301,6 @@ user-defined data with a channel, and is immediately available
(including in create_buf_file()) via chan->private_data or
buf->chan->private_data.
-Buffer-only channels
---------------------
-
-These channels have no files associated and can be created with
-relay_open(NULL, NULL, ...). Such channels are useful in scenarios such
-as when doing early tracing in the kernel, before the VFS is up. In these
-cases, one may open a buffer-only channel and then call
-relay_late_setup_files() when the kernel is ready to handle files,
-to expose the buffered data to the userspace.
-
Channel 'modes'
---------------
@@ -325,7 +315,7 @@ section, as it pertains mainly to mmap() implementations.
In 'overwrite' mode, also known as 'flight recorder' mode, writes
continuously cycle around the buffer and will never fail, but will
unconditionally overwrite old data regardless of whether it's actually
-been consumed. In no-overwrite mode, writes will fail, i.e. data will
+been consumed. In no-overwrite mode, writes will fail, i.e., data will
be lost, if the number of unconsumed sub-buffers equals the total
number of sub-buffers in the channel. It should be clear that if
there is no consumer or if the consumer can't consume sub-buffers fast
@@ -344,7 +334,7 @@ initialize the next sub-buffer if appropriate 2) finalize the previous
sub-buffer if appropriate and 3) return a boolean value indicating
whether or not to actually move on to the next sub-buffer.
-To implement 'no-overwrite' mode, the userspace client would provide
+To implement 'no-overwrite' mode, the userspace client provides
an implementation of the subbuf_start() callback something like the
following::
@@ -364,9 +354,9 @@ following::
return 1;
}
-If the current buffer is full, i.e. all sub-buffers remain unconsumed,
+If the current buffer is full, i.e., all sub-buffers remain unconsumed,
the callback returns 0 to indicate that the buffer switch should not
-occur yet, i.e. until the consumer has had a chance to read the
+occur yet, i.e., until the consumer has had a chance to read the
current set of ready sub-buffers. For the relay_buf_full() function
to make sense, the consumer is responsible for notifying the relay
interface when sub-buffers have been consumed via
@@ -400,7 +390,7 @@ consulted.
The default subbuf_start() implementation, used if the client doesn't
define any callbacks, or doesn't define the subbuf_start() callback,
-implements the simplest possible 'no-overwrite' mode, i.e. it does
+implements the simplest possible 'no-overwrite' mode, i.e., it does
nothing but return 0.
Header information can be reserved at the beginning of each sub-buffer
@@ -467,7 +457,7 @@ rather than open and close a new channel for each use. relay_reset()
can be used for this purpose - it resets a channel to its initial
state without reallocating channel buffer memory or destroying
existing mappings. It should however only be called when it's safe to
-do so, i.e. when the channel isn't currently being written to.
+do so, i.e., when the channel isn't currently being written to.
Finally, there are a couple of utility callbacks that can be used for
different purposes. buf_mapped() is called whenever a channel buffer
diff --git a/Documentation/filesystems/resctrl.rst b/Documentation/filesystems/resctrl.rst
new file mode 100644
index 000000000000..c7949dd44f2f
--- /dev/null
+++ b/Documentation/filesystems/resctrl.rst
@@ -0,0 +1,1523 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: <isonum.txt>
+
+=====================================================
+User Interface for Resource Control feature (resctrl)
+=====================================================
+
+:Copyright: |copy| 2016 Intel Corporation
+:Authors: - Fenghua Yu <fenghua.yu@intel.com>
+ - Tony Luck <tony.luck@intel.com>
+ - Vikas Shivappa <vikas.shivappa@intel.com>
+
+
+Intel refers to this feature as Intel Resource Director Technology(Intel(R) RDT).
+AMD refers to this feature as AMD Platform Quality of Service(AMD QoS).
+
+This feature is enabled by the CONFIG_X86_CPU_RESCTRL and the x86 /proc/cpuinfo
+flag bits:
+
+=============================================== ================================
+RDT (Resource Director Technology) Allocation "rdt_a"
+CAT (Cache Allocation Technology) "cat_l3", "cat_l2"
+CDP (Code and Data Prioritization) "cdp_l3", "cdp_l2"
+CQM (Cache QoS Monitoring) "cqm_llc", "cqm_occup_llc"
+MBM (Memory Bandwidth Monitoring) "cqm_mbm_total", "cqm_mbm_local"
+MBA (Memory Bandwidth Allocation) "mba"
+SMBA (Slow Memory Bandwidth Allocation) ""
+BMEC (Bandwidth Monitoring Event Configuration) ""
+=============================================== ================================
+
+Historically, new features were made visible by default in /proc/cpuinfo. This
+resulted in the feature flags becoming hard to parse by humans. Adding a new
+flag to /proc/cpuinfo should be avoided if user space can obtain information
+about the feature from resctrl's info directory.
+
+To use the feature mount the file system::
+
+ # mount -t resctrl resctrl [-o cdp[,cdpl2][,mba_MBps][,debug]] /sys/fs/resctrl
+
+mount options are:
+
+"cdp":
+ Enable code/data prioritization in L3 cache allocations.
+"cdpl2":
+ Enable code/data prioritization in L2 cache allocations.
+"mba_MBps":
+ Enable the MBA Software Controller(mba_sc) to specify MBA
+ bandwidth in MiBps
+"debug":
+ Make debug files accessible. Available debug files are annotated with
+ "Available only with debug option".
+
+L2 and L3 CDP are controlled separately.
+
+RDT features are orthogonal. A particular system may support only
+monitoring, only control, or both monitoring and control. Cache
+pseudo-locking is a unique way of using cache control to "pin" or
+"lock" data in the cache. Details can be found in
+"Cache Pseudo-Locking".
+
+
+The mount succeeds if either of allocation or monitoring is present, but
+only those files and directories supported by the system will be created.
+For more details on the behavior of the interface during monitoring
+and allocation, see the "Resource alloc and monitor groups" section.
+
+Info directory
+==============
+
+The 'info' directory contains information about the enabled
+resources. Each resource has its own subdirectory. The subdirectory
+names reflect the resource names.
+
+Each subdirectory contains the following files with respect to
+allocation:
+
+Cache resource(L3/L2) subdirectory contains the following files
+related to allocation:
+
+"num_closids":
+ The number of CLOSIDs which are valid for this
+ resource. The kernel uses the smallest number of
+ CLOSIDs of all enabled resources as limit.
+"cbm_mask":
+ The bitmask which is valid for this resource.
+ This mask is equivalent to 100%.
+"min_cbm_bits":
+ The minimum number of consecutive bits which
+ must be set when writing a mask.
+
+"shareable_bits":
+ Bitmask of shareable resource with other executing
+ entities (e.g. I/O). User can use this when
+ setting up exclusive cache partitions. Note that
+ some platforms support devices that have their
+ own settings for cache use which can over-ride
+ these bits.
+"bit_usage":
+ Annotated capacity bitmasks showing how all
+ instances of the resource are used. The legend is:
+
+ "0":
+ Corresponding region is unused. When the system's
+ resources have been allocated and a "0" is found
+ in "bit_usage" it is a sign that resources are
+ wasted.
+
+ "H":
+ Corresponding region is used by hardware only
+ but available for software use. If a resource
+ has bits set in "shareable_bits" but not all
+ of these bits appear in the resource groups'
+ schematas then the bits appearing in
+ "shareable_bits" but no resource group will
+ be marked as "H".
+ "X":
+ Corresponding region is available for sharing and
+ used by hardware and software. These are the
+ bits that appear in "shareable_bits" as
+ well as a resource group's allocation.
+ "S":
+ Corresponding region is used by software
+ and available for sharing.
+ "E":
+ Corresponding region is used exclusively by
+ one resource group. No sharing allowed.
+ "P":
+ Corresponding region is pseudo-locked. No
+ sharing allowed.
+"sparse_masks":
+ Indicates if non-contiguous 1s value in CBM is supported.
+
+ "0":
+ Only contiguous 1s value in CBM is supported.
+ "1":
+ Non-contiguous 1s value in CBM is supported.
+
+Memory bandwidth(MB) subdirectory contains the following files
+with respect to allocation:
+
+"min_bandwidth":
+ The minimum memory bandwidth percentage which
+ user can request.
+
+"bandwidth_gran":
+ The granularity in which the memory bandwidth
+ percentage is allocated. The allocated
+ b/w percentage is rounded off to the next
+ control step available on the hardware. The
+ available bandwidth control steps are:
+ min_bandwidth + N * bandwidth_gran.
+
+"delay_linear":
+ Indicates if the delay scale is linear or
+ non-linear. This field is purely informational
+ only.
+
+"thread_throttle_mode":
+ Indicator on Intel systems of how tasks running on threads
+ of a physical core are throttled in cases where they
+ request different memory bandwidth percentages:
+
+ "max":
+ the smallest percentage is applied
+ to all threads
+ "per-thread":
+ bandwidth percentages are directly applied to
+ the threads running on the core
+
+If RDT monitoring is available there will be an "L3_MON" directory
+with the following files:
+
+"num_rmids":
+ The number of RMIDs available. This is the
+ upper bound for how many "CTRL_MON" + "MON"
+ groups can be created.
+
+"mon_features":
+ Lists the monitoring events if
+ monitoring is enabled for the resource.
+ Example::
+
+ # cat /sys/fs/resctrl/info/L3_MON/mon_features
+ llc_occupancy
+ mbm_total_bytes
+ mbm_local_bytes
+
+ If the system supports Bandwidth Monitoring Event
+ Configuration (BMEC), then the bandwidth events will
+ be configurable. The output will be::
+
+ # cat /sys/fs/resctrl/info/L3_MON/mon_features
+ llc_occupancy
+ mbm_total_bytes
+ mbm_total_bytes_config
+ mbm_local_bytes
+ mbm_local_bytes_config
+
+"mbm_total_bytes_config", "mbm_local_bytes_config":
+ Read/write files containing the configuration for the mbm_total_bytes
+ and mbm_local_bytes events, respectively, when the Bandwidth
+ Monitoring Event Configuration (BMEC) feature is supported.
+ The event configuration settings are domain specific and affect
+ all the CPUs in the domain. When either event configuration is
+ changed, the bandwidth counters for all RMIDs of both events
+ (mbm_total_bytes as well as mbm_local_bytes) are cleared for that
+ domain. The next read for every RMID will report "Unavailable"
+ and subsequent reads will report the valid value.
+
+ Following are the types of events supported:
+
+ ==== ========================================================
+ Bits Description
+ ==== ========================================================
+ 6 Dirty Victims from the QOS domain to all types of memory
+ 5 Reads to slow memory in the non-local NUMA domain
+ 4 Reads to slow memory in the local NUMA domain
+ 3 Non-temporal writes to non-local NUMA domain
+ 2 Non-temporal writes to local NUMA domain
+ 1 Reads to memory in the non-local NUMA domain
+ 0 Reads to memory in the local NUMA domain
+ ==== ========================================================
+
+ By default, the mbm_total_bytes configuration is set to 0x7f to count
+ all the event types and the mbm_local_bytes configuration is set to
+ 0x15 to count all the local memory events.
+
+ Examples:
+
+ * To view the current configuration::
+ ::
+
+ # cat /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
+ 0=0x7f;1=0x7f;2=0x7f;3=0x7f
+
+ # cat /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config
+ 0=0x15;1=0x15;3=0x15;4=0x15
+
+ * To change the mbm_total_bytes to count only reads on domain 0,
+ the bits 0, 1, 4 and 5 needs to be set, which is 110011b in binary
+ (in hexadecimal 0x33):
+ ::
+
+ # echo "0=0x33" > /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
+
+ # cat /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
+ 0=0x33;1=0x7f;2=0x7f;3=0x7f
+
+ * To change the mbm_local_bytes to count all the slow memory reads on
+ domain 0 and 1, the bits 4 and 5 needs to be set, which is 110000b
+ in binary (in hexadecimal 0x30):
+ ::
+
+ # echo "0=0x30;1=0x30" > /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config
+
+ # cat /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config
+ 0=0x30;1=0x30;3=0x15;4=0x15
+
+"max_threshold_occupancy":
+ Read/write file provides the largest value (in
+ bytes) at which a previously used LLC_occupancy
+ counter can be considered for re-use.
+
+Finally, in the top level of the "info" directory there is a file
+named "last_cmd_status". This is reset with every "command" issued
+via the file system (making new directories or writing to any of the
+control files). If the command was successful, it will read as "ok".
+If the command failed, it will provide more information that can be
+conveyed in the error returns from file operations. E.g.
+::
+
+ # echo L3:0=f7 > schemata
+ bash: echo: write error: Invalid argument
+ # cat info/last_cmd_status
+ mask f7 has non-consecutive 1-bits
+
+Resource alloc and monitor groups
+=================================
+
+Resource groups are represented as directories in the resctrl file
+system. The default group is the root directory which, immediately
+after mounting, owns all the tasks and cpus in the system and can make
+full use of all resources.
+
+On a system with RDT control features additional directories can be
+created in the root directory that specify different amounts of each
+resource (see "schemata" below). The root and these additional top level
+directories are referred to as "CTRL_MON" groups below.
+
+On a system with RDT monitoring the root directory and other top level
+directories contain a directory named "mon_groups" in which additional
+directories can be created to monitor subsets of tasks in the CTRL_MON
+group that is their ancestor. These are called "MON" groups in the rest
+of this document.
+
+Removing a directory will move all tasks and cpus owned by the group it
+represents to the parent. Removing one of the created CTRL_MON groups
+will automatically remove all MON groups below it.
+
+Moving MON group directories to a new parent CTRL_MON group is supported
+for the purpose of changing the resource allocations of a MON group
+without impacting its monitoring data or assigned tasks. This operation
+is not allowed for MON groups which monitor CPUs. No other move
+operation is currently allowed other than simply renaming a CTRL_MON or
+MON group.
+
+All groups contain the following files:
+
+"tasks":
+ Reading this file shows the list of all tasks that belong to
+ this group. Writing a task id to the file will add a task to the
+ group. Multiple tasks can be added by separating the task ids
+ with commas. Tasks will be assigned sequentially. Multiple
+ failures are not supported. A single failure encountered while
+ attempting to assign a task will cause the operation to abort and
+ already added tasks before the failure will remain in the group.
+ Failures will be logged to /sys/fs/resctrl/info/last_cmd_status.
+
+ If the group is a CTRL_MON group the task is removed from
+ whichever previous CTRL_MON group owned the task and also from
+ any MON group that owned the task. If the group is a MON group,
+ then the task must already belong to the CTRL_MON parent of this
+ group. The task is removed from any previous MON group.
+
+
+"cpus":
+ Reading this file shows a bitmask of the logical CPUs owned by
+ this group. Writing a mask to this file will add and remove
+ CPUs to/from this group. As with the tasks file a hierarchy is
+ maintained where MON groups may only include CPUs owned by the
+ parent CTRL_MON group.
+ When the resource group is in pseudo-locked mode this file will
+ only be readable, reflecting the CPUs associated with the
+ pseudo-locked region.
+
+
+"cpus_list":
+ Just like "cpus", only using ranges of CPUs instead of bitmasks.
+
+
+When control is enabled all CTRL_MON groups will also contain:
+
+"schemata":
+ A list of all the resources available to this group.
+ Each resource has its own line and format - see below for details.
+
+"size":
+ Mirrors the display of the "schemata" file to display the size in
+ bytes of each allocation instead of the bits representing the
+ allocation.
+
+"mode":
+ The "mode" of the resource group dictates the sharing of its
+ allocations. A "shareable" resource group allows sharing of its
+ allocations while an "exclusive" resource group does not. A
+ cache pseudo-locked region is created by first writing
+ "pseudo-locksetup" to the "mode" file before writing the cache
+ pseudo-locked region's schemata to the resource group's "schemata"
+ file. On successful pseudo-locked region creation the mode will
+ automatically change to "pseudo-locked".
+
+"ctrl_hw_id":
+ Available only with debug option. The identifier used by hardware
+ for the control group. On x86 this is the CLOSID.
+
+When monitoring is enabled all MON groups will also contain:
+
+"mon_data":
+ This contains a set of files organized by L3 domain and by
+ RDT event. E.g. on a system with two L3 domains there will
+ be subdirectories "mon_L3_00" and "mon_L3_01". Each of these
+ directories have one file per event (e.g. "llc_occupancy",
+ "mbm_total_bytes", and "mbm_local_bytes"). In a MON group these
+ files provide a read out of the current value of the event for
+ all tasks in the group. In CTRL_MON groups these files provide
+ the sum for all tasks in the CTRL_MON group and all tasks in
+ MON groups. Please see example section for more details on usage.
+ On systems with Sub-NUMA Cluster (SNC) enabled there are extra
+ directories for each node (located within the "mon_L3_XX" directory
+ for the L3 cache they occupy). These are named "mon_sub_L3_YY"
+ where "YY" is the node number.
+
+"mon_hw_id":
+ Available only with debug option. The identifier used by hardware
+ for the monitor group. On x86 this is the RMID.
+
+When the "mba_MBps" mount option is used all CTRL_MON groups will also contain:
+
+"mba_MBps_event":
+ Reading this file shows which memory bandwidth event is used
+ as input to the software feedback loop that keeps memory bandwidth
+ below the value specified in the schemata file. Writing the
+ name of one of the supported memory bandwidth events found in
+ /sys/fs/resctrl/info/L3_MON/mon_features changes the input
+ event.
+
+Resource allocation rules
+-------------------------
+
+When a task is running the following rules define which resources are
+available to it:
+
+1) If the task is a member of a non-default group, then the schemata
+ for that group is used.
+
+2) Else if the task belongs to the default group, but is running on a
+ CPU that is assigned to some specific group, then the schemata for the
+ CPU's group is used.
+
+3) Otherwise the schemata for the default group is used.
+
+Resource monitoring rules
+-------------------------
+1) If a task is a member of a MON group, or non-default CTRL_MON group
+ then RDT events for the task will be reported in that group.
+
+2) If a task is a member of the default CTRL_MON group, but is running
+ on a CPU that is assigned to some specific group, then the RDT events
+ for the task will be reported in that group.
+
+3) Otherwise RDT events for the task will be reported in the root level
+ "mon_data" group.
+
+
+Notes on cache occupancy monitoring and control
+===============================================
+When moving a task from one group to another you should remember that
+this only affects *new* cache allocations by the task. E.g. you may have
+a task in a monitor group showing 3 MB of cache occupancy. If you move
+to a new group and immediately check the occupancy of the old and new
+groups you will likely see that the old group is still showing 3 MB and
+the new group zero. When the task accesses locations still in cache from
+before the move, the h/w does not update any counters. On a busy system
+you will likely see the occupancy in the old group go down as cache lines
+are evicted and re-used while the occupancy in the new group rises as
+the task accesses memory and loads into the cache are counted based on
+membership in the new group.
+
+The same applies to cache allocation control. Moving a task to a group
+with a smaller cache partition will not evict any cache lines. The
+process may continue to use them from the old partition.
+
+Hardware uses CLOSid(Class of service ID) and an RMID(Resource monitoring ID)
+to identify a control group and a monitoring group respectively. Each of
+the resource groups are mapped to these IDs based on the kind of group. The
+number of CLOSid and RMID are limited by the hardware and hence the creation of
+a "CTRL_MON" directory may fail if we run out of either CLOSID or RMID
+and creation of "MON" group may fail if we run out of RMIDs.
+
+max_threshold_occupancy - generic concepts
+------------------------------------------
+
+Note that an RMID once freed may not be immediately available for use as
+the RMID is still tagged the cache lines of the previous user of RMID.
+Hence such RMIDs are placed on limbo list and checked back if the cache
+occupancy has gone down. If there is a time when system has a lot of
+limbo RMIDs but which are not ready to be used, user may see an -EBUSY
+during mkdir.
+
+max_threshold_occupancy is a user configurable value to determine the
+occupancy at which an RMID can be freed.
+
+The mon_llc_occupancy_limbo tracepoint gives the precise occupancy in bytes
+for a subset of RMID that are not immediately available for allocation.
+This can't be relied on to produce output every second, it may be necessary
+to attempt to create an empty monitor group to force an update. Output may
+only be produced if creation of a control or monitor group fails.
+
+Schemata files - general concepts
+---------------------------------
+Each line in the file describes one resource. The line starts with
+the name of the resource, followed by specific values to be applied
+in each of the instances of that resource on the system.
+
+Cache IDs
+---------
+On current generation systems there is one L3 cache per socket and L2
+caches are generally just shared by the hyperthreads on a core, but this
+isn't an architectural requirement. We could have multiple separate L3
+caches on a socket, multiple cores could share an L2 cache. So instead
+of using "socket" or "core" to define the set of logical cpus sharing
+a resource we use a "Cache ID". At a given cache level this will be a
+unique number across the whole system (but it isn't guaranteed to be a
+contiguous sequence, there may be gaps). To find the ID for each logical
+CPU look in /sys/devices/system/cpu/cpu*/cache/index*/id
+
+Cache Bit Masks (CBM)
+---------------------
+For cache resources we describe the portion of the cache that is available
+for allocation using a bitmask. The maximum value of the mask is defined
+by each cpu model (and may be different for different cache levels). It
+is found using CPUID, but is also provided in the "info" directory of
+the resctrl file system in "info/{resource}/cbm_mask". Some Intel hardware
+requires that these masks have all the '1' bits in a contiguous block. So
+0x3, 0x6 and 0xC are legal 4-bit masks with two bits set, but 0x5, 0x9
+and 0xA are not. Check /sys/fs/resctrl/info/{resource}/sparse_masks
+if non-contiguous 1s value is supported. On a system with a 20-bit mask
+each bit represents 5% of the capacity of the cache. You could partition
+the cache into four equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000.
+
+Notes on Sub-NUMA Cluster mode
+==============================
+When SNC mode is enabled, Linux may load balance tasks between Sub-NUMA
+nodes much more readily than between regular NUMA nodes since the CPUs
+on Sub-NUMA nodes share the same L3 cache and the system may report
+the NUMA distance between Sub-NUMA nodes with a lower value than used
+for regular NUMA nodes.
+
+The top-level monitoring files in each "mon_L3_XX" directory provide
+the sum of data across all SNC nodes sharing an L3 cache instance.
+Users who bind tasks to the CPUs of a specific Sub-NUMA node can read
+the "llc_occupancy", "mbm_total_bytes", and "mbm_local_bytes" in the
+"mon_sub_L3_YY" directories to get node local data.
+
+Memory bandwidth allocation is still performed at the L3 cache
+level. I.e. throttling controls are applied to all SNC nodes.
+
+L3 cache allocation bitmaps also apply to all SNC nodes. But note that
+the amount of L3 cache represented by each bit is divided by the number
+of SNC nodes per L3 cache. E.g. with a 100MB cache on a system with 10-bit
+allocation masks each bit normally represents 10MB. With SNC mode enabled
+with two SNC nodes per L3 cache, each bit only represents 5MB.
+
+Memory bandwidth Allocation and monitoring
+==========================================
+
+For Memory bandwidth resource, by default the user controls the resource
+by indicating the percentage of total memory bandwidth.
+
+The minimum bandwidth percentage value for each cpu model is predefined
+and can be looked up through "info/MB/min_bandwidth". The bandwidth
+granularity that is allocated is also dependent on the cpu model and can
+be looked up at "info/MB/bandwidth_gran". The available bandwidth
+control steps are: min_bw + N * bw_gran. Intermediate values are rounded
+to the next control step available on the hardware.
+
+The bandwidth throttling is a core specific mechanism on some of Intel
+SKUs. Using a high bandwidth and a low bandwidth setting on two threads
+sharing a core may result in both threads being throttled to use the
+low bandwidth (see "thread_throttle_mode").
+
+The fact that Memory bandwidth allocation(MBA) may be a core
+specific mechanism where as memory bandwidth monitoring(MBM) is done at
+the package level may lead to confusion when users try to apply control
+via the MBA and then monitor the bandwidth to see if the controls are
+effective. Below are such scenarios:
+
+1. User may *not* see increase in actual bandwidth when percentage
+ values are increased:
+
+This can occur when aggregate L2 external bandwidth is more than L3
+external bandwidth. Consider an SKL SKU with 24 cores on a package and
+where L2 external is 10GBps (hence aggregate L2 external bandwidth is
+240GBps) and L3 external bandwidth is 100GBps. Now a workload with '20
+threads, having 50% bandwidth, each consuming 5GBps' consumes the max L3
+bandwidth of 100GBps although the percentage value specified is only 50%
+<< 100%. Hence increasing the bandwidth percentage will not yield any
+more bandwidth. This is because although the L2 external bandwidth still
+has capacity, the L3 external bandwidth is fully used. Also note that
+this would be dependent on number of cores the benchmark is run on.
+
+2. Same bandwidth percentage may mean different actual bandwidth
+ depending on # of threads:
+
+For the same SKU in #1, a 'single thread, with 10% bandwidth' and '4
+thread, with 10% bandwidth' can consume upto 10GBps and 40GBps although
+they have same percentage bandwidth of 10%. This is simply because as
+threads start using more cores in an rdtgroup, the actual bandwidth may
+increase or vary although user specified bandwidth percentage is same.
+
+In order to mitigate this and make the interface more user friendly,
+resctrl added support for specifying the bandwidth in MiBps as well. The
+kernel underneath would use a software feedback mechanism or a "Software
+Controller(mba_sc)" which reads the actual bandwidth using MBM counters
+and adjust the memory bandwidth percentages to ensure::
+
+ "actual bandwidth < user specified bandwidth".
+
+By default, the schemata would take the bandwidth percentage values
+where as user can switch to the "MBA software controller" mode using
+a mount option 'mba_MBps'. The schemata format is specified in the below
+sections.
+
+L3 schemata file details (code and data prioritization disabled)
+----------------------------------------------------------------
+With CDP disabled the L3 schemata format is::
+
+ L3:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
+
+L3 schemata file details (CDP enabled via mount option to resctrl)
+------------------------------------------------------------------
+When CDP is enabled L3 control is split into two separate resources
+so you can specify independent masks for code and data like this::
+
+ L3DATA:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
+ L3CODE:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
+
+L2 schemata file details
+------------------------
+CDP is supported at L2 using the 'cdpl2' mount option. The schemata
+format is either::
+
+ L2:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
+
+or
+
+ L2DATA:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
+ L2CODE:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
+
+
+Memory bandwidth Allocation (default mode)
+------------------------------------------
+
+Memory b/w domain is L3 cache.
+::
+
+ MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;...
+
+Memory bandwidth Allocation specified in MiBps
+----------------------------------------------
+
+Memory bandwidth domain is L3 cache.
+::
+
+ MB:<cache_id0>=bw_MiBps0;<cache_id1>=bw_MiBps1;...
+
+Slow Memory Bandwidth Allocation (SMBA)
+---------------------------------------
+AMD hardware supports Slow Memory Bandwidth Allocation (SMBA).
+CXL.memory is the only supported "slow" memory device. With the
+support of SMBA, the hardware enables bandwidth allocation on
+the slow memory devices. If there are multiple such devices in
+the system, the throttling logic groups all the slow sources
+together and applies the limit on them as a whole.
+
+The presence of SMBA (with CXL.memory) is independent of slow memory
+devices presence. If there are no such devices on the system, then
+configuring SMBA will have no impact on the performance of the system.
+
+The bandwidth domain for slow memory is L3 cache. Its schemata file
+is formatted as:
+::
+
+ SMBA:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;...
+
+Reading/writing the schemata file
+---------------------------------
+Reading the schemata file will show the state of all resources
+on all domains. When writing you only need to specify those values
+which you wish to change. E.g.
+::
+
+ # cat schemata
+ L3DATA:0=fffff;1=fffff;2=fffff;3=fffff
+ L3CODE:0=fffff;1=fffff;2=fffff;3=fffff
+ # echo "L3DATA:2=3c0;" > schemata
+ # cat schemata
+ L3DATA:0=fffff;1=fffff;2=3c0;3=fffff
+ L3CODE:0=fffff;1=fffff;2=fffff;3=fffff
+
+Reading/writing the schemata file (on AMD systems)
+--------------------------------------------------
+Reading the schemata file will show the current bandwidth limit on all
+domains. The allocated resources are in multiples of one eighth GB/s.
+When writing to the file, you need to specify what cache id you wish to
+configure the bandwidth limit.
+
+For example, to allocate 2GB/s limit on the first cache id:
+
+::
+
+ # cat schemata
+ MB:0=2048;1=2048;2=2048;3=2048
+ L3:0=ffff;1=ffff;2=ffff;3=ffff
+
+ # echo "MB:1=16" > schemata
+ # cat schemata
+ MB:0=2048;1= 16;2=2048;3=2048
+ L3:0=ffff;1=ffff;2=ffff;3=ffff
+
+Reading/writing the schemata file (on AMD systems) with SMBA feature
+--------------------------------------------------------------------
+Reading and writing the schemata file is the same as without SMBA in
+above section.
+
+For example, to allocate 8GB/s limit on the first cache id:
+
+::
+
+ # cat schemata
+ SMBA:0=2048;1=2048;2=2048;3=2048
+ MB:0=2048;1=2048;2=2048;3=2048
+ L3:0=ffff;1=ffff;2=ffff;3=ffff
+
+ # echo "SMBA:1=64" > schemata
+ # cat schemata
+ SMBA:0=2048;1= 64;2=2048;3=2048
+ MB:0=2048;1=2048;2=2048;3=2048
+ L3:0=ffff;1=ffff;2=ffff;3=ffff
+
+Cache Pseudo-Locking
+====================
+CAT enables a user to specify the amount of cache space that an
+application can fill. Cache pseudo-locking builds on the fact that a
+CPU can still read and write data pre-allocated outside its current
+allocated area on a cache hit. With cache pseudo-locking, data can be
+preloaded into a reserved portion of cache that no application can
+fill, and from that point on will only serve cache hits. The cache
+pseudo-locked memory is made accessible to user space where an
+application can map it into its virtual address space and thus have
+a region of memory with reduced average read latency.
+
+The creation of a cache pseudo-locked region is triggered by a request
+from the user to do so that is accompanied by a schemata of the region
+to be pseudo-locked. The cache pseudo-locked region is created as follows:
+
+- Create a CAT allocation CLOSNEW with a CBM matching the schemata
+ from the user of the cache region that will contain the pseudo-locked
+ memory. This region must not overlap with any current CAT allocation/CLOS
+ on the system and no future overlap with this cache region is allowed
+ while the pseudo-locked region exists.
+- Create a contiguous region of memory of the same size as the cache
+ region.
+- Flush the cache, disable hardware prefetchers, disable preemption.
+- Make CLOSNEW the active CLOS and touch the allocated memory to load
+ it into the cache.
+- Set the previous CLOS as active.
+- At this point the closid CLOSNEW can be released - the cache
+ pseudo-locked region is protected as long as its CBM does not appear in
+ any CAT allocation. Even though the cache pseudo-locked region will from
+ this point on not appear in any CBM of any CLOS an application running with
+ any CLOS will be able to access the memory in the pseudo-locked region since
+ the region continues to serve cache hits.
+- The contiguous region of memory loaded into the cache is exposed to
+ user-space as a character device.
+
+Cache pseudo-locking increases the probability that data will remain
+in the cache via carefully configuring the CAT feature and controlling
+application behavior. There is no guarantee that data is placed in
+cache. Instructions like INVD, WBINVD, CLFLUSH, etc. can still evict
+“locked” data from cache. Power management C-states may shrink or
+power off cache. Deeper C-states will automatically be restricted on
+pseudo-locked region creation.
+
+It is required that an application using a pseudo-locked region runs
+with affinity to the cores (or a subset of the cores) associated
+with the cache on which the pseudo-locked region resides. A sanity check
+within the code will not allow an application to map pseudo-locked memory
+unless it runs with affinity to cores associated with the cache on which the
+pseudo-locked region resides. The sanity check is only done during the
+initial mmap() handling, there is no enforcement afterwards and the
+application self needs to ensure it remains affine to the correct cores.
+
+Pseudo-locking is accomplished in two stages:
+
+1) During the first stage the system administrator allocates a portion
+ of cache that should be dedicated to pseudo-locking. At this time an
+ equivalent portion of memory is allocated, loaded into allocated
+ cache portion, and exposed as a character device.
+2) During the second stage a user-space application maps (mmap()) the
+ pseudo-locked memory into its address space.
+
+Cache Pseudo-Locking Interface
+------------------------------
+A pseudo-locked region is created using the resctrl interface as follows:
+
+1) Create a new resource group by creating a new directory in /sys/fs/resctrl.
+2) Change the new resource group's mode to "pseudo-locksetup" by writing
+ "pseudo-locksetup" to the "mode" file.
+3) Write the schemata of the pseudo-locked region to the "schemata" file. All
+ bits within the schemata should be "unused" according to the "bit_usage"
+ file.
+
+On successful pseudo-locked region creation the "mode" file will contain
+"pseudo-locked" and a new character device with the same name as the resource
+group will exist in /dev/pseudo_lock. This character device can be mmap()'ed
+by user space in order to obtain access to the pseudo-locked memory region.
+
+An example of cache pseudo-locked region creation and usage can be found below.
+
+Cache Pseudo-Locking Debugging Interface
+----------------------------------------
+The pseudo-locking debugging interface is enabled by default (if
+CONFIG_DEBUG_FS is enabled) and can be found in /sys/kernel/debug/resctrl.
+
+There is no explicit way for the kernel to test if a provided memory
+location is present in the cache. The pseudo-locking debugging interface uses
+the tracing infrastructure to provide two ways to measure cache residency of
+the pseudo-locked region:
+
+1) Memory access latency using the pseudo_lock_mem_latency tracepoint. Data
+ from these measurements are best visualized using a hist trigger (see
+ example below). In this test the pseudo-locked region is traversed at
+ a stride of 32 bytes while hardware prefetchers and preemption
+ are disabled. This also provides a substitute visualization of cache
+ hits and misses.
+2) Cache hit and miss measurements using model specific precision counters if
+ available. Depending on the levels of cache on the system the pseudo_lock_l2
+ and pseudo_lock_l3 tracepoints are available.
+
+When a pseudo-locked region is created a new debugfs directory is created for
+it in debugfs as /sys/kernel/debug/resctrl/<newdir>. A single
+write-only file, pseudo_lock_measure, is present in this directory. The
+measurement of the pseudo-locked region depends on the number written to this
+debugfs file:
+
+1:
+ writing "1" to the pseudo_lock_measure file will trigger the latency
+ measurement captured in the pseudo_lock_mem_latency tracepoint. See
+ example below.
+2:
+ writing "2" to the pseudo_lock_measure file will trigger the L2 cache
+ residency (cache hits and misses) measurement captured in the
+ pseudo_lock_l2 tracepoint. See example below.
+3:
+ writing "3" to the pseudo_lock_measure file will trigger the L3 cache
+ residency (cache hits and misses) measurement captured in the
+ pseudo_lock_l3 tracepoint.
+
+All measurements are recorded with the tracing infrastructure. This requires
+the relevant tracepoints to be enabled before the measurement is triggered.
+
+Example of latency debugging interface
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+In this example a pseudo-locked region named "newlock" was created. Here is
+how we can measure the latency in cycles of reading from this region and
+visualize this data with a histogram that is available if CONFIG_HIST_TRIGGERS
+is set::
+
+ # :> /sys/kernel/tracing/trace
+ # echo 'hist:keys=latency' > /sys/kernel/tracing/events/resctrl/pseudo_lock_mem_latency/trigger
+ # echo 1 > /sys/kernel/tracing/events/resctrl/pseudo_lock_mem_latency/enable
+ # echo 1 > /sys/kernel/debug/resctrl/newlock/pseudo_lock_measure
+ # echo 0 > /sys/kernel/tracing/events/resctrl/pseudo_lock_mem_latency/enable
+ # cat /sys/kernel/tracing/events/resctrl/pseudo_lock_mem_latency/hist
+
+ # event histogram
+ #
+ # trigger info: hist:keys=latency:vals=hitcount:sort=hitcount:size=2048 [active]
+ #
+
+ { latency: 456 } hitcount: 1
+ { latency: 50 } hitcount: 83
+ { latency: 36 } hitcount: 96
+ { latency: 44 } hitcount: 174
+ { latency: 48 } hitcount: 195
+ { latency: 46 } hitcount: 262
+ { latency: 42 } hitcount: 693
+ { latency: 40 } hitcount: 3204
+ { latency: 38 } hitcount: 3484
+
+ Totals:
+ Hits: 8192
+ Entries: 9
+ Dropped: 0
+
+Example of cache hits/misses debugging
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+In this example a pseudo-locked region named "newlock" was created on the L2
+cache of a platform. Here is how we can obtain details of the cache hits
+and misses using the platform's precision counters.
+::
+
+ # :> /sys/kernel/tracing/trace
+ # echo 1 > /sys/kernel/tracing/events/resctrl/pseudo_lock_l2/enable
+ # echo 2 > /sys/kernel/debug/resctrl/newlock/pseudo_lock_measure
+ # echo 0 > /sys/kernel/tracing/events/resctrl/pseudo_lock_l2/enable
+ # cat /sys/kernel/tracing/trace
+
+ # tracer: nop
+ #
+ # _-----=> irqs-off
+ # / _----=> need-resched
+ # | / _---=> hardirq/softirq
+ # || / _--=> preempt-depth
+ # ||| / delay
+ # TASK-PID CPU# |||| TIMESTAMP FUNCTION
+ # | | | |||| | |
+ pseudo_lock_mea-1672 [002] .... 3132.860500: pseudo_lock_l2: hits=4097 miss=0
+
+
+Examples for RDT allocation usage
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+1) Example 1
+
+On a two socket machine (one L3 cache per socket) with just four bits
+for cache bit masks, minimum b/w of 10% with a memory bandwidth
+granularity of 10%.
+::
+
+ # mount -t resctrl resctrl /sys/fs/resctrl
+ # cd /sys/fs/resctrl
+ # mkdir p0 p1
+ # echo "L3:0=3;1=c\nMB:0=50;1=50" > /sys/fs/resctrl/p0/schemata
+ # echo "L3:0=3;1=3\nMB:0=50;1=50" > /sys/fs/resctrl/p1/schemata
+
+The default resource group is unmodified, so we have access to all parts
+of all caches (its schemata file reads "L3:0=f;1=f").
+
+Tasks that are under the control of group "p0" may only allocate from the
+"lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1.
+Tasks in group "p1" use the "lower" 50% of cache on both sockets.
+
+Similarly, tasks that are under the control of group "p0" may use a
+maximum memory b/w of 50% on socket0 and 50% on socket 1.
+Tasks in group "p1" may also use 50% memory b/w on both sockets.
+Note that unlike cache masks, memory b/w cannot specify whether these
+allocations can overlap or not. The allocations specifies the maximum
+b/w that the group may be able to use and the system admin can configure
+the b/w accordingly.
+
+If resctrl is using the software controller (mba_sc) then user can enter the
+max b/w in MB rather than the percentage values.
+::
+
+ # echo "L3:0=3;1=c\nMB:0=1024;1=500" > /sys/fs/resctrl/p0/schemata
+ # echo "L3:0=3;1=3\nMB:0=1024;1=500" > /sys/fs/resctrl/p1/schemata
+
+In the above example the tasks in "p1" and "p0" on socket 0 would use a max b/w
+of 1024MB where as on socket 1 they would use 500MB.
+
+2) Example 2
+
+Again two sockets, but this time with a more realistic 20-bit mask.
+
+Two real time tasks pid=1234 running on processor 0 and pid=5678 running on
+processor 1 on socket 0 on a 2-socket and dual core machine. To avoid noisy
+neighbors, each of the two real-time tasks exclusively occupies one quarter
+of L3 cache on socket 0.
+::
+
+ # mount -t resctrl resctrl /sys/fs/resctrl
+ # cd /sys/fs/resctrl
+
+First we reset the schemata for the default group so that the "upper"
+50% of the L3 cache on socket 0 and 50% of memory b/w cannot be used by
+ordinary tasks::
+
+ # echo "L3:0=3ff;1=fffff\nMB:0=50;1=100" > schemata
+
+Next we make a resource group for our first real time task and give
+it access to the "top" 25% of the cache on socket 0.
+::
+
+ # mkdir p0
+ # echo "L3:0=f8000;1=fffff" > p0/schemata
+
+Finally we move our first real time task into this resource group. We
+also use taskset(1) to ensure the task always runs on a dedicated CPU
+on socket 0. Most uses of resource groups will also constrain which
+processors tasks run on.
+::
+
+ # echo 1234 > p0/tasks
+ # taskset -cp 1 1234
+
+Ditto for the second real time task (with the remaining 25% of cache)::
+
+ # mkdir p1
+ # echo "L3:0=7c00;1=fffff" > p1/schemata
+ # echo 5678 > p1/tasks
+ # taskset -cp 2 5678
+
+For the same 2 socket system with memory b/w resource and CAT L3 the
+schemata would look like(Assume min_bandwidth 10 and bandwidth_gran is
+10):
+
+For our first real time task this would request 20% memory b/w on socket 0.
+::
+
+ # echo -e "L3:0=f8000;1=fffff\nMB:0=20;1=100" > p0/schemata
+
+For our second real time task this would request an other 20% memory b/w
+on socket 0.
+::
+
+ # echo -e "L3:0=f8000;1=fffff\nMB:0=20;1=100" > p0/schemata
+
+3) Example 3
+
+A single socket system which has real-time tasks running on core 4-7 and
+non real-time workload assigned to core 0-3. The real-time tasks share text
+and data, so a per task association is not required and due to interaction
+with the kernel it's desired that the kernel on these cores shares L3 with
+the tasks.
+::
+
+ # mount -t resctrl resctrl /sys/fs/resctrl
+ # cd /sys/fs/resctrl
+
+First we reset the schemata for the default group so that the "upper"
+50% of the L3 cache on socket 0, and 50% of memory bandwidth on socket 0
+cannot be used by ordinary tasks::
+
+ # echo "L3:0=3ff\nMB:0=50" > schemata
+
+Next we make a resource group for our real time cores and give it access
+to the "top" 50% of the cache on socket 0 and 50% of memory bandwidth on
+socket 0.
+::
+
+ # mkdir p0
+ # echo "L3:0=ffc00\nMB:0=50" > p0/schemata
+
+Finally we move core 4-7 over to the new group and make sure that the
+kernel and the tasks running there get 50% of the cache. They should
+also get 50% of memory bandwidth assuming that the cores 4-7 are SMT
+siblings and only the real time threads are scheduled on the cores 4-7.
+::
+
+ # echo F0 > p0/cpus
+
+4) Example 4
+
+The resource groups in previous examples were all in the default "shareable"
+mode allowing sharing of their cache allocations. If one resource group
+configures a cache allocation then nothing prevents another resource group
+to overlap with that allocation.
+
+In this example a new exclusive resource group will be created on a L2 CAT
+system with two L2 cache instances that can be configured with an 8-bit
+capacity bitmask. The new exclusive resource group will be configured to use
+25% of each cache instance.
+::
+
+ # mount -t resctrl resctrl /sys/fs/resctrl/
+ # cd /sys/fs/resctrl
+
+First, we observe that the default group is configured to allocate to all L2
+cache::
+
+ # cat schemata
+ L2:0=ff;1=ff
+
+We could attempt to create the new resource group at this point, but it will
+fail because of the overlap with the schemata of the default group::
+
+ # mkdir p0
+ # echo 'L2:0=0x3;1=0x3' > p0/schemata
+ # cat p0/mode
+ shareable
+ # echo exclusive > p0/mode
+ -sh: echo: write error: Invalid argument
+ # cat info/last_cmd_status
+ schemata overlaps
+
+To ensure that there is no overlap with another resource group the default
+resource group's schemata has to change, making it possible for the new
+resource group to become exclusive.
+::
+
+ # echo 'L2:0=0xfc;1=0xfc' > schemata
+ # echo exclusive > p0/mode
+ # grep . p0/*
+ p0/cpus:0
+ p0/mode:exclusive
+ p0/schemata:L2:0=03;1=03
+ p0/size:L2:0=262144;1=262144
+
+A new resource group will on creation not overlap with an exclusive resource
+group::
+
+ # mkdir p1
+ # grep . p1/*
+ p1/cpus:0
+ p1/mode:shareable
+ p1/schemata:L2:0=fc;1=fc
+ p1/size:L2:0=786432;1=786432
+
+The bit_usage will reflect how the cache is used::
+
+ # cat info/L2/bit_usage
+ 0=SSSSSSEE;1=SSSSSSEE
+
+A resource group cannot be forced to overlap with an exclusive resource group::
+
+ # echo 'L2:0=0x1;1=0x1' > p1/schemata
+ -sh: echo: write error: Invalid argument
+ # cat info/last_cmd_status
+ overlaps with exclusive group
+
+Example of Cache Pseudo-Locking
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Lock portion of L2 cache from cache id 1 using CBM 0x3. Pseudo-locked
+region is exposed at /dev/pseudo_lock/newlock that can be provided to
+application for argument to mmap().
+::
+
+ # mount -t resctrl resctrl /sys/fs/resctrl/
+ # cd /sys/fs/resctrl
+
+Ensure that there are bits available that can be pseudo-locked, since only
+unused bits can be pseudo-locked the bits to be pseudo-locked needs to be
+removed from the default resource group's schemata::
+
+ # cat info/L2/bit_usage
+ 0=SSSSSSSS;1=SSSSSSSS
+ # echo 'L2:1=0xfc' > schemata
+ # cat info/L2/bit_usage
+ 0=SSSSSSSS;1=SSSSSS00
+
+Create a new resource group that will be associated with the pseudo-locked
+region, indicate that it will be used for a pseudo-locked region, and
+configure the requested pseudo-locked region capacity bitmask::
+
+ # mkdir newlock
+ # echo pseudo-locksetup > newlock/mode
+ # echo 'L2:1=0x3' > newlock/schemata
+
+On success the resource group's mode will change to pseudo-locked, the
+bit_usage will reflect the pseudo-locked region, and the character device
+exposing the pseudo-locked region will exist::
+
+ # cat newlock/mode
+ pseudo-locked
+ # cat info/L2/bit_usage
+ 0=SSSSSSSS;1=SSSSSSPP
+ # ls -l /dev/pseudo_lock/newlock
+ crw------- 1 root root 243, 0 Apr 3 05:01 /dev/pseudo_lock/newlock
+
+::
+
+ /*
+ * Example code to access one page of pseudo-locked cache region
+ * from user space.
+ */
+ #define _GNU_SOURCE
+ #include <fcntl.h>
+ #include <sched.h>
+ #include <stdio.h>
+ #include <stdlib.h>
+ #include <unistd.h>
+ #include <sys/mman.h>
+
+ /*
+ * It is required that the application runs with affinity to only
+ * cores associated with the pseudo-locked region. Here the cpu
+ * is hardcoded for convenience of example.
+ */
+ static int cpuid = 2;
+
+ int main(int argc, char *argv[])
+ {
+ cpu_set_t cpuset;
+ long page_size;
+ void *mapping;
+ int dev_fd;
+ int ret;
+
+ page_size = sysconf(_SC_PAGESIZE);
+
+ CPU_ZERO(&cpuset);
+ CPU_SET(cpuid, &cpuset);
+ ret = sched_setaffinity(0, sizeof(cpuset), &cpuset);
+ if (ret < 0) {
+ perror("sched_setaffinity");
+ exit(EXIT_FAILURE);
+ }
+
+ dev_fd = open("/dev/pseudo_lock/newlock", O_RDWR);
+ if (dev_fd < 0) {
+ perror("open");
+ exit(EXIT_FAILURE);
+ }
+
+ mapping = mmap(0, page_size, PROT_READ | PROT_WRITE, MAP_SHARED,
+ dev_fd, 0);
+ if (mapping == MAP_FAILED) {
+ perror("mmap");
+ close(dev_fd);
+ exit(EXIT_FAILURE);
+ }
+
+ /* Application interacts with pseudo-locked memory @mapping */
+
+ ret = munmap(mapping, page_size);
+ if (ret < 0) {
+ perror("munmap");
+ close(dev_fd);
+ exit(EXIT_FAILURE);
+ }
+
+ close(dev_fd);
+ exit(EXIT_SUCCESS);
+ }
+
+Locking between applications
+----------------------------
+
+Certain operations on the resctrl filesystem, composed of read/writes
+to/from multiple files, must be atomic.
+
+As an example, the allocation of an exclusive reservation of L3 cache
+involves:
+
+ 1. Read the cbmmasks from each directory or the per-resource "bit_usage"
+ 2. Find a contiguous set of bits in the global CBM bitmask that is clear
+ in any of the directory cbmmasks
+ 3. Create a new directory
+ 4. Set the bits found in step 2 to the new directory "schemata" file
+
+If two applications attempt to allocate space concurrently then they can
+end up allocating the same bits so the reservations are shared instead of
+exclusive.
+
+To coordinate atomic operations on the resctrlfs and to avoid the problem
+above, the following locking procedure is recommended:
+
+Locking is based on flock, which is available in libc and also as a shell
+script command
+
+Write lock:
+
+ A) Take flock(LOCK_EX) on /sys/fs/resctrl
+ B) Read/write the directory structure.
+ C) funlock
+
+Read lock:
+
+ A) Take flock(LOCK_SH) on /sys/fs/resctrl
+ B) If success read the directory structure.
+ C) funlock
+
+Example with bash::
+
+ # Atomically read directory structure
+ $ flock -s /sys/fs/resctrl/ find /sys/fs/resctrl
+
+ # Read directory contents and create new subdirectory
+
+ $ cat create-dir.sh
+ find /sys/fs/resctrl/ > output.txt
+ mask = function-of(output.txt)
+ mkdir /sys/fs/resctrl/newres/
+ echo mask > /sys/fs/resctrl/newres/schemata
+
+ $ flock /sys/fs/resctrl/ ./create-dir.sh
+
+Example with C::
+
+ /*
+ * Example code do take advisory locks
+ * before accessing resctrl filesystem
+ */
+ #include <sys/file.h>
+ #include <stdlib.h>
+
+ void resctrl_take_shared_lock(int fd)
+ {
+ int ret;
+
+ /* take shared lock on resctrl filesystem */
+ ret = flock(fd, LOCK_SH);
+ if (ret) {
+ perror("flock");
+ exit(-1);
+ }
+ }
+
+ void resctrl_take_exclusive_lock(int fd)
+ {
+ int ret;
+
+ /* release lock on resctrl filesystem */
+ ret = flock(fd, LOCK_EX);
+ if (ret) {
+ perror("flock");
+ exit(-1);
+ }
+ }
+
+ void resctrl_release_lock(int fd)
+ {
+ int ret;
+
+ /* take shared lock on resctrl filesystem */
+ ret = flock(fd, LOCK_UN);
+ if (ret) {
+ perror("flock");
+ exit(-1);
+ }
+ }
+
+ void main(void)
+ {
+ int fd, ret;
+
+ fd = open("/sys/fs/resctrl", O_DIRECTORY);
+ if (fd == -1) {
+ perror("open");
+ exit(-1);
+ }
+ resctrl_take_shared_lock(fd);
+ /* code to read directory contents */
+ resctrl_release_lock(fd);
+
+ resctrl_take_exclusive_lock(fd);
+ /* code to read and write directory contents */
+ resctrl_release_lock(fd);
+ }
+
+Examples for RDT Monitoring along with allocation usage
+=======================================================
+Reading monitored data
+----------------------
+Reading an event file (for ex: mon_data/mon_L3_00/llc_occupancy) would
+show the current snapshot of LLC occupancy of the corresponding MON
+group or CTRL_MON group.
+
+
+Example 1 (Monitor CTRL_MON group and subset of tasks in CTRL_MON group)
+------------------------------------------------------------------------
+On a two socket machine (one L3 cache per socket) with just four bits
+for cache bit masks::
+
+ # mount -t resctrl resctrl /sys/fs/resctrl
+ # cd /sys/fs/resctrl
+ # mkdir p0 p1
+ # echo "L3:0=3;1=c" > /sys/fs/resctrl/p0/schemata
+ # echo "L3:0=3;1=3" > /sys/fs/resctrl/p1/schemata
+ # echo 5678 > p1/tasks
+ # echo 5679 > p1/tasks
+
+The default resource group is unmodified, so we have access to all parts
+of all caches (its schemata file reads "L3:0=f;1=f").
+
+Tasks that are under the control of group "p0" may only allocate from the
+"lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1.
+Tasks in group "p1" use the "lower" 50% of cache on both sockets.
+
+Create monitor groups and assign a subset of tasks to each monitor group.
+::
+
+ # cd /sys/fs/resctrl/p1/mon_groups
+ # mkdir m11 m12
+ # echo 5678 > m11/tasks
+ # echo 5679 > m12/tasks
+
+fetch data (data shown in bytes)
+::
+
+ # cat m11/mon_data/mon_L3_00/llc_occupancy
+ 16234000
+ # cat m11/mon_data/mon_L3_01/llc_occupancy
+ 14789000
+ # cat m12/mon_data/mon_L3_00/llc_occupancy
+ 16789000
+
+The parent ctrl_mon group shows the aggregated data.
+::
+
+ # cat /sys/fs/resctrl/p1/mon_data/mon_l3_00/llc_occupancy
+ 31234000
+
+Example 2 (Monitor a task from its creation)
+--------------------------------------------
+On a two socket machine (one L3 cache per socket)::
+
+ # mount -t resctrl resctrl /sys/fs/resctrl
+ # cd /sys/fs/resctrl
+ # mkdir p0 p1
+
+An RMID is allocated to the group once its created and hence the <cmd>
+below is monitored from its creation.
+::
+
+ # echo $$ > /sys/fs/resctrl/p1/tasks
+ # <cmd>
+
+Fetch the data::
+
+ # cat /sys/fs/resctrl/p1/mon_data/mon_l3_00/llc_occupancy
+ 31789000
+
+Example 3 (Monitor without CAT support or before creating CAT groups)
+---------------------------------------------------------------------
+
+Assume a system like HSW has only CQM and no CAT support. In this case
+the resctrl will still mount but cannot create CTRL_MON directories.
+But user can create different MON groups within the root group thereby
+able to monitor all tasks including kernel threads.
+
+This can also be used to profile jobs cache size footprint before being
+able to allocate them to different allocation groups.
+::
+
+ # mount -t resctrl resctrl /sys/fs/resctrl
+ # cd /sys/fs/resctrl
+ # mkdir mon_groups/m01
+ # mkdir mon_groups/m02
+
+ # echo 3478 > /sys/fs/resctrl/mon_groups/m01/tasks
+ # echo 2467 > /sys/fs/resctrl/mon_groups/m02/tasks
+
+Monitor the groups separately and also get per domain data. From the
+below its apparent that the tasks are mostly doing work on
+domain(socket) 0.
+::
+
+ # cat /sys/fs/resctrl/mon_groups/m01/mon_L3_00/llc_occupancy
+ 31234000
+ # cat /sys/fs/resctrl/mon_groups/m01/mon_L3_01/llc_occupancy
+ 34555
+ # cat /sys/fs/resctrl/mon_groups/m02/mon_L3_00/llc_occupancy
+ 31234000
+ # cat /sys/fs/resctrl/mon_groups/m02/mon_L3_01/llc_occupancy
+ 32789
+
+
+Example 4 (Monitor real time tasks)
+-----------------------------------
+
+A single socket system which has real time tasks running on cores 4-7
+and non real time tasks on other cpus. We want to monitor the cache
+occupancy of the real time threads on these cores.
+::
+
+ # mount -t resctrl resctrl /sys/fs/resctrl
+ # cd /sys/fs/resctrl
+ # mkdir p1
+
+Move the cpus 4-7 over to p1::
+
+ # echo f0 > p1/cpus
+
+View the llc occupancy snapshot::
+
+ # cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
+ 11234000
+
+Intel RDT Errata
+================
+
+Intel MBM Counters May Report System Memory Bandwidth Incorrectly
+-----------------------------------------------------------------
+
+Errata SKX99 for Skylake server and BDF102 for Broadwell server.
+
+Problem: Intel Memory Bandwidth Monitoring (MBM) counters track metrics
+according to the assigned Resource Monitor ID (RMID) for that logical
+core. The IA32_QM_CTR register (MSR 0xC8E), used to report these
+metrics, may report incorrect system bandwidth for certain RMID values.
+
+Implication: Due to the errata, system memory bandwidth may not match
+what is reported.
+
+Workaround: MBM total and local readings are corrected according to the
+following correction factor table:
+
++---------------+---------------+---------------+-----------------+
+|core count |rmid count |rmid threshold |correction factor|
++---------------+---------------+---------------+-----------------+
+|1 |8 |0 |1.000000 |
++---------------+---------------+---------------+-----------------+
+|2 |16 |0 |1.000000 |
++---------------+---------------+---------------+-----------------+
+|3 |24 |15 |0.969650 |
++---------------+---------------+---------------+-----------------+
+|4 |32 |0 |1.000000 |
++---------------+---------------+---------------+-----------------+
+|6 |48 |31 |0.969650 |
++---------------+---------------+---------------+-----------------+
+|7 |56 |47 |1.142857 |
++---------------+---------------+---------------+-----------------+
+|8 |64 |0 |1.000000 |
++---------------+---------------+---------------+-----------------+
+|9 |72 |63 |1.185115 |
++---------------+---------------+---------------+-----------------+
+|10 |80 |63 |1.066553 |
++---------------+---------------+---------------+-----------------+
+|11 |88 |79 |1.454545 |
++---------------+---------------+---------------+-----------------+
+|12 |96 |0 |1.000000 |
++---------------+---------------+---------------+-----------------+
+|13 |104 |95 |1.230769 |
++---------------+---------------+---------------+-----------------+
+|14 |112 |95 |1.142857 |
++---------------+---------------+---------------+-----------------+
+|15 |120 |95 |1.066667 |
++---------------+---------------+---------------+-----------------+
+|16 |128 |0 |1.000000 |
++---------------+---------------+---------------+-----------------+
+|17 |136 |127 |1.254863 |
++---------------+---------------+---------------+-----------------+
+|18 |144 |127 |1.185255 |
++---------------+---------------+---------------+-----------------+
+|19 |152 |0 |1.000000 |
++---------------+---------------+---------------+-----------------+
+|20 |160 |127 |1.066667 |
++---------------+---------------+---------------+-----------------+
+|21 |168 |0 |1.000000 |
++---------------+---------------+---------------+-----------------+
+|22 |176 |159 |1.454334 |
++---------------+---------------+---------------+-----------------+
+|23 |184 |0 |1.000000 |
++---------------+---------------+---------------+-----------------+
+|24 |192 |127 |0.969744 |
++---------------+---------------+---------------+-----------------+
+|25 |200 |191 |1.280246 |
++---------------+---------------+---------------+-----------------+
+|26 |208 |191 |1.230921 |
++---------------+---------------+---------------+-----------------+
+|27 |216 |0 |1.000000 |
++---------------+---------------+---------------+-----------------+
+|28 |224 |191 |1.143118 |
++---------------+---------------+---------------+-----------------+
+
+If rmid > rmid threshold, MBM total and local values should be multiplied
+by the correction factor.
+
+See:
+
+1. Erratum SKX99 in Intel Xeon Processor Scalable Family Specification Update:
+http://web.archive.org/web/20200716124958/https://www.intel.com/content/www/us/en/processors/xeon/scalable/xeon-scalable-spec-update.html
+
+2. Erratum BDF102 in Intel Xeon E5-2600 v4 Processor Product Family Specification Update:
+http://web.archive.org/web/20191125200531/https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-e5-v4-spec-update.pdf
+
+3. The errata in Intel Resource Director Technology (Intel RDT) on 2nd Generation Intel Xeon Scalable Processors Reference Manual:
+https://software.intel.com/content/www/us/en/develop/articles/intel-resource-director-technology-rdt-reference-manual.html
+
+for further information.
diff --git a/Documentation/filesystems/smb/ksmbd.rst b/Documentation/filesystems/smb/ksmbd.rst
index 6b30e43a0d11..67cb68ea6e68 100644
--- a/Documentation/filesystems/smb/ksmbd.rst
+++ b/Documentation/filesystems/smb/ksmbd.rst
@@ -13,7 +13,7 @@ KSMBD architecture
The subset of performance related operations belong in kernelspace and
the other subset which belong to operations which are not really related with
performance in userspace. So, DCE/RPC management that has historically resulted
-into number of buffer overflow issues and dangerous security bugs and user
+into a number of buffer overflow issues and dangerous security bugs and user
account management are implemented in user space as ksmbd.mountd.
File operations that are related with performance (open/read/write/close etc.)
in kernel space (ksmbd). This also allows for easier integration with VFS
@@ -24,8 +24,8 @@ ksmbd (kernel daemon)
When the server daemon is started, It starts up a forker thread
(ksmbd/interface name) at initialization time and open a dedicated port 445
-for listening to SMB requests. Whenever new clients make request, Forker
-thread will accept the client connection and fork a new thread for dedicated
+for listening to SMB requests. Whenever new clients make a request, the Forker
+thread will accept the client connection and fork a new thread for a dedicated
communication channel between the client and the server. It allows for parallel
processing of SMB requests(commands) from clients as well as allowing for new
clients to make new connections. Each instance is named ksmbd/1~n(port number)
@@ -34,12 +34,12 @@ thread can decide to pass through the commands to the user space (ksmbd.mountd),
currently DCE/RPC commands are identified to be handled through the user space.
To further utilize the linux kernel, it has been chosen to process the commands
as workitems and to be executed in the handlers of the ksmbd-io kworker threads.
-It allows for multiplexing of the handlers as the kernel take care of initiating
+It allows for multiplexing of the handlers as the kernel takes care of initiating
extra worker threads if the load is increased and vice versa, if the load is
-decreased it destroys the extra worker threads. So, after connection is
-established with client. Dedicated ksmbd/1..n(port number) takes complete
+decreased it destroys the extra worker threads. So, after the connection is
+established with the client. Dedicated ksmbd/1..n(port number) takes complete
ownership of receiving/parsing of SMB commands. Each received command is worked
-in parallel i.e., There can be multiple clients commands which are worked in
+in parallel i.e., there can be multiple client commands which are worked in
parallel. After receiving each command a separated kernel workitem is prepared
for each command which is further queued to be handled by ksmbd-io kworkers.
So, each SMB workitem is queued to the kworkers. This allows the benefit of load
@@ -49,9 +49,9 @@ performance by handling client commands in parallel.
ksmbd.mountd (user space daemon)
--------------------------------
-ksmbd.mountd is userspace process to, transfer user account and password that
+ksmbd.mountd is a userspace process to, transfer the user account and password that
are registered using ksmbd.adduser (part of utils for user space). Further it
-allows sharing information parameters that parsed from smb.conf to ksmbd in
+allows sharing information parameters that are parsed from smb.conf to ksmbd in
kernel. For the execution part it has a daemon which is continuously running
and connected to the kernel interface using netlink socket, it waits for the
requests (dcerpc and share/user info). It handles RPC calls (at a minimum few
@@ -124,7 +124,7 @@ How to run
1. Download ksmbd-tools(https://github.com/cifsd-team/ksmbd-tools/releases) and
compile them.
- - Refer README(https://github.com/cifsd-team/ksmbd-tools/blob/master/README.md)
+ - Refer to README(https://github.com/cifsd-team/ksmbd-tools/blob/master/README.md)
to know how to use ksmbd.mountd/adduser/addshare/control utils
$ ./autogen.sh
@@ -133,7 +133,7 @@ How to run
2. Create /usr/local/etc/ksmbd/ksmbd.conf file, add SMB share in ksmbd.conf file.
- - Refer ksmbd.conf.example in ksmbd-utils, See ksmbd.conf manpage
+ - Refer to ksmbd.conf.example in ksmbd-utils, See ksmbd.conf manpage
for details to configure shares.
$ man ksmbd.conf
@@ -145,7 +145,7 @@ How to run
$ man ksmbd.adduser
$ sudo ksmbd.adduser -a <Enter USERNAME for SMB share access>
-4. Insert ksmbd.ko module after build your kernel. No need to load module
+4. Insert the ksmbd.ko module after you build your kernel. No need to load the module
if ksmbd is built into the kernel.
- Set ksmbd in menuconfig(e.g. $ make menuconfig)
@@ -175,7 +175,7 @@ Each layer
1. Enable all component prints
# sudo ksmbd.control -d "all"
-2. Enable one of components (smb, auth, vfs, oplock, ipc, conn, rdma)
+2. Enable one of the components (smb, auth, vfs, oplock, ipc, conn, rdma)
# sudo ksmbd.control -d "smb"
3. Show what prints are enabled.
diff --git a/Documentation/filesystems/squashfs.rst b/Documentation/filesystems/squashfs.rst
index 4af8d6207509..45653b3228f9 100644
--- a/Documentation/filesystems/squashfs.rst
+++ b/Documentation/filesystems/squashfs.rst
@@ -6,7 +6,7 @@ Squashfs 4.0 Filesystem
Squashfs is a compressed read-only filesystem for Linux.
-It uses zlib, lz4, lzo, or xz compression to compress files, inodes and
+It uses zlib, lz4, lzo, xz or zstd compression to compress files, inodes and
directories. Inodes in the system are very small and all blocks are packed to
minimise data overhead. Block sizes greater than 4K are supported up to a
maximum of 1Mbytes (default block size 128K).
@@ -16,8 +16,8 @@ use (i.e. in cases where a .tar.gz file may be used), and in constrained
block device/memory systems (e.g. embedded systems) where low overhead is
needed.
-Mailing list: squashfs-devel@lists.sourceforge.net
-Web site: www.squashfs.org
+Mailing list (kernel code): linux-fsdevel@vger.kernel.org
+Web site: github.com/plougher/squashfs-tools
1. Filesystem Features
----------------------
@@ -58,11 +58,9 @@ inodes have different sizes).
As squashfs is a read-only filesystem, the mksquashfs program must be used to
create populated squashfs filesystems. This and other squashfs utilities
-can be obtained from http://www.squashfs.org. Usage instructions can be
-obtained from this site also.
-
-The squashfs-tools development tree is now located on kernel.org
- git://git.kernel.org/pub/scm/fs/squashfs/squashfs-tools.git
+are very likely packaged by your linux distribution (called squashfs-tools).
+The source code can be obtained from github.com/plougher/squashfs-tools.
+Usage instructions can also be obtained from this site.
2.1 Mount options
-----------------
diff --git a/Documentation/filesystems/sysv-fs.rst b/Documentation/filesystems/sysv-fs.rst
deleted file mode 100644
index 89e40911ad7c..000000000000
--- a/Documentation/filesystems/sysv-fs.rst
+++ /dev/null
@@ -1,264 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-==================
-SystemV Filesystem
-==================
-
-It implements all of
- - Xenix FS,
- - SystemV/386 FS,
- - Coherent FS.
-
-To install:
-
-* Answer the 'System V and Coherent filesystem support' question with 'y'
- when configuring the kernel.
-* To mount a disk or a partition, use::
-
- mount [-r] -t sysv device mountpoint
-
- The file system type names::
-
- -t sysv
- -t xenix
- -t coherent
-
- may be used interchangeably, but the last two will eventually disappear.
-
-Bugs in the present implementation:
-
-- Coherent FS:
-
- - The "free list interleave" n:m is currently ignored.
- - Only file systems with no filesystem name and no pack name are recognized.
- (See Coherent "man mkfs" for a description of these features.)
-
-- SystemV Release 2 FS:
-
- The superblock is only searched in the blocks 9, 15, 18, which
- corresponds to the beginning of track 1 on floppy disks. No support
- for this FS on hard disk yet.
-
-
-These filesystems are rather similar. Here is a comparison with Minix FS:
-
-* Linux fdisk reports on partitions
-
- - Minix FS 0x81 Linux/Minix
- - Xenix FS ??
- - SystemV FS ??
- - Coherent FS 0x08 AIX bootable
-
-* Size of a block or zone (data allocation unit on disk)
-
- - Minix FS 1024
- - Xenix FS 1024 (also 512 ??)
- - SystemV FS 1024 (also 512 and 2048)
- - Coherent FS 512
-
-* General layout: all have one boot block, one super block and
- separate areas for inodes and for directories/data.
- On SystemV Release 2 FS (e.g. Microport) the first track is reserved and
- all the block numbers (including the super block) are offset by one track.
-
-* Byte ordering of "short" (16 bit entities) on disk:
-
- - Minix FS little endian 0 1
- - Xenix FS little endian 0 1
- - SystemV FS little endian 0 1
- - Coherent FS little endian 0 1
-
- Of course, this affects only the file system, not the data of files on it!
-
-* Byte ordering of "long" (32 bit entities) on disk:
-
- - Minix FS little endian 0 1 2 3
- - Xenix FS little endian 0 1 2 3
- - SystemV FS little endian 0 1 2 3
- - Coherent FS PDP-11 2 3 0 1
-
- Of course, this affects only the file system, not the data of files on it!
-
-* Inode on disk: "short", 0 means non-existent, the root dir ino is:
-
- ================================= ==
- Minix FS 1
- Xenix FS, SystemV FS, Coherent FS 2
- ================================= ==
-
-* Maximum number of hard links to a file:
-
- =========== =========
- Minix FS 250
- Xenix FS ??
- SystemV FS ??
- Coherent FS >=10000
- =========== =========
-
-* Free inode management:
-
- - Minix FS
- a bitmap
- - Xenix FS, SystemV FS, Coherent FS
- There is a cache of a certain number of free inodes in the super-block.
- When it is exhausted, new free inodes are found using a linear search.
-
-* Free block management:
-
- - Minix FS
- a bitmap
- - Xenix FS, SystemV FS, Coherent FS
- Free blocks are organized in a "free list". Maybe a misleading term,
- since it is not true that every free block contains a pointer to
- the next free block. Rather, the free blocks are organized in chunks
- of limited size, and every now and then a free block contains pointers
- to the free blocks pertaining to the next chunk; the first of these
- contains pointers and so on. The list terminates with a "block number"
- 0 on Xenix FS and SystemV FS, with a block zeroed out on Coherent FS.
-
-* Super-block location:
-
- =========== ==========================
- Minix FS block 1 = bytes 1024..2047
- Xenix FS block 1 = bytes 1024..2047
- SystemV FS bytes 512..1023
- Coherent FS block 1 = bytes 512..1023
- =========== ==========================
-
-* Super-block layout:
-
- - Minix FS::
-
- unsigned short s_ninodes;
- unsigned short s_nzones;
- unsigned short s_imap_blocks;
- unsigned short s_zmap_blocks;
- unsigned short s_firstdatazone;
- unsigned short s_log_zone_size;
- unsigned long s_max_size;
- unsigned short s_magic;
-
- - Xenix FS, SystemV FS, Coherent FS::
-
- unsigned short s_firstdatazone;
- unsigned long s_nzones;
- unsigned short s_fzone_count;
- unsigned long s_fzones[NICFREE];
- unsigned short s_finode_count;
- unsigned short s_finodes[NICINOD];
- char s_flock;
- char s_ilock;
- char s_modified;
- char s_rdonly;
- unsigned long s_time;
- short s_dinfo[4]; -- SystemV FS only
- unsigned long s_free_zones;
- unsigned short s_free_inodes;
- short s_dinfo[4]; -- Xenix FS only
- unsigned short s_interleave_m,s_interleave_n; -- Coherent FS only
- char s_fname[6];
- char s_fpack[6];
-
- then they differ considerably:
-
- Xenix FS::
-
- char s_clean;
- char s_fill[371];
- long s_magic;
- long s_type;
-
- SystemV FS::
-
- long s_fill[12 or 14];
- long s_state;
- long s_magic;
- long s_type;
-
- Coherent FS::
-
- unsigned long s_unique;
-
- Note that Coherent FS has no magic.
-
-* Inode layout:
-
- - Minix FS::
-
- unsigned short i_mode;
- unsigned short i_uid;
- unsigned long i_size;
- unsigned long i_time;
- unsigned char i_gid;
- unsigned char i_nlinks;
- unsigned short i_zone[7+1+1];
-
- - Xenix FS, SystemV FS, Coherent FS::
-
- unsigned short i_mode;
- unsigned short i_nlink;
- unsigned short i_uid;
- unsigned short i_gid;
- unsigned long i_size;
- unsigned char i_zone[3*(10+1+1+1)];
- unsigned long i_atime;
- unsigned long i_mtime;
- unsigned long i_ctime;
-
-
-* Regular file data blocks are organized as
-
- - Minix FS:
-
- - 7 direct blocks
- - 1 indirect block (pointers to blocks)
- - 1 double-indirect block (pointer to pointers to blocks)
-
- - Xenix FS, SystemV FS, Coherent FS:
-
- - 10 direct blocks
- - 1 indirect block (pointers to blocks)
- - 1 double-indirect block (pointer to pointers to blocks)
- - 1 triple-indirect block (pointer to pointers to pointers to blocks)
-
-
- =========== ========== ================
- Inode size inodes per block
- =========== ========== ================
- Minix FS 32 32
- Xenix FS 64 16
- SystemV FS 64 16
- Coherent FS 64 8
- =========== ========== ================
-
-* Directory entry on disk
-
- - Minix FS::
-
- unsigned short inode;
- char name[14/30];
-
- - Xenix FS, SystemV FS, Coherent FS::
-
- unsigned short inode;
- char name[14];
-
- =========== ============== =====================
- Dir entry size dir entries per block
- =========== ============== =====================
- Minix FS 16/32 64/32
- Xenix FS 16 64
- SystemV FS 16 64
- Coherent FS 16 32
- =========== ============== =====================
-
-* How to implement symbolic links such that the host fsck doesn't scream:
-
- - Minix FS normal
- - Xenix FS kludge: as regular files with chmod 1000
- - SystemV FS ??
- - Coherent FS kludge: as regular files with chmod 1000
-
-
-Notation: We often speak of a "block" but mean a zone (the allocation unit)
-and not the disk driver's notion of "block".
diff --git a/Documentation/filesystems/tmpfs.rst b/Documentation/filesystems/tmpfs.rst
index 56a26c843dbe..d677e0428c3f 100644
--- a/Documentation/filesystems/tmpfs.rst
+++ b/Documentation/filesystems/tmpfs.rst
@@ -241,6 +241,28 @@ So 'mount -t tmpfs -o size=10G,nr_inodes=10k,mode=700 tmpfs /mytmpfs'
will give you tmpfs instance on /mytmpfs which can allocate 10GB
RAM/SWAP in 10240 inodes and it is only accessible by root.
+tmpfs has the following mounting options for case-insensitive lookup support:
+
+================= ==============================================================
+casefold Enable casefold support at this mount point using the given
+ argument as the encoding standard. Currently only UTF-8
+ encodings are supported. If no argument is used, it will load
+ the latest UTF-8 encoding available.
+strict_encoding Enable strict encoding at this mount point (disabled by
+ default). In this mode, the filesystem refuses to create file
+ and directory with names containing invalid UTF-8 characters.
+================= ==============================================================
+
+This option doesn't render the entire filesystem case-insensitive. One needs to
+still set the casefold flag per directory, by flipping the +F attribute in an
+empty directory. Nevertheless, new directories will inherit the attribute. The
+mountpoint itself cannot be made case-insensitive.
+
+Example::
+
+ $ mount -t tmpfs -o casefold=utf8-12.1.0,strict_encoding fs_name /mytmpfs
+ $ mount -t tmpfs -o casefold fs_name /mytmpfs
+
:Author:
Christoph Rohland <cr@sap.com>, 1.12.01
@@ -250,3 +272,5 @@ RAM/SWAP in 10240 inodes and it is only accessible by root.
KOSAKI Motohiro, 16 Mar 2010
:Updated:
Chris Down, 13 July 2020
+:Updated:
+ André Almeida, 23 Aug 2024
diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
index 6e903a903f8f..fd32a9a17bfb 100644
--- a/Documentation/filesystems/vfs.rst
+++ b/Documentation/filesystems/vfs.rst
@@ -495,7 +495,7 @@ As of kernel 2.6.22, the following members are defined:
int (*link) (struct dentry *,struct inode *,struct dentry *);
int (*unlink) (struct inode *,struct dentry *);
int (*symlink) (struct mnt_idmap *, struct inode *,struct dentry *,const char *);
- int (*mkdir) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t);
+ struct dentry *(*mkdir) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t);
int (*rmdir) (struct inode *,struct dentry *);
int (*mknod) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t,dev_t);
int (*rename) (struct mnt_idmap *, struct inode *, struct dentry *,
@@ -562,7 +562,26 @@ otherwise noted.
``mkdir``
called by the mkdir(2) system call. Only required if you want
to support creating subdirectories. You will probably need to
- call d_instantiate() just as you would in the create() method
+ call d_instantiate_new() just as you would in the create() method.
+
+ If d_instantiate_new() is not used and if the fh_to_dentry()
+ export operation is provided, or if the storage might be
+ accessible by another path (e.g. with a network filesystem)
+ then more care may be needed. Importantly d_instantate()
+ should not be used with an inode that is no longer I_NEW if there
+ any chance that the inode could already be attached to a dentry.
+ This is because of a hard rule in the VFS that a directory must
+ only ever have one dentry.
+
+ For example, if an NFS filesystem is mounted twice the new directory
+ could be visible on the other mount before it is on the original
+ mount, and a pair of name_to_handle_at(), open_by_handle_at()
+ calls could instantiate the directory inode with an IS_ROOT()
+ dentry before the first mkdir returns.
+
+ If there is any chance this could happen, then the new inode
+ should be d_drop()ed and attached with d_splice_alias(). The
+ returned dentry (if any) should be returned by ->mkdir().
``rmdir``
called by the rmdir(2) system call. Only required if you want
@@ -697,9 +716,8 @@ page lookup by address, and keeping track of pages tagged as Dirty or
Writeback.
The first can be used independently to the others. The VM can try to
-either write dirty pages in order to clean them, or release clean pages
-in order to reuse them. To do this it can call the ->writepage method
-on dirty pages, and ->release_folio on clean folios with the private
+release clean pages in order to reuse them. To do this it can call
+->release_folio on clean folios with the private
flag set. Clean pages without PagePrivate and with no external references
will be released without notice being given to the address_space.
@@ -712,8 +730,8 @@ maintains information about the PG_Dirty and PG_Writeback status of each
page, so that pages with either of these flags can be found quickly.
The Dirty tag is primarily used by mpage_writepages - the default
-->writepages method. It uses the tag to find dirty pages to call
-->writepage on. If mpage_writepages is not used (i.e. the address
+->writepages method. It uses the tag to find dirty pages to
+write back. If mpage_writepages is not used (i.e. the address
provides its own ->writepages) , the PAGECACHE_TAG_DIRTY tag is almost
unused. write_inode_now and sync_inode do use it (through
__sync_single_inode) to check if ->writepages has been successful in
@@ -737,23 +755,23 @@ pages, however the address_space has finer control of write sizes.
The read process essentially only requires 'read_folio'. The write
process is more complicated and uses write_begin/write_end or
-dirty_folio to write data into the address_space, and writepage and
+dirty_folio to write data into the address_space, and
writepages to writeback data to storage.
Adding and removing pages to/from an address_space is protected by the
inode's i_mutex.
When data is written to a page, the PG_Dirty flag should be set. It
-typically remains set until writepage asks for it to be written. This
+typically remains set until writepages asks for it to be written. This
should clear PG_Dirty and set PG_Writeback. It can be actually written
at any point after PG_Dirty is clear. Once it is known to be safe,
PG_Writeback is cleared.
Writeback makes use of a writeback_control structure to direct the
-operations. This gives the writepage and writepages operations some
+operations. This gives the writepages operation some
information about the nature of and reason for the writeback request,
and the constraints under which it is being done. It is also used to
-return information back to the caller about the result of a writepage or
+return information back to the caller about the result of a
writepages request.
@@ -800,7 +818,6 @@ cache in your filesystem. The following members are defined:
.. code-block:: c
struct address_space_operations {
- int (*writepage)(struct page *page, struct writeback_control *wbc);
int (*read_folio)(struct file *, struct folio *);
int (*writepages)(struct address_space *, struct writeback_control *);
bool (*dirty_folio)(struct address_space *, struct folio *);
@@ -810,7 +827,7 @@ cache in your filesystem. The following members are defined:
struct page **pagep, void **fsdata);
int (*write_end)(struct file *, struct address_space *mapping,
loff_t pos, unsigned len, unsigned copied,
- struct page *page, void *fsdata);
+ struct folio *folio, void *fsdata);
sector_t (*bmap)(struct address_space *, sector_t);
void (*invalidate_folio) (struct folio *, size_t start, size_t len);
bool (*release_folio)(struct folio *, gfp_t);
@@ -829,25 +846,6 @@ cache in your filesystem. The following members are defined:
int (*swap_rw)(struct kiocb *iocb, struct iov_iter *iter);
};
-``writepage``
- called by the VM to write a dirty page to backing store. This
- may happen for data integrity reasons (i.e. 'sync'), or to free
- up memory (flush). The difference can be seen in
- wbc->sync_mode. The PG_Dirty flag has been cleared and
- PageLocked is true. writepage should start writeout, should set
- PG_Writeback, and should make sure the page is unlocked, either
- synchronously or asynchronously when the write operation
- completes.
-
- If wbc->sync_mode is WB_SYNC_NONE, ->writepage doesn't have to
- try too hard if there are problems, and may choose to write out
- other pages from the mapping if that is easier (e.g. due to
- internal dependencies). If it chooses not to start writeout, it
- should return AOP_WRITEPAGE_ACTIVATE so that the VM will not
- keep calling ->writepage on that page.
-
- See the file "Locking" for more details.
-
``read_folio``
Called by the page cache to read a folio from the backing store.
The 'file' argument supplies authentication information to network
@@ -890,7 +888,7 @@ cache in your filesystem. The following members are defined:
given and that many pages should be written if possible. If no
->writepages is given, then mpage_writepages is used instead.
This will choose pages from the address space that are tagged as
- DIRTY and will pass them to ->writepage.
+ DIRTY and will write them back.
``dirty_folio``
called by the VM to mark a folio as dirty. This is particularly
@@ -913,8 +911,7 @@ cache in your filesystem. The following members are defined:
stop attempting I/O, it can simply return. The caller will
remove the remaining pages from the address space, unlock them
and decrement the page refcount. Set PageUptodate if the I/O
- completes successfully. Setting PageError on any page will be
- ignored; simply unlock the page if an I/O error occurs.
+ completes successfully.
``write_begin``
Called by the generic buffered write code to ask the filesystem
@@ -926,12 +923,12 @@ cache in your filesystem. The following members are defined:
(if they haven't been read already) so that the updated blocks
can be written out properly.
- The filesystem must return the locked pagecache page for the
- specified offset, in ``*pagep``, for the caller to write into.
+ The filesystem must return the locked pagecache folio for the
+ specified offset, in ``*foliop``, for the caller to write into.
It must be able to cope with short writes (where the length
passed to write_begin is greater than the number of bytes copied
- into the page).
+ into the folio).
A void * may be returned in fsdata, which then gets passed into
write_end.
@@ -944,8 +941,8 @@ cache in your filesystem. The following members are defined:
called. len is the original len passed to write_begin, and
copied is the amount that was able to be copied.
- The filesystem must take care of unlocking the page and
- releasing it refcount, and updating i_size.
+ The filesystem must take care of unlocking the folio,
+ decrementing its refcount, and updating i_size.
Returns < 0 on failure, otherwise the number of bytes (<=
'copied') that were able to be copied into pagecache.
@@ -1252,7 +1249,8 @@ defined:
.. code-block:: c
struct dentry_operations {
- int (*d_revalidate)(struct dentry *, unsigned int);
+ int (*d_revalidate)(struct inode *, const struct qstr *,
+ struct dentry *, unsigned int);
int (*d_weak_revalidate)(struct dentry *, unsigned int);
int (*d_hash)(const struct dentry *, struct qstr *);
int (*d_compare)(const struct dentry *,
@@ -1265,6 +1263,8 @@ defined:
struct vfsmount *(*d_automount)(struct path *);
int (*d_manage)(const struct path *, bool);
struct dentry *(*d_real)(struct dentry *, enum d_real_type type);
+ bool (*d_unalias_trylock)(const struct dentry *);
+ void (*d_unalias_unlock)(const struct dentry *);
};
``d_revalidate``
@@ -1390,9 +1390,7 @@ defined:
If a vfsmount is returned, the caller will attempt to mount it
on the mountpoint and will remove the vfsmount from its
- expiration list in the case of failure. The vfsmount should be
- returned with 2 refs on it to prevent automatic expiration - the
- caller will clean up the additional ref.
+ expiration list in the case of failure.
This function is only used if DCACHE_NEED_AUTOMOUNT is set on
the dentry. This is set by __d_instantiate() if S_AUTOMOUNT is
@@ -1428,6 +1426,25 @@ defined:
For non-regular files, the 'dentry' argument is returned.
+``d_unalias_trylock``
+ if present, will be called by d_splice_alias() before moving a
+ preexisting attached alias. Returning false prevents __d_move(),
+ making d_splice_alias() fail with -ESTALE.
+
+ Rationale: setting FS_RENAME_DOES_D_MOVE will prevent d_move()
+ and d_exchange() calls from the outside of filesystem methods;
+ however, it does not guarantee that attached dentries won't
+ be renamed or moved by d_splice_alias() finding a preexisting
+ alias for a directory inode. Normally we would not care;
+ however, something that wants to stabilize the entire path to
+ root over a blocking operation might need that. See 9p for one
+ (and hopefully only) example.
+
+``d_unalias_unlock``
+ should be paired with ``d_unalias_trylock``; that one is called after
+ __d_move() call in __d_unalias().
+
+
Each dentry has a pointer to its parent dentry, as well as a hash list
of child dentries. Child dentries are basically like files in a
directory.
diff --git a/Documentation/filesystems/xfs/xfs-delayed-logging-design.rst b/Documentation/filesystems/xfs/xfs-delayed-logging-design.rst
index 6402ab8e370c..2a2705e975e8 100644
--- a/Documentation/filesystems/xfs/xfs-delayed-logging-design.rst
+++ b/Documentation/filesystems/xfs/xfs-delayed-logging-design.rst
@@ -219,7 +219,7 @@ The log is circular, so the positions in the log are defined by the combination
of a cycle number - the number of times the log has been overwritten - and the
offset into the log. A LSN carries the cycle in the upper 32 bits and the
offset in the lower 32 bits. The offset is in units of "basic blocks" (512
-bytes). Hence we can do realtively simple LSN based math to keep track of
+bytes). Hence we can do relatively simple LSN based math to keep track of
available space in the log.
Log space accounting is done via a pair of constructs called "grant heads". The
diff --git a/Documentation/filesystems/xfs/xfs-maintainer-entry-profile.rst b/Documentation/filesystems/xfs/xfs-maintainer-entry-profile.rst
index 32b6ac4ca9d6..ce4584fb3103 100644
--- a/Documentation/filesystems/xfs/xfs-maintainer-entry-profile.rst
+++ b/Documentation/filesystems/xfs/xfs-maintainer-entry-profile.rst
@@ -93,7 +93,7 @@ others on a regular basis about burnout.
sponsoring work on any part of XFS.
- **LTS Maintainer**: Someone who backports and tests bug fixes from
- uptream to the LTS kernels.
+ upstream to the LTS kernels.
There tend to be six separate LTS trees at any given time.
The maintainer for a given LTS release should identify themselves with an
diff --git a/Documentation/filesystems/xfs/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs/xfs-online-fsck-design.rst
index 6333697ba3e8..e231d127cd40 100644
--- a/Documentation/filesystems/xfs/xfs-online-fsck-design.rst
+++ b/Documentation/filesystems/xfs/xfs-online-fsck-design.rst
@@ -2167,7 +2167,7 @@ The ``xfblob_free`` function frees a specific blob, and the ``xfblob_truncate``
function frees them all because compaction is not needed.
The details of repairing directories and extended attributes will be discussed
-in a subsequent section about atomic extent swapping.
+in a subsequent section about atomic file content exchanges.
However, it should be noted that these repair functions only use blob storage
to cache a small number of entries before adding them to a temporary ondisk
file, which is why compaction is not required.
@@ -2802,7 +2802,8 @@ follows this format:
Repairs for file-based metadata such as extended attributes, directories,
symbolic links, quota files and realtime bitmaps are performed by building a
-new structure attached to a temporary file and swapping the forks.
+new structure attached to a temporary file and exchanging all mappings in the
+file forks.
Afterward, the mappings in the old file fork are the candidate blocks for
disposal.
@@ -3851,8 +3852,8 @@ Because file forks can consume as much space as the entire filesystem, repairs
cannot be staged in memory, even when a paging scheme is available.
Therefore, online repair of file-based metadata createas a temporary file in
the XFS filesystem, writes a new structure at the correct offsets into the
-temporary file, and atomically swaps the fork mappings (and hence the fork
-contents) to commit the repair.
+temporary file, and atomically exchanges all file fork mappings (and hence the
+fork contents) to commit the repair.
Once the repair is complete, the old fork can be reaped as necessary; if the
system goes down during the reap, the iunlink code will delete the blocks
during log recovery.
@@ -3862,10 +3863,11 @@ consistent to use a temporary file safely!
This dependency is the reason why online repair can only use pageable kernel
memory to stage ondisk space usage information.
-Swapping metadata extents with a temporary file requires the owner field of the
-block headers to match the file being repaired and not the temporary file. The
-directory, extended attribute, and symbolic link functions were all modified to
-allow callers to specify owner numbers explicitly.
+Exchanging metadata file mappings with a temporary file requires the owner
+field of the block headers to match the file being repaired and not the
+temporary file.
+The directory, extended attribute, and symbolic link functions were all
+modified to allow callers to specify owner numbers explicitly.
There is a downside to the reaping process -- if the system crashes during the
reap phase and the fork extents are crosslinked, the iunlink processing will
@@ -3974,8 +3976,8 @@ The proposed patches are in the
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-tempfiles>`_
series.
-Atomic Extent Swapping
-----------------------
+Logged File Content Exchanges
+-----------------------------
Once repair builds a temporary file with a new data structure written into
it, it must commit the new changes into the existing file.
@@ -4010,17 +4012,21 @@ e. Old blocks in the file may be cross-linked with another structure and must
These problems are overcome by creating a new deferred operation and a new type
of log intent item to track the progress of an operation to exchange two file
ranges.
-The new deferred operation type chains together the same transactions used by
-the reverse-mapping extent swap code.
+The new exchange operation type chains together the same transactions used by
+the reverse-mapping extent swap code, but records intermedia progress in the
+log so that operations can be restarted after a crash.
+This new functionality is called the file contents exchange (xfs_exchrange)
+code.
+The underlying implementation exchanges file fork mappings (xfs_exchmaps).
The new log item records the progress of the exchange to ensure that once an
exchange begins, it will always run to completion, even there are
interruptions.
-The new ``XFS_SB_FEAT_INCOMPAT_LOG_ATOMIC_SWAP`` log-incompatible feature flag
+The new ``XFS_SB_FEAT_INCOMPAT_EXCHRANGE`` incompatible feature flag
in the superblock protects these new log item records from being replayed on
old kernels.
The proposed patchset is the
-`atomic extent swap
+`file contents exchange
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=atomic-file-updates>`_
series.
@@ -4047,9 +4053,6 @@ series.
| one ``struct rw_semaphore`` for each feature. |
| The log cleaning code tries to take this rwsem in exclusive mode to |
| clear the bit; if the lock attempt fails, the feature bit remains set. |
-| Filesystem code signals its intention to use a log incompat feature in a |
-| transaction by calling ``xlog_use_incompat_feat``, which takes the rwsem |
-| in shared mode. |
| The code supporting a log incompat feature should create wrapper |
| functions to obtain the log feature and call |
| ``xfs_add_incompat_log_feature`` to set the feature bits in the primary |
@@ -4064,72 +4067,73 @@ series.
| The feature bit will not be cleared from the superblock until the log |
| becomes clean. |
| |
-| Log-assisted extended attribute updates and atomic extent swaps both use |
-| log incompat features and provide convenience wrappers around the |
+| Log-assisted extended attribute updates and file content exchanges bothe |
+| use log incompat features and provide convenience wrappers around the |
| functionality. |
+--------------------------------------------------------------------------+
-Mechanics of an Atomic Extent Swap
-``````````````````````````````````
+Mechanics of a Logged File Content Exchange
+```````````````````````````````````````````
-Swapping entire file forks is a complex task.
+Exchanging contents between file forks is a complex task.
The goal is to exchange all file fork mappings between two file fork offset
ranges.
There are likely to be many extent mappings in each fork, and the edges of
the mappings aren't necessarily aligned.
-Furthermore, there may be other updates that need to happen after the swap,
+Furthermore, there may be other updates that need to happen after the exchange,
such as exchanging file sizes, inode flags, or conversion of fork data to local
format.
-This is roughly the format of the new deferred extent swap work item:
+This is roughly the format of the new deferred exchange-mapping work item:
.. code-block:: c
- struct xfs_swapext_intent {
+ struct xfs_exchmaps_intent {
/* Inodes participating in the operation. */
- struct xfs_inode *sxi_ip1;
- struct xfs_inode *sxi_ip2;
+ struct xfs_inode *xmi_ip1;
+ struct xfs_inode *xmi_ip2;
/* File offset range information. */
- xfs_fileoff_t sxi_startoff1;
- xfs_fileoff_t sxi_startoff2;
- xfs_filblks_t sxi_blockcount;
+ xfs_fileoff_t xmi_startoff1;
+ xfs_fileoff_t xmi_startoff2;
+ xfs_filblks_t xmi_blockcount;
/* Set these file sizes after the operation, unless negative. */
- xfs_fsize_t sxi_isize1;
- xfs_fsize_t sxi_isize2;
+ xfs_fsize_t xmi_isize1;
+ xfs_fsize_t xmi_isize2;
- /* XFS_SWAP_EXT_* log operation flags */
- uint64_t sxi_flags;
+ /* XFS_EXCHMAPS_* log operation flags */
+ uint64_t xmi_flags;
};
The new log intent item contains enough information to track two logical fork
offset ranges: ``(inode1, startoff1, blockcount)`` and ``(inode2, startoff2,
blockcount)``.
-Each step of a swap operation exchanges the largest file range mapping possible
-from one file to the other.
-After each step in the swap operation, the two startoff fields are incremented
-and the blockcount field is decremented to reflect the progress made.
-The flags field captures behavioral parameters such as swapping the attr fork
-instead of the data fork and other work to be done after the extent swap.
-The two isize fields are used to swap the file size at the end of the operation
-if the file data fork is the target of the swap operation.
-
-When the extent swap is initiated, the sequence of operations is as follows:
-
-1. Create a deferred work item for the extent swap.
- At the start, it should contain the entirety of the file ranges to be
- swapped.
+Each step of an exchange operation exchanges the largest file range mapping
+possible from one file to the other.
+After each step in the exchange operation, the two startoff fields are
+incremented and the blockcount field is decremented to reflect the progress
+made.
+The flags field captures behavioral parameters such as exchanging attr fork
+mappings instead of the data fork and other work to be done after the exchange.
+The two isize fields are used to exchange the file sizes at the end of the
+operation if the file data fork is the target of the operation.
+
+When the exchange is initiated, the sequence of operations is as follows:
+
+1. Create a deferred work item for the file mapping exchange.
+ At the start, it should contain the entirety of the file block ranges to be
+ exchanged.
2. Call ``xfs_defer_finish`` to process the exchange.
- This is encapsulated in ``xrep_tempswap_contents`` for scrub operations.
+ This is encapsulated in ``xrep_tempexch_contents`` for scrub operations.
This will log an extent swap intent item to the transaction for the deferred
- extent swap work item.
+ mapping exchange work item.
-3. Until ``sxi_blockcount`` of the deferred extent swap work item is zero,
+3. Until ``xmi_blockcount`` of the deferred mapping exchange work item is zero,
- a. Read the block maps of both file ranges starting at ``sxi_startoff1`` and
- ``sxi_startoff2``, respectively, and compute the longest extent that can
- be swapped in a single step.
+ a. Read the block maps of both file ranges starting at ``xmi_startoff1`` and
+ ``xmi_startoff2``, respectively, and compute the longest extent that can
+ be exchanged in a single step.
This is the minimum of the two ``br_blockcount`` s in the mappings.
Keep advancing through the file forks until at least one of the mappings
contains written blocks.
@@ -4151,20 +4155,20 @@ When the extent swap is initiated, the sequence of operations is as follows:
g. Extend the ondisk size of either file if necessary.
- h. Log an extent swap done log item for the extent swap intent log item
- that was read at the start of step 3.
+ h. Log a mapping exchange done log item for th mapping exchange intent log
+ item that was read at the start of step 3.
i. Compute the amount of file range that has just been covered.
This quantity is ``(map1.br_startoff + map1.br_blockcount -
- sxi_startoff1)``, because step 3a could have skipped holes.
+ xmi_startoff1)``, because step 3a could have skipped holes.
- j. Increase the starting offsets of ``sxi_startoff1`` and ``sxi_startoff2``
+ j. Increase the starting offsets of ``xmi_startoff1`` and ``xmi_startoff2``
by the number of blocks computed in the previous step, and decrease
- ``sxi_blockcount`` by the same quantity.
+ ``xmi_blockcount`` by the same quantity.
This advances the cursor.
- k. Log a new extent swap intent log item reflecting the advanced state of
- the work item.
+ k. Log a new mapping exchange intent log item reflecting the advanced state
+ of the work item.
l. Return the proper error code (EAGAIN) to the deferred operation manager
to inform it that there is more work to be done.
@@ -4175,22 +4179,23 @@ When the extent swap is initiated, the sequence of operations is as follows:
This will be discussed in more detail in subsequent sections.
If the filesystem goes down in the middle of an operation, log recovery will
-find the most recent unfinished extent swap log intent item and restart from
-there.
-This is how extent swapping guarantees that an outside observer will either see
-the old broken structure or the new one, and never a mismash of both.
+find the most recent unfinished maping exchange log intent item and restart
+from there.
+This is how atomic file mapping exchanges guarantees that an outside observer
+will either see the old broken structure or the new one, and never a mismash of
+both.
-Preparation for Extent Swapping
-```````````````````````````````
+Preparation for File Content Exchanges
+``````````````````````````````````````
There are a few things that need to be taken care of before initiating an
-atomic extent swap operation.
+atomic file mapping exchange operation.
First, regular files require the page cache to be flushed to disk before the
operation begins, and directio writes to be quiesced.
-Like any filesystem operation, extent swapping must determine the maximum
-amount of disk space and quota that can be consumed on behalf of both files in
-the operation, and reserve that quantity of resources to avoid an unrecoverable
-out of space failure once it starts dirtying metadata.
+Like any filesystem operation, file mapping exchanges must determine the
+maximum amount of disk space and quota that can be consumed on behalf of both
+files in the operation, and reserve that quantity of resources to avoid an
+unrecoverable out of space failure once it starts dirtying metadata.
The preparation step scans the ranges of both files to estimate:
- Data device blocks needed to handle the repeated updates to the fork
@@ -4204,56 +4209,59 @@ The preparation step scans the ranges of both files to estimate:
to different extents on the realtime volume, which could happen if the
operation fails to run to completion.
-The need for precise estimation increases the run time of the swap operation,
-but it is very important to maintain correct accounting.
-The filesystem must not run completely out of free space, nor can the extent
-swap ever add more extent mappings to a fork than it can support.
+The need for precise estimation increases the run time of the exchange
+operation, but it is very important to maintain correct accounting.
+The filesystem must not run completely out of free space, nor can the mapping
+exchange ever add more extent mappings to a fork than it can support.
Regular users are required to abide the quota limits, though metadata repairs
may exceed quota to resolve inconsistent metadata elsewhere.
-Special Features for Swapping Metadata File Extents
-```````````````````````````````````````````````````
+Special Features for Exchanging Metadata File Contents
+``````````````````````````````````````````````````````
Extended attributes, symbolic links, and directories can set the fork format to
"local" and treat the fork as a literal area for data storage.
Metadata repairs must take extra steps to support these cases:
- If both forks are in local format and the fork areas are large enough, the
- swap is performed by copying the incore fork contents, logging both forks,
- and committing.
- The atomic extent swap mechanism is not necessary, since this can be done
- with a single transaction.
+ exchange is performed by copying the incore fork contents, logging both
+ forks, and committing.
+ The atomic file mapping exchange mechanism is not necessary, since this can
+ be done with a single transaction.
-- If both forks map blocks, then the regular atomic extent swap is used.
+- If both forks map blocks, then the regular atomic file mapping exchange is
+ used.
- Otherwise, only one fork is in local format.
The contents of the local format fork are converted to a block to perform the
- swap.
+ exchange.
The conversion to block format must be done in the same transaction that
- logs the initial extent swap intent log item.
- The regular atomic extent swap is used to exchange the mappings.
- Special flags are set on the swap operation so that the transaction can be
- rolled one more time to convert the second file's fork back to local format
- so that the second file will be ready to go as soon as the ILOCK is dropped.
+ logs the initial mapping exchange intent log item.
+ The regular atomic mapping exchange is used to exchange the metadata file
+ mappings.
+ Special flags are set on the exchange operation so that the transaction can
+ be rolled one more time to convert the second file's fork back to local
+ format so that the second file will be ready to go as soon as the ILOCK is
+ dropped.
Extended attributes and directories stamp the owning inode into every block,
but the buffer verifiers do not actually check the inode number!
Although there is no verification, it is still important to maintain
-referential integrity, so prior to performing the extent swap, online repair
-builds every block in the new data structure with the owner field of the file
-being repaired.
+referential integrity, so prior to performing the mapping exchange, online
+repair builds every block in the new data structure with the owner field of the
+file being repaired.
-After a successful swap operation, the repair operation must reap the old fork
-blocks by processing each fork mapping through the standard :ref:`file extent
-reaping <reaping>` mechanism that is done post-repair.
+After a successful exchange operation, the repair operation must reap the old
+fork blocks by processing each fork mapping through the standard :ref:`file
+extent reaping <reaping>` mechanism that is done post-repair.
If the filesystem should go down during the reap part of the repair, the
iunlink processing at the end of recovery will free both the temporary file and
whatever blocks were not reaped.
However, this iunlink processing omits the cross-link detection of online
repair, and is not completely foolproof.
-Swapping Temporary File Extents
-```````````````````````````````
+Exchanging Temporary File Contents
+``````````````````````````````````
To repair a metadata file, online repair proceeds as follows:
@@ -4263,14 +4271,14 @@ To repair a metadata file, online repair proceeds as follows:
file.
The same fork must be written to as is being repaired.
-3. Commit the scrub transaction, since the swap estimation step must be
- completed before transaction reservations are made.
+3. Commit the scrub transaction, since the exchange resource estimation step
+ must be completed before transaction reservations are made.
-4. Call ``xrep_tempswap_trans_alloc`` to allocate a new scrub transaction with
+4. Call ``xrep_tempexch_trans_alloc`` to allocate a new scrub transaction with
the appropriate resource reservations, locks, and fill out a ``struct
- xfs_swapext_req`` with the details of the swap operation.
+ xfs_exchmaps_req`` with the details of the exchange operation.
-5. Call ``xrep_tempswap_contents`` to swap the contents.
+5. Call ``xrep_tempexch_contents`` to exchange the contents.
6. Commit the transaction to complete the repair.
@@ -4312,7 +4320,7 @@ To check the summary file against the bitmap:
3. Compare the contents of the xfile against the ondisk file.
To repair the summary file, write the xfile contents into the temporary file
-and use atomic extent swap to commit the new contents.
+and use atomic mapping exchange to commit the new contents.
The temporary file is then reaped.
The proposed patchset is the
@@ -4355,8 +4363,8 @@ Salvaging extended attributes is done as follows:
memory or there are no more attr fork blocks to examine, unlock the file and
add the staged extended attributes to the temporary file.
-3. Use atomic extent swapping to exchange the new and old extended attribute
- structures.
+3. Use atomic file mapping exchange to exchange the new and old extended
+ attribute structures.
The old attribute blocks are now attached to the temporary file.
4. Reap the temporary file.
@@ -4413,7 +4421,8 @@ salvaging directories is straightforward:
directory and add the staged dirents into the temporary directory.
Truncate the staging files.
-4. Use atomic extent swapping to exchange the new and old directory structures.
+4. Use atomic file mapping exchange to exchange the new and old directory
+ structures.
The old directory blocks are now attached to the temporary file.
5. Reap the temporary file.
@@ -4456,10 +4465,10 @@ reconstruction of filesystem space metadata.
The parent pointer feature, however, makes total directory reconstruction
possible.
-XFS parent pointers include the dirent name and location of the entry within
-the parent directory.
+XFS parent pointers contain the information needed to identify the
+corresponding directory entry in the parent directory.
In other words, child files use extended attributes to store pointers to
-parents in the form ``(parent_inum, parent_gen, dirent_pos) → (dirent_name)``.
+parents in the form ``(dirent_name) → (parent_inum, parent_gen)``.
The directory checking process can be strengthened to ensure that the target of
each dirent also contains a parent pointer pointing back to the dirent.
Likewise, each parent pointer can be checked by ensuring that the target of
@@ -4467,8 +4476,6 @@ each parent pointer is a directory and that it contains a dirent matching
the parent pointer.
Both online and offline repair can use this strategy.
-**Note**: The ondisk format of parent pointers is not yet finalized.
-
+--------------------------------------------------------------------------+
| **Historical Sidebar**: |
+--------------------------------------------------------------------------+
@@ -4510,8 +4517,58 @@ Both online and offline repair can use this strategy.
| Chandan increased the maximum extent counts of both data and attribute |
| forks, thereby ensuring that the extended attribute structure can grow |
| to handle the maximum hardlink count of any file. |
+| |
+| For this second effort, the ondisk parent pointer format as originally |
+| proposed was ``(parent_inum, parent_gen, dirent_pos) → (dirent_name)``. |
+| The format was changed during development to eliminate the requirement |
+| of repair tools needing to ensure that the ``dirent_pos`` field always |
+| matched when reconstructing a directory. |
+| |
+| There were a few other ways to have solved that problem: |
+| |
+| 1. The field could be designated advisory, since the other three values |
+| are sufficient to find the entry in the parent. |
+| However, this makes indexed key lookup impossible while repairs are |
+| ongoing. |
+| |
+| 2. We could allow creating directory entries at specified offsets, which |
+| solves the referential integrity problem but runs the risk that |
+| dirent creation will fail due to conflicts with the free space in the |
+| directory. |
+| |
+| These conflicts could be resolved by appending the directory entry |
+| and amending the xattr code to support updating an xattr key and |
+| reindexing the dabtree, though this would have to be performed with |
+| the parent directory still locked. |
+| |
+| 3. Same as above, but remove the old parent pointer entry and add a new |
+| one atomically. |
+| |
+| 4. Change the ondisk xattr format to |
+| ``(parent_inum, name) → (parent_gen)``, which would provide the attr |
+| name uniqueness that we require, without forcing repair code to |
+| update the dirent position. |
+| Unfortunately, this requires changes to the xattr code to support |
+| attr names as long as 263 bytes. |
+| |
+| 5. Change the ondisk xattr format to ``(parent_inum, hash(name)) → |
+| (name, parent_gen)``. |
+| If the hash is sufficiently resistant to collisions (e.g. sha256) |
+| then this should provide the attr name uniqueness that we require. |
+| Names shorter than 247 bytes could be stored directly. |
+| |
+| 6. Change the ondisk xattr format to ``(dirent_name) → (parent_ino, |
+| parent_gen)``. This format doesn't require any of the complicated |
+| nested name hashing of the previous suggestions. However, it was |
+| discovered that multiple hardlinks to the same inode with the same |
+| filename caused performance problems with hashed xattr lookups, so |
+| the parent inumber is now xor'd into the hash index. |
+| |
+| In the end, it was decided that solution #6 was the most compact and the |
+| most performant. A new hash function was designed for parent pointers. |
+--------------------------------------------------------------------------+
+
Case Study: Repairing Directories with Parent Pointers
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -4519,8 +4576,9 @@ Directory rebuilding uses a :ref:`coordinated inode scan <iscan>` and
a :ref:`directory entry live update hook <liveupdate>` as follows:
1. Set up a temporary directory for generating the new directory structure,
- an xfblob for storing entry names, and an xfarray for stashing directory
- updates.
+ an xfblob for storing entry names, and an xfarray for stashing the fixed
+ size fields involved in a directory update: ``(child inumber, add vs.
+ remove, name cookie, ftype)``.
2. Set up an inode scanner and hook into the directory entry code to receive
updates on directory operations.
@@ -4529,73 +4587,36 @@ a :ref:`directory entry live update hook <liveupdate>` as follows:
pointer references the directory of interest.
If so:
- a. Stash an addname entry for this dirent in the xfarray for later.
+ a. Stash the parent pointer name and an addname entry for this dirent in the
+ xfblob and xfarray, respectively.
- b. When finished scanning that file, flush the stashed updates to the
- temporary directory.
+ b. When finished scanning that file or the kernel memory consumption exceeds
+ a threshold, flush the stashed updates to the temporary directory.
4. For each live directory update received via the hook, decide if the child
has already been scanned.
If so:
- a. Stash an addname or removename entry for this dirent update in the
- xfarray for later.
+ a. Stash the parent pointer name an addname or removename entry for this
+ dirent update in the xfblob and xfarray for later.
We cannot write directly to the temporary directory because hook
functions are not allowed to modify filesystem metadata.
Instead, we stash updates in the xfarray and rely on the scanner thread
to apply the stashed updates to the temporary directory.
-5. When the scan is complete, atomically swap the contents of the temporary
+5. When the scan is complete, replay any stashed entries in the xfarray.
+
+6. When the scan is complete, atomically exchange the contents of the temporary
directory and the directory being repaired.
The temporary directory now contains the damaged directory structure.
-6. Reap the temporary directory.
-
-7. Update the dirent position field of parent pointers as necessary.
- This may require the queuing of a substantial number of xattr log intent
- items.
+7. Reap the temporary directory.
The proposed patchset is the
`parent pointers directory repair
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=pptrs-online-dir-repair>`_
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=pptrs-fsck>`_
series.
-**Unresolved Question**: How will repair ensure that the ``dirent_pos`` fields
-match in the reconstructed directory?
-
-*Answer*: There are a few ways to solve this problem:
-
-1. The field could be designated advisory, since the other three values are
- sufficient to find the entry in the parent.
- However, this makes indexed key lookup impossible while repairs are ongoing.
-
-2. We could allow creating directory entries at specified offsets, which solves
- the referential integrity problem but runs the risk that dirent creation
- will fail due to conflicts with the free space in the directory.
-
- These conflicts could be resolved by appending the directory entry and
- amending the xattr code to support updating an xattr key and reindexing the
- dabtree, though this would have to be performed with the parent directory
- still locked.
-
-3. Same as above, but remove the old parent pointer entry and add a new one
- atomically.
-
-4. Change the ondisk xattr format to ``(parent_inum, name) → (parent_gen)``,
- which would provide the attr name uniqueness that we require, without
- forcing repair code to update the dirent position.
- Unfortunately, this requires changes to the xattr code to support attr
- names as long as 263 bytes.
-
-5. Change the ondisk xattr format to ``(parent_inum, hash(name)) →
- (name, parent_gen)``.
- If the hash is sufficiently resistant to collisions (e.g. sha256) then
- this should provide the attr name uniqueness that we require.
- Names shorter than 247 bytes could be stored directly.
-
-Discussion is ongoing under the `parent pointers patch deluge
-<https://www.spinics.net/lists/linux-xfs/msg69397.html>`_.
-
Case Study: Repairing Parent Pointers
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -4603,8 +4624,9 @@ Online reconstruction of a file's parent pointer information works similarly to
directory reconstruction:
1. Set up a temporary file for generating a new extended attribute structure,
- an `xfblob<xfblob>` for storing parent pointer names, and an xfarray for
- stashing parent pointer updates.
+ an xfblob for storing parent pointer names, and an xfarray for stashing the
+ fixed size fields involved in a parent pointer update: ``(parent inumber,
+ parent generation, add vs. remove, name cookie)``.
2. Set up an inode scanner and hook into the directory entry code to receive
updates on directory operations.
@@ -4613,34 +4635,36 @@ directory reconstruction:
dirent references the file of interest.
If so:
- a. Stash an addpptr entry for this parent pointer in the xfblob and xfarray
- for later.
+ a. Stash the dirent name and an addpptr entry for this parent pointer in the
+ xfblob and xfarray, respectively.
- b. When finished scanning the directory, flush the stashed updates to the
- temporary directory.
+ b. When finished scanning the directory or the kernel memory consumption
+ exceeds a threshold, flush the stashed updates to the temporary file.
4. For each live directory update received via the hook, decide if the parent
has already been scanned.
If so:
- a. Stash an addpptr or removepptr entry for this dirent update in the
- xfarray for later.
+ a. Stash the dirent name and an addpptr or removepptr entry for this dirent
+ update in the xfblob and xfarray for later.
We cannot write parent pointers directly to the temporary file because
hook functions are not allowed to modify filesystem metadata.
Instead, we stash updates in the xfarray and rely on the scanner thread
to apply the stashed parent pointer updates to the temporary file.
-5. Copy all non-parent pointer extended attributes to the temporary file.
+5. When the scan is complete, replay any stashed entries in the xfarray.
+
+6. Copy all non-parent pointer extended attributes to the temporary file.
-6. When the scan is complete, atomically swap the attribute fork of the
- temporary file and the file being repaired.
+7. When the scan is complete, atomically exchange the mappings of the attribute
+ forks of the temporary file and the file being repaired.
The temporary file now contains the damaged extended attribute structure.
-7. Reap the temporary file.
+8. Reap the temporary file.
The proposed patchset is the
`parent pointers repair
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=pptrs-online-parent-repair>`_
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=pptrs-fsck>`_
series.
Digression: Offline Checking of Parent Pointers
@@ -4651,26 +4675,56 @@ files are erased long before directory tree connectivity checks are performed.
Parent pointer checks are therefore a second pass to be added to the existing
connectivity checks:
-1. After the set of surviving files has been established (i.e. phase 6),
+1. After the set of surviving files has been established (phase 6),
walk the surviving directories of each AG in the filesystem.
This is already performed as part of the connectivity checks.
-2. For each directory entry found, record the name in an xfblob, and store
- ``(child_ag_inum, parent_inum, parent_gen, dirent_pos)`` tuples in a
- per-AG in-memory slab.
+2. For each directory entry found,
+
+ a. If the name has already been stored in the xfblob, then use that cookie
+ and skip the next step.
+
+ b. Otherwise, record the name in an xfblob, and remember the xfblob cookie.
+ Unique mappings are critical for
+
+ 1. Deduplicating names to reduce memory usage, and
+
+ 2. Creating a stable sort key for the parent pointer indexes so that the
+ parent pointer validation described below will work.
+
+ c. Store ``(child_ag_inum, parent_inum, parent_gen, name_hash, name_len,
+ name_cookie)`` tuples in a per-AG in-memory slab. The ``name_hash``
+ referenced in this section is the regular directory entry name hash, not
+ the specialized one used for parent pointer xattrs.
3. For each AG in the filesystem,
- a. Sort the per-AG tuples in order of child_ag_inum, parent_inum, and
- dirent_pos.
+ a. Sort the per-AG tuple set in order of ``child_ag_inum``, ``parent_inum``,
+ ``name_hash``, and ``name_cookie``.
+ Having a single ``name_cookie`` for each ``name`` is critical for
+ handling the uncommon case of a directory containing multiple hardlinks
+ to the same file where all the names hash to the same value.
b. For each inode in the AG,
1. Scan the inode for parent pointers.
- Record the names in a per-file xfblob, and store ``(parent_inum,
- parent_gen, dirent_pos)`` tuples in a per-file slab.
+ For each parent pointer found,
+
+ a. Validate the ondisk parent pointer.
+ If validation fails, move on to the next parent pointer in the
+ file.
+
+ b. If the name has already been stored in the xfblob, then use that
+ cookie and skip the next step.
+
+ c. Record the name in a per-file xfblob, and remember the xfblob
+ cookie.
- 2. Sort the per-file tuples in order of parent_inum, and dirent_pos.
+ d. Store ``(parent_inum, parent_gen, name_hash, name_len,
+ name_cookie)`` tuples in a per-file slab.
+
+ 2. Sort the per-file tuples in order of ``parent_inum``, ``name_hash``,
+ and ``name_cookie``.
3. Position one slab cursor at the start of the inode's records in the
per-AG tuple slab.
@@ -4679,28 +4733,37 @@ connectivity checks:
4. Position a second slab cursor at the start of the per-file tuple slab.
- 5. Iterate the two cursors in lockstep, comparing the parent_ino and
- dirent_pos fields of the records under each cursor.
+ 5. Iterate the two cursors in lockstep, comparing the ``parent_ino``,
+ ``name_hash``, and ``name_cookie`` fields of the records under each
+ cursor:
- a. Tuples in the per-AG list but not the per-file list are missing and
- need to be written to the inode.
+ a. If the per-AG cursor is at a lower point in the keyspace than the
+ per-file cursor, then the per-AG cursor points to a missing parent
+ pointer.
+ Add the parent pointer to the inode and advance the per-AG
+ cursor.
- b. Tuples in the per-file list but not the per-AG list are dangling
- and need to be removed from the inode.
+ b. If the per-file cursor is at a lower point in the keyspace than
+ the per-AG cursor, then the per-file cursor points to a dangling
+ parent pointer.
+ Remove the parent pointer from the inode and advance the per-file
+ cursor.
- c. For tuples in both lists, update the parent_gen and name components
- of the parent pointer if necessary.
+ c. Otherwise, both cursors point at the same parent pointer.
+ Update the parent_gen component if necessary.
+ Advance both cursors.
4. Move on to examining link counts, as we do today.
The proposed patchset is the
`offline parent pointers repair
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=pptrs-repair>`_
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=pptrs-fsck>`_
series.
-Rebuilding directories from parent pointers in offline repair is very
-challenging because it currently uses a single-pass scan of the filesystem
-during phase 3 to decide which files are corrupt enough to be zapped.
+Rebuilding directories from parent pointers in offline repair would be very
+challenging because xfs_repair currently uses two single-pass scans of the
+filesystem during phases 3 and 4 to decide which files are corrupt enough to be
+zapped.
This scan would have to be converted into a multi-pass scan:
1. The first pass of the scan zaps corrupt inodes, forks, and attributes
@@ -4722,6 +4785,130 @@ This scan would have to be converted into a multi-pass scan:
This code has not yet been constructed.
+.. _dirtree:
+
+Case Study: Directory Tree Structure
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+As mentioned earlier, the filesystem directory tree is supposed to be a
+directed acylic graph structure.
+However, each node in this graph is a separate ``xfs_inode`` object with its
+own locks, which makes validating the tree qualities difficult.
+Fortunately, non-directories are allowed to have multiple parents and cannot
+have children, so only directories need to be scanned.
+Directories typically constitute 5-10% of the files in a filesystem, which
+reduces the amount of work dramatically.
+
+If the directory tree could be frozen, it would be easy to discover cycles and
+disconnected regions by running a depth (or breadth) first search downwards
+from the root directory and marking a bitmap for each directory found.
+At any point in the walk, trying to set an already set bit means there is a
+cycle.
+After the scan completes, XORing the marked inode bitmap with the inode
+allocation bitmap reveals disconnected inodes.
+However, one of online repair's design goals is to avoid locking the entire
+filesystem unless it's absolutely necessary.
+Directory tree updates can move subtrees across the scanner wavefront on a live
+filesystem, so the bitmap algorithm cannot be applied.
+
+Directory parent pointers enable an incremental approach to validation of the
+tree structure.
+Instead of using one thread to scan the entire filesystem, multiple threads can
+walk from individual subdirectories upwards towards the root.
+For this to work, all directory entries and parent pointers must be internally
+consistent, each directory entry must have a parent pointer, and the link
+counts of all directories must be correct.
+Each scanner thread must be able to take the IOLOCK of an alleged parent
+directory while holding the IOLOCK of the child directory to prevent either
+directory from being moved within the tree.
+This is not possible since the VFS does not take the IOLOCK of a child
+subdirectory when moving that subdirectory, so instead the scanner stabilizes
+the parent -> child relationship by taking the ILOCKs and installing a dirent
+update hook to detect changes.
+
+The scanning process uses a dirent hook to detect changes to the directories
+mentioned in the scan data.
+The scan works as follows:
+
+1. For each subdirectory in the filesystem,
+
+ a. For each parent pointer of that subdirectory,
+
+ 1. Create a path object for that parent pointer, and mark the
+ subdirectory inode number in the path object's bitmap.
+
+ 2. Record the parent pointer name and inode number in a path structure.
+
+ 3. If the alleged parent is the subdirectory being scrubbed, the path is
+ a cycle.
+ Mark the path for deletion and repeat step 1a with the next
+ subdirectory parent pointer.
+
+ 4. Try to mark the alleged parent inode number in a bitmap in the path
+ object.
+ If the bit is already set, then there is a cycle in the directory
+ tree.
+ Mark the path as a cycle and repeat step 1a with the next subdirectory
+ parent pointer.
+
+ 5. Load the alleged parent.
+ If the alleged parent is not a linked directory, abort the scan
+ because the parent pointer information is inconsistent.
+
+ 6. For each parent pointer of this alleged ancestor directory,
+
+ a. Record the parent pointer name and inode number in the path object
+ if no parent has been set for that level.
+
+ b. If an ancestor has more than one parent, mark the path as corrupt.
+ Repeat step 1a with the next subdirectory parent pointer.
+
+ c. Repeat steps 1a3-1a6 for the ancestor identified in step 1a6a.
+ This repeats until the directory tree root is reached or no parents
+ are found.
+
+ 7. If the walk terminates at the root directory, mark the path as ok.
+
+ 8. If the walk terminates without reaching the root, mark the path as
+ disconnected.
+
+2. If the directory entry update hook triggers, check all paths already found
+ by the scan.
+ If the entry matches part of a path, mark that path and the scan stale.
+ When the scanner thread sees that the scan has been marked stale, it deletes
+ all scan data and starts over.
+
+Repairing the directory tree works as follows:
+
+1. Walk each path of the target subdirectory.
+
+ a. Corrupt paths and cycle paths are counted as suspect.
+
+ b. Paths already marked for deletion are counted as bad.
+
+ c. Paths that reached the root are counted as good.
+
+2. If the subdirectory is either the root directory or has zero link count,
+ delete all incoming directory entries in the immediate parents.
+ Repairs are complete.
+
+3. If the subdirectory has exactly one path, set the dotdot entry to the
+ parent and exit.
+
+4. If the subdirectory has at least one good path, delete all the other
+ incoming directory entries in the immediate parents.
+
+5. If the subdirectory has no good paths and more than one suspect path, delete
+ all the other incoming directory entries in the immediate parents.
+
+6. If the subdirectory has zero paths, attach it to the lost and found.
+
+The proposed patches are in the
+`directory tree repair
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-directory-tree>`_
+series.
+
+
.. _orphanage:
The Orphanage
@@ -4769,14 +4956,22 @@ Orphaned files are adopted by the orphanage as follows:
The ``xrep_orphanage_iolock_two`` function follows the inode locking
strategy discussed earlier.
-3. Call ``xrep_orphanage_compute_blkres`` and ``xrep_orphanage_compute_name``
- to compute the new name in the orphanage and the block reservation required.
-
-4. Use ``xrep_orphanage_adoption_prep`` to reserve resources to the repair
+3. Use ``xrep_adoption_trans_alloc`` to reserve resources to the repair
transaction.
-5. Call ``xrep_orphanage_adopt`` to reparent the orphaned file into the lost
- and found, and update the kernel dentry cache.
+4. Call ``xrep_orphanage_compute_name`` to compute the new name in the
+ orphanage.
+
+5. If the adoption is going to happen, call ``xrep_adoption_reparent`` to
+ reparent the orphaned file into the lost and found and invalidate the dentry
+ cache.
+
+6. Call ``xrep_adoption_finish`` to commit any filesystem updates, release the
+ orphanage ILOCK, and clean the scrub transaction. Call
+ ``xrep_adoption_commit`` to commit the updates and the scrub transaction.
+
+7. If a runtime error happens, call ``xrep_adoption_cancel`` to release all
+ resources.
The proposed patches are in the
`orphanage adoption
@@ -5108,18 +5303,18 @@ make it easier for code readers to understand what has been built, for whom it
has been built, and why.
Please feel free to contact the XFS mailing list with questions.
-FIEXCHANGE_RANGE
-----------------
+XFS_IOC_EXCHANGE_RANGE
+----------------------
-As discussed earlier, a second frontend to the atomic extent swap mechanism is
-a new ioctl call that userspace programs can use to commit updates to files
-atomically.
+As discussed earlier, a second frontend to the atomic file mapping exchange
+mechanism is a new ioctl call that userspace programs can use to commit updates
+to files atomically.
This frontend has been out for review for several years now, though the
necessary refinements to online repair and lack of customer demand mean that
the proposal has not been pushed very hard.
-Extent Swapping with Regular User Files
-```````````````````````````````````````
+File Content Exchanges with Regular User Files
+``````````````````````````````````````````````
As mentioned earlier, XFS has long had the ability to swap extents between
files, which is used almost exclusively by ``xfs_fsr`` to defragment files.
@@ -5134,12 +5329,12 @@ the consistency of the fork mappings with the reverse mapping index was to
develop an iterative mechanism that used deferred bmap and rmap operations to
swap mappings one at a time.
This mechanism is identical to steps 2-3 from the procedure above except for
-the new tracking items, because the atomic extent swap mechanism is an
-iteration of an existing mechanism and not something totally novel.
+the new tracking items, because the atomic file mapping exchange mechanism is
+an iteration of an existing mechanism and not something totally novel.
For the narrow case of file defragmentation, the file contents must be
identical, so the recovery guarantees are not much of a gain.
-Atomic extent swapping is much more flexible than the existing swapext
+Atomic file content exchanges are much more flexible than the existing swapext
implementations because it can guarantee that the caller never sees a mix of
old and new contents even after a crash, and it can operate on two arbitrary
file fork ranges.
@@ -5150,11 +5345,11 @@ The extra flexibility enables several new use cases:
Next, it opens a temporary file and calls the file clone operation to reflink
the first file's contents into the temporary file.
Writes to the original file should instead be written to the temporary file.
- Finally, the process calls the atomic extent swap system call
- (``FIEXCHANGE_RANGE``) to exchange the file contents, thereby committing all
- of the updates to the original file, or none of them.
+ Finally, the process calls the atomic file mapping exchange system call
+ (``XFS_IOC_EXCHANGE_RANGE``) to exchange the file contents, thereby
+ committing all of the updates to the original file, or none of them.
-.. _swapext_if_unchanged:
+.. _exchrange_if_unchanged:
- **Transactional file updates**: The same mechanism as above, but the caller
only wants the commit to occur if the original file's contents have not
@@ -5163,16 +5358,17 @@ The extra flexibility enables several new use cases:
change timestamps of the original file before reflinking its data to the
temporary file.
When the program is ready to commit the changes, it passes the timestamps
- into the kernel as arguments to the atomic extent swap system call.
+ into the kernel as arguments to the atomic file mapping exchange system call.
The kernel only commits the changes if the provided timestamps match the
original file.
+ A new ioctl (``XFS_IOC_COMMIT_RANGE``) is provided to perform this.
- **Emulation of atomic block device writes**: Export a block device with a
logical sector size matching the filesystem block size to force all writes
to be aligned to the filesystem block size.
Stage all writes to a temporary file, and when that is complete, call the
- atomic extent swap system call with a flag to indicate that holes in the
- temporary file should be ignored.
+ atomic file mapping exchange system call with a flag to indicate that holes
+ in the temporary file should be ignored.
This emulates an atomic device write in software, and can support arbitrary
scattered writes.
@@ -5254,8 +5450,8 @@ of the file to try to share the physical space with a dummy file.
Cloning the extent means that the original owners cannot overwrite the
contents; any changes will be written somewhere else via copy-on-write.
Clearspace makes its own copy of the frozen extent in an area that is not being
-cleared, and uses ``FIEDEUPRANGE`` (or the :ref:`atomic extent swap
-<swapext_if_unchanged>` feature) to change the target file's data extent
+cleared, and uses ``FIEDEUPRANGE`` (or the :ref:`atomic file content exchanges
+<exchrange_if_unchanged>` feature) to change the target file's data extent
mapping away from the area being cleared.
When all other mappings have been moved, clearspace reflinks the space into the
space collector file so that it becomes unavailable.