summaryrefslogtreecommitdiff
path: root/Documentation/filesystems
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/filesystems')
-rw-r--r--Documentation/filesystems/afs.rst2
-rw-r--r--Documentation/filesystems/dax.txt142
-rw-r--r--Documentation/filesystems/debugfs.rst4
-rw-r--r--Documentation/filesystems/f2fs.rst8
-rw-r--r--Documentation/filesystems/fiemap.rst12
-rw-r--r--Documentation/filesystems/fscrypt.rst33
-rw-r--r--Documentation/filesystems/gfs2-glocks.rst (renamed from Documentation/filesystems/gfs2-glocks.txt)149
-rw-r--r--Documentation/filesystems/index.rst1
-rw-r--r--Documentation/filesystems/locking.rst8
-rw-r--r--Documentation/filesystems/overlayfs.rst7
-rw-r--r--Documentation/filesystems/porting.rst7
-rw-r--r--Documentation/filesystems/proc.rst97
-rw-r--r--Documentation/filesystems/vfs.rst15
-rw-r--r--Documentation/filesystems/virtiofs.rst14
-rw-r--r--Documentation/filesystems/xfs-self-describing-metadata.rst10
15 files changed, 398 insertions, 111 deletions
diff --git a/Documentation/filesystems/afs.rst b/Documentation/filesystems/afs.rst
index c4ec39a5966e..cada9464d6bd 100644
--- a/Documentation/filesystems/afs.rst
+++ b/Documentation/filesystems/afs.rst
@@ -70,7 +70,7 @@ list of volume location server IP addresses::
The first module is the AF_RXRPC network protocol driver. This provides the
RxRPC remote operation protocol and may also be accessed from userspace. See:
- Documentation/networking/rxrpc.txt
+ Documentation/networking/rxrpc.rst
The second module is the kerberos RxRPC security driver, and the third module
is the actual filesystem driver for the AFS filesystem.
diff --git a/Documentation/filesystems/dax.txt b/Documentation/filesystems/dax.txt
index 735f3859b19f..8e2670781c9b 100644
--- a/Documentation/filesystems/dax.txt
+++ b/Documentation/filesystems/dax.txt
@@ -20,8 +20,144 @@ Usage
If you have a block device which supports DAX, you can make a filesystem
on it as usual. The DAX code currently only supports files with a block
size equal to your kernel's PAGE_SIZE, so you may need to specify a block
-size when creating the filesystem. When mounting it, use the "-o dax"
-option on the command line or add 'dax' to the options in /etc/fstab.
+size when creating the filesystem.
+
+Currently 3 filesystems support DAX: ext2, ext4 and xfs. Enabling DAX on them
+is different.
+
+Enabling DAX on ext4 and ext2
+-----------------------------
+
+When mounting the filesystem, use the "-o dax" option on the command line or
+add 'dax' to the options in /etc/fstab. This works to enable DAX on all files
+within the filesystem. It is equivalent to the '-o dax=always' behavior below.
+
+
+Enabling DAX on xfs
+-------------------
+
+Summary
+-------
+
+ 1. There exists an in-kernel file access mode flag S_DAX that corresponds to
+ the statx flag STATX_ATTR_DAX. See the manpage for statx(2) for details
+ about this access mode.
+
+ 2. There exists a persistent flag FS_XFLAG_DAX that can be applied to regular
+ files and directories. This advisory flag can be set or cleared at any
+ time, but doing so does not immediately affect the S_DAX state.
+
+ 3. If the persistent FS_XFLAG_DAX flag is set on a directory, this flag will
+ be inherited by all regular files and subdirectories that are subsequently
+ created in this directory. Files and subdirectories that exist at the time
+ this flag is set or cleared on the parent directory are not modified by
+ this modification of the parent directory.
+
+ 4. There exist dax mount options which can override FS_XFLAG_DAX in the
+ setting of the S_DAX flag. Given underlying storage which supports DAX the
+ following hold:
+
+ "-o dax=inode" means "follow FS_XFLAG_DAX" and is the default.
+
+ "-o dax=never" means "never set S_DAX, ignore FS_XFLAG_DAX."
+
+ "-o dax=always" means "always set S_DAX ignore FS_XFLAG_DAX."
+
+ "-o dax" is a legacy option which is an alias for "dax=always".
+ This may be removed in the future so "-o dax=always" is
+ the preferred method for specifying this behavior.
+
+ NOTE: Modifications to and the inheritance behavior of FS_XFLAG_DAX remain
+ the same even when the filesystem is mounted with a dax option. However,
+ in-core inode state (S_DAX) will be overridden until the filesystem is
+ remounted with dax=inode and the inode is evicted from kernel memory.
+
+ 5. The S_DAX policy can be changed via:
+
+ a) Setting the parent directory FS_XFLAG_DAX as needed before files are
+ created
+
+ b) Setting the appropriate dax="foo" mount option
+
+ c) Changing the FS_XFLAG_DAX flag on existing regular files and
+ directories. This has runtime constraints and limitations that are
+ described in 6) below.
+
+ 6. When changing the S_DAX policy via toggling the persistent FS_XFLAG_DAX flag,
+ the change in behaviour for existing regular files may not occur
+ immediately. If the change must take effect immediately, the administrator
+ needs to:
+
+ a) stop the application so there are no active references to the data set
+ the policy change will affect
+
+ b) evict the data set from kernel caches so it will be re-instantiated when
+ the application is restarted. This can be achieved by:
+
+ i. drop-caches
+ ii. a filesystem unmount and mount cycle
+ iii. a system reboot
+
+
+Details
+-------
+
+There are 2 per-file dax flags. One is a persistent inode setting (FS_XFLAG_DAX)
+and the other is a volatile flag indicating the active state of the feature
+(S_DAX).
+
+FS_XFLAG_DAX is preserved within the filesystem. This persistent config
+setting can be set, cleared and/or queried using the FS_IOC_FS[GS]ETXATTR ioctl
+(see ioctl_xfs_fsgetxattr(2)) or an utility such as 'xfs_io'.
+
+New files and directories automatically inherit FS_XFLAG_DAX from
+their parent directory _when_ _created_. Therefore, setting FS_XFLAG_DAX at
+directory creation time can be used to set a default behavior for an entire
+sub-tree.
+
+To clarify inheritance, here are 3 examples:
+
+Example A:
+
+mkdir -p a/b/c
+xfs_io -c 'chattr +x' a
+mkdir a/b/c/d
+mkdir a/e
+
+ dax: a,e
+ no dax: b,c,d
+
+Example B:
+
+mkdir a
+xfs_io -c 'chattr +x' a
+mkdir -p a/b/c/d
+
+ dax: a,b,c,d
+ no dax:
+
+Example C:
+
+mkdir -p a/b/c
+xfs_io -c 'chattr +x' c
+mkdir a/b/c/d
+
+ dax: c,d
+ no dax: a,b
+
+
+The current enabled state (S_DAX) is set when a file inode is instantiated in
+memory by the kernel. It is set based on the underlying media support, the
+value of FS_XFLAG_DAX and the filesystem's dax mount option.
+
+statx can be used to query S_DAX. NOTE that only regular files will ever have
+S_DAX set and therefore statx will never indicate that S_DAX is set on
+directories.
+
+Setting the FS_XFLAG_DAX flag (specifically or through inheritance) occurs even
+if the underlying media does not support dax and/or the filesystem is
+overridden with a mount option.
+
Implementation Tips for Block Driver Writers
@@ -94,7 +230,7 @@ sysadmins have an option to restore the lost data from a prior backup/inbuilt
redundancy in the following ways:
1. Delete the affected file, and restore from a backup (sysadmin route):
- This will free the file system blocks that were being used by the file,
+ This will free the filesystem blocks that were being used by the file,
and the next time they're allocated, they will be zeroed first, which
happens through the driver, and will clear bad sectors.
diff --git a/Documentation/filesystems/debugfs.rst b/Documentation/filesystems/debugfs.rst
index 0cfeca376169..1da7a4b7383d 100644
--- a/Documentation/filesystems/debugfs.rst
+++ b/Documentation/filesystems/debugfs.rst
@@ -79,8 +79,8 @@ created with any of::
struct dentry *parent, u8 *value);
void debugfs_create_u16(const char *name, umode_t mode,
struct dentry *parent, u16 *value);
- struct dentry *debugfs_create_u32(const char *name, umode_t mode,
- struct dentry *parent, u32 *value);
+ void debugfs_create_u32(const char *name, umode_t mode,
+ struct dentry *parent, u32 *value);
void debugfs_create_u64(const char *name, umode_t mode,
struct dentry *parent, u64 *value);
diff --git a/Documentation/filesystems/f2fs.rst b/Documentation/filesystems/f2fs.rst
index 87d794bc75a4..099d45ac8d8f 100644
--- a/Documentation/filesystems/f2fs.rst
+++ b/Documentation/filesystems/f2fs.rst
@@ -225,8 +225,12 @@ fsync_mode=%s Control the policy of fsync. Currently supports "posix",
pass, but the performance will regress. "nobarrier" is
based on "posix", but doesn't issue flush command for
non-atomic files likewise "nobarrier" mount option.
-test_dummy_encryption Enable dummy encryption, which provides a fake fscrypt
+test_dummy_encryption
+test_dummy_encryption=%s
+ Enable dummy encryption, which provides a fake fscrypt
context. The fake fscrypt context is used by xfstests.
+ The argument may be either "v1" or "v2", in order to
+ select the corresponding fscrypt policy version.
checkpoint=%s[:%u[%]] Set to "disable" to turn off checkpointing. Set to "enable"
to reenable checkpointing. Is enabled by default. While
disabled, any unmounting or unexpected shutdowns will cause
@@ -244,7 +248,7 @@ checkpoint=%s[:%u[%]] Set to "disable" to turn off checkpointing. Set to "enabl
would be unusable can be viewed at /sys/fs/f2fs/<disk>/unusable
This space is reclaimed once checkpoint=enable.
compress_algorithm=%s Control compress algorithm, currently f2fs supports "lzo",
- "lz4" and "zstd" algorithm.
+ "lz4", "zstd" and "lzo-rle" algorithm.
compress_log_size=%u Support configuring compress cluster size, the size will
be 4KB * (1 << %u), 16KB is minimum size, also it's
default size.
diff --git a/Documentation/filesystems/fiemap.rst b/Documentation/filesystems/fiemap.rst
index 2a572e7edc08..93fc96f760aa 100644
--- a/Documentation/filesystems/fiemap.rst
+++ b/Documentation/filesystems/fiemap.rst
@@ -206,16 +206,18 @@ EINTR once fatal signal received.
Flag checking should be done at the beginning of the ->fiemap callback via the
-fiemap_check_flags() helper::
+fiemap_prep() helper::
- int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags);
+ int fiemap_prep(struct inode *inode, struct fiemap_extent_info *fieinfo,
+ u64 start, u64 *len, u32 supported_flags);
The struct fieinfo should be passed in as received from ioctl_fiemap(). The
set of fiemap flags which the fs understands should be passed via fs_flags. If
-fiemap_check_flags finds invalid user flags, it will place the bad values in
+fiemap_prep finds invalid user flags, it will place the bad values in
fieinfo->fi_flags and return -EBADR. If the file system gets -EBADR, from
-fiemap_check_flags(), it should immediately exit, returning that error back to
-ioctl_fiemap().
+fiemap_prep(), it should immediately exit, returning that error back to
+ioctl_fiemap(). Additionally the range is validate against the supported
+maximum file size.
For each extent in the request range, the file system should call
diff --git a/Documentation/filesystems/fscrypt.rst b/Documentation/filesystems/fscrypt.rst
index aa072112cfff..f517af8ec11c 100644
--- a/Documentation/filesystems/fscrypt.rst
+++ b/Documentation/filesystems/fscrypt.rst
@@ -292,8 +292,22 @@ files' data differently, inode numbers are included in the IVs.
Consequently, shrinking the filesystem may not be allowed.
This format is optimized for use with inline encryption hardware
-compliant with the UFS or eMMC standards, which support only 64 IV
-bits per I/O request and may have only a small number of keyslots.
+compliant with the UFS standard, which supports only 64 IV bits per
+I/O request and may have only a small number of keyslots.
+
+IV_INO_LBLK_32 policies
+-----------------------
+
+IV_INO_LBLK_32 policies work like IV_INO_LBLK_64, except that for
+IV_INO_LBLK_32, the inode number is hashed with SipHash-2-4 (where the
+SipHash key is derived from the master key) and added to the file
+logical block number mod 2^32 to produce a 32-bit IV.
+
+This format is optimized for use with inline encryption hardware
+compliant with the eMMC v5.2 standard, which supports only 32 IV bits
+per I/O request and may have only a small number of keyslots. This
+format results in some level of IV reuse, so it should only be used
+when necessary due to hardware limitations.
Key identifiers
---------------
@@ -369,6 +383,10 @@ a little endian number, except that:
to 32 bits and is placed in bits 0-31 of the IV. The inode number
(which is also limited to 32 bits) is placed in bits 32-63.
+- With `IV_INO_LBLK_32 policies`_, the logical block number is limited
+ to 32 bits and is placed in bits 0-31 of the IV. The inode number
+ is then hashed and added mod 2^32.
+
Note that because file logical block numbers are included in the IVs,
filesystems must enforce that blocks are never shifted around within
encrypted files, e.g. via "collapse range" or "insert range".
@@ -465,8 +483,15 @@ This structure must be initialized as follows:
(0x3).
- FSCRYPT_POLICY_FLAG_DIRECT_KEY: See `DIRECT_KEY policies`_.
- FSCRYPT_POLICY_FLAG_IV_INO_LBLK_64: See `IV_INO_LBLK_64
- policies`_. This is mutually exclusive with DIRECT_KEY and is not
- supported on v1 policies.
+ policies`_.
+ - FSCRYPT_POLICY_FLAG_IV_INO_LBLK_32: See `IV_INO_LBLK_32
+ policies`_.
+
+ v1 encryption policies only support the PAD_* and DIRECT_KEY flags.
+ The other flags are only supported by v2 encryption policies.
+
+ The DIRECT_KEY, IV_INO_LBLK_64, and IV_INO_LBLK_32 flags are
+ mutually exclusive.
- For v2 encryption policies, ``__reserved`` must be zeroed.
diff --git a/Documentation/filesystems/gfs2-glocks.txt b/Documentation/filesystems/gfs2-glocks.rst
index 7059623635b2..d14f230f0b12 100644
--- a/Documentation/filesystems/gfs2-glocks.txt
+++ b/Documentation/filesystems/gfs2-glocks.rst
@@ -1,5 +1,8 @@
- Glock internal locking rules
- ------------------------------
+.. SPDX-License-Identifier: GPL-2.0
+
+============================
+Glock internal locking rules
+============================
This documents the basic principles of the glock state machine
internals. Each glock (struct gfs2_glock in fs/gfs2/incore.h)
@@ -24,24 +27,28 @@ There are three lock states that users of the glock layer can request,
namely shared (SH), deferred (DF) and exclusive (EX). Those translate
to the following DLM lock modes:
-Glock mode | DLM lock mode
-------------------------------
- UN | IV/NL Unlocked (no DLM lock associated with glock) or NL
- SH | PR (Protected read)
- DF | CW (Concurrent write)
- EX | EX (Exclusive)
+========== ====== =====================================================
+Glock mode DLM lock mode
+========== ====== =====================================================
+ UN IV/NL Unlocked (no DLM lock associated with glock) or NL
+ SH PR (Protected read)
+ DF CW (Concurrent write)
+ EX EX (Exclusive)
+========== ====== =====================================================
Thus DF is basically a shared mode which is incompatible with the "normal"
shared lock mode, SH. In GFS2 the DF mode is used exclusively for direct I/O
operations. The glocks are basically a lock plus some routines which deal
with cache management. The following rules apply for the cache:
-Glock mode | Cache data | Cache Metadata | Dirty Data | Dirty Metadata
---------------------------------------------------------------------------
- UN | No | No | No | No
- SH | Yes | Yes | No | No
- DF | No | Yes | No | No
- EX | Yes | Yes | Yes | Yes
+========== ========== ============== ========== ==============
+Glock mode Cache data Cache Metadata Dirty Data Dirty Metadata
+========== ========== ============== ========== ==============
+ UN No No No No
+ SH Yes Yes No No
+ DF No Yes No No
+ EX Yes Yes Yes Yes
+========== ========== ============== ========== ==============
These rules are implemented using the various glock operations which
are defined for each type of glock. Not all types of glocks use
@@ -49,21 +56,23 @@ all the modes. Only inode glocks use the DF mode for example.
Table of glock operations and per type constants:
-Field | Purpose
-----------------------------------------------------------------------------
-go_xmote_th | Called before remote state change (e.g. to sync dirty data)
-go_xmote_bh | Called after remote state change (e.g. to refill cache)
-go_inval | Called if remote state change requires invalidating the cache
-go_demote_ok | Returns boolean value of whether its ok to demote a glock
- | (e.g. checks timeout, and that there is no cached data)
-go_lock | Called for the first local holder of a lock
-go_unlock | Called on the final local unlock of a lock
-go_dump | Called to print content of object for debugfs file, or on
- | error to dump glock to the log.
-go_type | The type of the glock, LM_TYPE_.....
-go_callback | Called if the DLM sends a callback to drop this lock
-go_flags | GLOF_ASPACE is set, if the glock has an address space
- | associated with it
+============= =============================================================
+Field Purpose
+============= =============================================================
+go_xmote_th Called before remote state change (e.g. to sync dirty data)
+go_xmote_bh Called after remote state change (e.g. to refill cache)
+go_inval Called if remote state change requires invalidating the cache
+go_demote_ok Returns boolean value of whether its ok to demote a glock
+ (e.g. checks timeout, and that there is no cached data)
+go_lock Called for the first local holder of a lock
+go_unlock Called on the final local unlock of a lock
+go_dump Called to print content of object for debugfs file, or on
+ error to dump glock to the log.
+go_type The type of the glock, ``LM_TYPE_*``
+go_callback Called if the DLM sends a callback to drop this lock
+go_flags GLOF_ASPACE is set, if the glock has an address space
+ associated with it
+============= =============================================================
The minimum hold time for each lock is the time after a remote lock
grant for which we ignore remote demote requests. This is in order to
@@ -82,21 +91,25 @@ rather than via the glock.
Locking rules for glock operations:
-Operation | GLF_LOCK bit lock held | gl_lockref.lock spinlock held
--------------------------------------------------------------------------
-go_xmote_th | Yes | No
-go_xmote_bh | Yes | No
-go_inval | Yes | No
-go_demote_ok | Sometimes | Yes
-go_lock | Yes | No
-go_unlock | Yes | No
-go_dump | Sometimes | Yes
-go_callback | Sometimes (N/A) | Yes
-
-N.B. Operations must not drop either the bit lock or the spinlock
-if its held on entry. go_dump and do_demote_ok must never block.
-Note that go_dump will only be called if the glock's state
-indicates that it is caching uptodate data.
+============= ====================== =============================
+Operation GLF_LOCK bit lock held gl_lockref.lock spinlock held
+============= ====================== =============================
+go_xmote_th Yes No
+go_xmote_bh Yes No
+go_inval Yes No
+go_demote_ok Sometimes Yes
+go_lock Yes No
+go_unlock Yes No
+go_dump Sometimes Yes
+go_callback Sometimes (N/A) Yes
+============= ====================== =============================
+
+.. Note::
+
+ Operations must not drop either the bit lock or the spinlock
+ if its held on entry. go_dump and do_demote_ok must never block.
+ Note that go_dump will only be called if the glock's state
+ indicates that it is caching uptodate data.
Glock locking order within GFS2:
@@ -104,7 +117,7 @@ Glock locking order within GFS2:
2. Rename glock (for rename only)
3. Inode glock(s)
(Parents before children, inodes at "same level" with same parent in
- lock number order)
+ lock number order)
4. Rgrp glock(s) (for (de)allocation operations)
5. Transaction glock (via gfs2_trans_begin) for non-read operations
6. i_rw_mutex (if required)
@@ -117,8 +130,8 @@ determine the lifetime of the inode in question. Locking of inodes
is on a per-inode basis. Locking of rgrps is on a per rgrp basis.
In general we prefer to lock local locks prior to cluster locks.
- Glock Statistics
- ------------------
+Glock Statistics
+----------------
The stats are divided into two sets: those relating to the
super block and those relating to an individual glock. The
@@ -173,8 +186,8 @@ we'd like to get a better idea of these timings:
1. To be able to better set the glock "min hold time"
2. To spot performance issues more easily
3. To improve the algorithm for selecting resource groups for
-allocation (to base it on lock wait time, rather than blindly
-using a "try lock")
+ allocation (to base it on lock wait time, rather than blindly
+ using a "try lock")
Due to the smoothing action of the updates, a step change in
some input quantity being sampled will only fully be taken
@@ -195,10 +208,13 @@ as possible. There are always inaccuracies in any
measuring system, but I hope this is as accurate as we
can reasonably make it.
-Per sb stats can be found here:
-/sys/kernel/debug/gfs2/<fsname>/sbstats
-Per glock stats can be found here:
-/sys/kernel/debug/gfs2/<fsname>/glstats
+Per sb stats can be found here::
+
+ /sys/kernel/debug/gfs2/<fsname>/sbstats
+
+Per glock stats can be found here::
+
+ /sys/kernel/debug/gfs2/<fsname>/glstats
Assuming that debugfs is mounted on /sys/kernel/debug and also
that <fsname> is replaced with the name of the gfs2 filesystem
@@ -206,14 +222,16 @@ in question.
The abbreviations used in the output as are follows:
-srtt - Smoothed round trip time for non-blocking dlm requests
-srttvar - Variance estimate for srtt
-srttb - Smoothed round trip time for (potentially) blocking dlm requests
-srttvarb - Variance estimate for srttb
-sirt - Smoothed inter-request time (for dlm requests)
-sirtvar - Variance estimate for sirt
-dlm - Number of dlm requests made (dcnt in glstats file)
-queue - Number of glock requests queued (qcnt in glstats file)
+========= ================================================================
+srtt Smoothed round trip time for non blocking dlm requests
+srttvar Variance estimate for srtt
+srttb Smoothed round trip time for (potentially) blocking dlm requests
+srttvarb Variance estimate for srttb
+sirt Smoothed inter request time (for dlm requests)
+sirtvar Variance estimate for sirt
+dlm Number of dlm requests made (dcnt in glstats file)
+queue Number of glock requests queued (qcnt in glstats file)
+========= ================================================================
The sbstats file contains a set of these stats for each glock type (so 8 lines
for each type) and for each cpu (one column per cpu). The glstats file contains
@@ -224,9 +242,12 @@ The gfs2_glock_lock_time tracepoint prints out the current values of the stats
for the glock in question, along with some addition information on each dlm
reply that is received:
-status - The status of the dlm request
-flags - The dlm request flags
-tdiff - The time taken by this specific request
+====== =======================================
+status The status of the dlm request
+flags The dlm request flags
+tdiff The time taken by this specific request
+====== =======================================
+
(remaining fields as per above list)
diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst
index 17795341e0a3..4c536e66dc4c 100644
--- a/Documentation/filesystems/index.rst
+++ b/Documentation/filesystems/index.rst
@@ -88,6 +88,7 @@ Documentation for filesystem implementations.
f2fs
gfs2
gfs2-uevents
+ gfs2-glocks
hfs
hfsplus
hpfs
diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst
index 5057e4d9dcd1..eb71156bcb7c 100644
--- a/Documentation/filesystems/locking.rst
+++ b/Documentation/filesystems/locking.rst
@@ -239,6 +239,7 @@ prototypes::
int (*readpage)(struct file *, struct page *);
int (*writepages)(struct address_space *, struct writeback_control *);
int (*set_page_dirty)(struct page *page);
+ void (*readahead)(struct readahead_control *);
int (*readpages)(struct file *filp, struct address_space *mapping,
struct list_head *pages, unsigned nr_pages);
int (*write_begin)(struct file *, struct address_space *mapping,
@@ -271,7 +272,8 @@ writepage: yes, unlocks (see below)
readpage: yes, unlocks
writepages:
set_page_dirty no
-readpages:
+readahead: yes, unlocks
+readpages: no
write_begin: locks the page exclusive
write_end: yes, unlocks exclusive
bmap:
@@ -295,6 +297,8 @@ the request handler (/dev/loop).
->readpage() unlocks the page, either synchronously or via I/O
completion.
+->readahead() unlocks the pages that I/O is attempted on like ->readpage().
+
->readpages() populates the pagecache with the passed pages and starts
I/O against them. They come unlocked upon I/O completion.
@@ -611,7 +615,7 @@ prototypes::
locking rules:
============= ======== ===========================
-ops mmap_sem PageLocked(page)
+ops mmap_lock PageLocked(page)
============= ======== ===========================
open: yes
close: yes
diff --git a/Documentation/filesystems/overlayfs.rst b/Documentation/filesystems/overlayfs.rst
index c9d2bf96b02d..660dbaf0b9b8 100644
--- a/Documentation/filesystems/overlayfs.rst
+++ b/Documentation/filesystems/overlayfs.rst
@@ -365,8 +365,8 @@ pointed by REDIRECT. This should not be possible on local system as setting
"trusted." xattrs will require CAP_SYS_ADMIN. But it should be possible
for untrusted layers like from a pen drive.
-Note: redirect_dir={off|nofollow|follow[*]} conflicts with metacopy=on, and
-results in an error.
+Note: redirect_dir={off|nofollow|follow[*]} and nfs_export=on mount options
+conflict with metacopy=on, and will result in an error.
[*] redirect_dir=follow only conflicts with metacopy=on if upperdir=... is
given.
@@ -560,6 +560,9 @@ When the NFS export feature is enabled, all directory index entries are
verified on mount time to check that upper file handles are not stale.
This verification may cause significant overhead in some cases.
+Note: the mount options index=off,nfs_export=on are conflicting and will
+result in an error.
+
Testsuite
---------
diff --git a/Documentation/filesystems/porting.rst b/Documentation/filesystems/porting.rst
index 26c093969573..867036aa90b8 100644
--- a/Documentation/filesystems/porting.rst
+++ b/Documentation/filesystems/porting.rst
@@ -858,3 +858,10 @@ be misspelled d_alloc_anon().
[should've been added in 2016] stale comment in finish_open() nonwithstanding,
failure exits in ->atomic_open() instances should *NOT* fput() the file,
no matter what. Everything is handled by the caller.
+
+---
+
+**mandatory**
+
+clone_private_mount() returns a longterm mount now, so the proper destructor of
+its result is kern_unmount() or kern_unmount_array().
diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index a567bcccbb02..996f3cfe7030 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -51,6 +51,8 @@ fixes/update part 1.1 Stefani Seibold <stefani@seibold.net> June 9 2009
4 Configuring procfs
4.1 Mount options
+ 5 Filesystem behavior
+
Preface
=======
@@ -543,6 +545,7 @@ encoded manner. The codes are the following:
hg huge page advise flag
nh no huge page advise flag
mg mergable advise flag
+ bt - arm64 BTI guarded page
== =======================================
Note that there is no guarantee that every flag and associated mnemonic will
@@ -1042,8 +1045,8 @@ PageTables
amount of memory dedicated to the lowest level of page
tables.
NFS_Unstable
- NFS pages sent to the server, but not yet committed to stable
- storage
+ Always zero. Previous counted pages which had been written to
+ the server, but has not been committed to stable storage.
Bounce
Memory used for block device "bounce buffers"
WritebackTmp
@@ -2142,28 +2145,80 @@ The following mount options are supported:
========= ========================================================
hidepid= Set /proc/<pid>/ access mode.
gid= Set the group authorized to learn processes information.
+ subset= Show only the specified subset of procfs.
========= ========================================================
-hidepid=0 means classic mode - everybody may access all /proc/<pid>/ directories
-(default).
-
-hidepid=1 means users may not access any /proc/<pid>/ directories but their
-own. Sensitive files like cmdline, sched*, status are now protected against
-other users. This makes it impossible to learn whether any user runs
-specific program (given the program doesn't reveal itself by its behaviour).
-As an additional bonus, as /proc/<pid>/cmdline is unaccessible for other users,
-poorly written programs passing sensitive information via program arguments are
-now protected against local eavesdroppers.
-
-hidepid=2 means hidepid=1 plus all /proc/<pid>/ will be fully invisible to other
-users. It doesn't mean that it hides a fact whether a process with a specific
-pid value exists (it can be learned by other means, e.g. by "kill -0 $PID"),
-but it hides process' uid and gid, which may be learned by stat()'ing
-/proc/<pid>/ otherwise. It greatly complicates an intruder's task of gathering
-information about running processes, whether some daemon runs with elevated
-privileges, whether other user runs some sensitive program, whether other users
-run any program at all, etc.
+hidepid=off or hidepid=0 means classic mode - everybody may access all
+/proc/<pid>/ directories (default).
+
+hidepid=noaccess or hidepid=1 means users may not access any /proc/<pid>/
+directories but their own. Sensitive files like cmdline, sched*, status are now
+protected against other users. This makes it impossible to learn whether any
+user runs specific program (given the program doesn't reveal itself by its
+behaviour). As an additional bonus, as /proc/<pid>/cmdline is unaccessible for
+other users, poorly written programs passing sensitive information via program
+arguments are now protected against local eavesdroppers.
+
+hidepid=invisible or hidepid=2 means hidepid=1 plus all /proc/<pid>/ will be
+fully invisible to other users. It doesn't mean that it hides a fact whether a
+process with a specific pid value exists (it can be learned by other means, e.g.
+by "kill -0 $PID"), but it hides process' uid and gid, which may be learned by
+stat()'ing /proc/<pid>/ otherwise. It greatly complicates an intruder's task of
+gathering information about running processes, whether some daemon runs with
+elevated privileges, whether other user runs some sensitive program, whether
+other users run any program at all, etc.
+
+hidepid=ptraceable or hidepid=4 means that procfs should only contain
+/proc/<pid>/ directories that the caller can ptrace.
gid= defines a group authorized to learn processes information otherwise
prohibited by hidepid=. If you use some daemon like identd which needs to learn
information about processes information, just add identd to this group.
+
+subset=pid hides all top level files and directories in the procfs that
+are not related to tasks.
+
+5 Filesystem behavior
+----------------------------
+
+Originally, before the advent of pid namepsace, procfs was a global file
+system. It means that there was only one procfs instance in the system.
+
+When pid namespace was added, a separate procfs instance was mounted in
+each pid namespace. So, procfs mount options are global among all
+mountpoints within the same namespace.
+
+::
+
+# grep ^proc /proc/mounts
+proc /proc proc rw,relatime,hidepid=2 0 0
+
+# strace -e mount mount -o hidepid=1 -t proc proc /tmp/proc
+mount("proc", "/tmp/proc", "proc", 0, "hidepid=1") = 0
++++ exited with 0 +++
+
+# grep ^proc /proc/mounts
+proc /proc proc rw,relatime,hidepid=2 0 0
+proc /tmp/proc proc rw,relatime,hidepid=2 0 0
+
+and only after remounting procfs mount options will change at all
+mountpoints.
+
+# mount -o remount,hidepid=1 -t proc proc /tmp/proc
+
+# grep ^proc /proc/mounts
+proc /proc proc rw,relatime,hidepid=1 0 0
+proc /tmp/proc proc rw,relatime,hidepid=1 0 0
+
+This behavior is different from the behavior of other filesystems.
+
+The new procfs behavior is more like other filesystems. Each procfs mount
+creates a new procfs instance. Mount options affect own procfs instance.
+It means that it became possible to have several procfs instances
+displaying tasks with different filtering options in one pid namespace.
+
+# mount -o hidepid=invisible -t proc proc /proc
+# mount -o hidepid=noaccess -t proc proc /tmp/proc
+# grep ^proc /proc/mounts
+proc /proc proc rw,relatime,hidepid=invisible 0 0
+proc /tmp/proc proc rw,relatime,hidepid=noaccess 0 0
diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
index 7d4d09dd5e6d..ed17771c212b 100644
--- a/Documentation/filesystems/vfs.rst
+++ b/Documentation/filesystems/vfs.rst
@@ -706,6 +706,7 @@ cache in your filesystem. The following members are defined:
int (*readpage)(struct file *, struct page *);
int (*writepages)(struct address_space *, struct writeback_control *);
int (*set_page_dirty)(struct page *page);
+ void (*readahead)(struct readahead_control *);
int (*readpages)(struct file *filp, struct address_space *mapping,
struct list_head *pages, unsigned nr_pages);
int (*write_begin)(struct file *, struct address_space *mapping,
@@ -781,12 +782,26 @@ cache in your filesystem. The following members are defined:
If defined, it should set the PageDirty flag, and the
PAGECACHE_TAG_DIRTY tag in the radix tree.
+``readahead``
+ Called by the VM to read pages associated with the address_space
+ object. The pages are consecutive in the page cache and are
+ locked. The implementation should decrement the page refcount
+ after starting I/O on each page. Usually the page will be
+ unlocked by the I/O completion handler. If the filesystem decides
+ to stop attempting I/O before reaching the end of the readahead
+ window, it can simply return. The caller will decrement the page
+ refcount and unlock the remaining pages for you. Set PageUptodate
+ if the I/O completes successfully. Setting PageError on any page
+ will be ignored; simply unlock the page if an I/O error occurs.
+
``readpages``
called by the VM to read pages associated with the address_space
object. This is essentially just a vector version of readpage.
Instead of just one page, several pages are requested.
readpages is only used for read-ahead, so read errors are
ignored. If anything goes wrong, feel free to give up.
+ This interface is deprecated and will be removed by the end of
+ 2020; implement readahead instead.
``write_begin``
Called by the generic buffered write code to ask the filesystem
diff --git a/Documentation/filesystems/virtiofs.rst b/Documentation/filesystems/virtiofs.rst
index e06e4951cb39..fd4d2484e949 100644
--- a/Documentation/filesystems/virtiofs.rst
+++ b/Documentation/filesystems/virtiofs.rst
@@ -39,6 +39,20 @@ Mount file system with tag ``myfs`` on ``/mnt``:
Please see https://virtio-fs.gitlab.io/ for details on how to configure QEMU
and the virtiofsd daemon.
+Mount options
+-------------
+
+virtiofs supports general VFS mount options, for example, remount,
+ro, rw, context, etc. It also supports FUSE mount options.
+
+atime behavior
+^^^^^^^^^^^^^^
+
+The atime-related mount options, for example, noatime, strictatime,
+are ignored. The atime behavior for virtiofs is the same as the
+underlying filesystem of the directory that has been exported
+on the host.
+
Internals
=========
Since the virtio-fs device uses the FUSE protocol for file system requests, the
diff --git a/Documentation/filesystems/xfs-self-describing-metadata.rst b/Documentation/filesystems/xfs-self-describing-metadata.rst
index 51cdafe01ab1..b79dbf36dc94 100644
--- a/Documentation/filesystems/xfs-self-describing-metadata.rst
+++ b/Documentation/filesystems/xfs-self-describing-metadata.rst
@@ -340,11 +340,11 @@ buffer.
The structure of the verifiers and the identifiers checks is very similar to the
buffer code described above. The only difference is where they are called. For
-example, inode read verification is done in xfs_iread() when the inode is first
-read out of the buffer and the struct xfs_inode is instantiated. The inode is
-already extensively verified during writeback in xfs_iflush_int, so the only
-addition here is to add the LSN and CRC to the inode as it is copied back into
-the buffer.
+example, inode read verification is done in xfs_inode_from_disk() when the inode
+is first read out of the buffer and the struct xfs_inode is instantiated. The
+inode is already extensively verified during writeback in xfs_iflush_int, so the
+only addition here is to add the LSN and CRC to the inode as it is copied back
+into the buffer.
XXX: inode unlinked list modification doesn't recalculate the inode CRC! None of
the unlinked list modifications check or update CRCs, neither during unlink nor