summaryrefslogtreecommitdiff
path: root/Documentation
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation')
-rw-r--r--Documentation/ABI/stable/sysfs-block33
-rw-r--r--Documentation/ABI/testing/debugfs-pktcdvd18
-rw-r--r--Documentation/ABI/testing/sysfs-class-pktcdvd97
-rw-r--r--Documentation/ABI/testing/sysfs-fs-erofs10
-rw-r--r--Documentation/admin-guide/LSM/SELinux.rst11
-rw-r--r--Documentation/admin-guide/kernel-parameters.txt8
-rw-r--r--Documentation/admin-guide/sysctl/kernel.rst4
-rw-r--r--Documentation/arch/x86/x86_64/mm.rst2
-rw-r--r--Documentation/cdrom/cdrom-standard.rst1
-rw-r--r--Documentation/cdrom/index.rst1
-rw-r--r--Documentation/cdrom/packet-writing.rst139
-rw-r--r--Documentation/core-api/kernel-api.rst18
-rw-r--r--Documentation/driver-api/libata.rst25
-rw-r--r--Documentation/filesystems/fscrypt.rst45
-rw-r--r--Documentation/filesystems/fsverity.rst3
-rw-r--r--Documentation/filesystems/iomap/design.rst3
-rw-r--r--Documentation/filesystems/iomap/operations.rst57
-rw-r--r--Documentation/filesystems/locking.rst8
-rw-r--r--Documentation/filesystems/porting.rst33
-rw-r--r--Documentation/filesystems/propagate_umount.txt484
-rw-r--r--Documentation/filesystems/vfs.rst37
-rw-r--r--Documentation/nvme/nvme-pci-endpoint-target.rst22
-rw-r--r--Documentation/security/credentials.rst2
-rw-r--r--Documentation/security/self-protection.rst2
-rw-r--r--Documentation/translations/zh_CN/security/credentials.rst2
-rw-r--r--Documentation/translations/zh_CN/security/self-protection.rst2
-rw-r--r--Documentation/userspace-api/ioctl/ioctl-number.rst1
27 files changed, 693 insertions, 375 deletions
diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block
index 4ba771b56b3b..803f578dc023 100644
--- a/Documentation/ABI/stable/sysfs-block
+++ b/Documentation/ABI/stable/sysfs-block
@@ -778,6 +778,39 @@ Description:
0, write zeroes is not supported by the device.
+What: /sys/block/<disk>/queue/write_zeroes_unmap_max_hw_bytes
+Date: January 2025
+Contact: Zhang Yi <yi.zhang@huawei.com>
+Description:
+ [RO] This file indicates whether a device supports zeroing data
+ in a specified block range without incurring the cost of
+ physically writing zeroes to the media for each individual
+ block. If this parameter is set to write_zeroes_max_bytes, the
+ device implements a zeroing operation which opportunistically
+ avoids writing zeroes to media while still guaranteeing that
+ subsequent reads from the specified block range will return
+ zeroed data. This operation is a best-effort optimization, a
+ device may fall back to physically writing zeroes to the media
+ due to other factors such as misalignment or being asked to
+ clear a block range smaller than the device's internal
+ allocation unit. If this parameter is set to 0, the device may
+ have to write each logical block media during a zeroing
+ operation.
+
+
+What: /sys/block/<disk>/queue/write_zeroes_unmap_max_bytes
+Date: January 2025
+Contact: Zhang Yi <yi.zhang@huawei.com>
+Description:
+ [RW] While write_zeroes_unmap_max_hw_bytes is the hardware limit
+ for the device, this setting is the software limit. Since the
+ unmap write zeroes operation is a best-effort optimization, some
+ devices may still physically writing zeroes to media. So the
+ speed of this operation is not guaranteed. Writing a value of
+ '0' to this file disables this operation. Otherwise, this
+ parameter should be equal to write_zeroes_unmap_max_hw_bytes.
+
+
What: /sys/block/<disk>/queue/zone_append_max_bytes
Date: May 2020
Contact: linux-block@vger.kernel.org
diff --git a/Documentation/ABI/testing/debugfs-pktcdvd b/Documentation/ABI/testing/debugfs-pktcdvd
deleted file mode 100644
index f6f65a4faea0..000000000000
--- a/Documentation/ABI/testing/debugfs-pktcdvd
+++ /dev/null
@@ -1,18 +0,0 @@
-What: /sys/kernel/debug/pktcdvd/pktcdvd[0-7]
-Date: Oct. 2006
-KernelVersion: 2.6.20
-Contact: Thomas Maier <balagi@justmail.de>
-Description:
-
-The pktcdvd module (packet writing driver) creates
-these files in debugfs:
-
-/sys/kernel/debug/pktcdvd/pktcdvd[0-7]/
-
- ==== ====== ====================================
- info 0444 Lots of driver statistics and infos.
- ==== ====== ====================================
-
-Example::
-
- cat /sys/kernel/debug/pktcdvd/pktcdvd0/info
diff --git a/Documentation/ABI/testing/sysfs-class-pktcdvd b/Documentation/ABI/testing/sysfs-class-pktcdvd
deleted file mode 100644
index ba1ce626591d..000000000000
--- a/Documentation/ABI/testing/sysfs-class-pktcdvd
+++ /dev/null
@@ -1,97 +0,0 @@
-sysfs interface
----------------
-The pktcdvd module (packet writing driver) creates the following files in the
-sysfs: (<devid> is in the format major:minor)
-
-What: /sys/class/pktcdvd/add
-What: /sys/class/pktcdvd/remove
-What: /sys/class/pktcdvd/device_map
-Date: Oct. 2006
-KernelVersion: 2.6.20
-Contact: Thomas Maier <balagi@justmail.de>
-Description:
-
- ========== ==============================================
- add (WO) Write a block device id (major:minor) to
- create a new pktcdvd device and map it to the
- block device.
-
- remove (WO) Write the pktcdvd device id (major:minor)
- to remove the pktcdvd device.
-
- device_map (RO) Shows the device mapping in format:
- pktcdvd[0-7] <pktdevid> <blkdevid>
- ========== ==============================================
-
-
-What: /sys/class/pktcdvd/pktcdvd[0-7]/dev
-What: /sys/class/pktcdvd/pktcdvd[0-7]/uevent
-Date: Oct. 2006
-KernelVersion: 2.6.20
-Contact: Thomas Maier <balagi@justmail.de>
-Description:
- dev: (RO) Device id
-
- uevent: (WO) To send a uevent
-
-
-What: /sys/class/pktcdvd/pktcdvd[0-7]/stat/packets_started
-What: /sys/class/pktcdvd/pktcdvd[0-7]/stat/packets_finished
-What: /sys/class/pktcdvd/pktcdvd[0-7]/stat/kb_written
-What: /sys/class/pktcdvd/pktcdvd[0-7]/stat/kb_read
-What: /sys/class/pktcdvd/pktcdvd[0-7]/stat/kb_read_gather
-What: /sys/class/pktcdvd/pktcdvd[0-7]/stat/reset
-Date: Oct. 2006
-KernelVersion: 2.6.20
-Contact: Thomas Maier <balagi@justmail.de>
-Description:
- packets_started: (RO) Number of started packets.
-
- packets_finished: (RO) Number of finished packets.
-
- kb_written: (RO) kBytes written.
-
- kb_read: (RO) kBytes read.
-
- kb_read_gather: (RO) kBytes read to fill write packets.
-
- reset: (WO) Write any value to it to reset
- pktcdvd device statistic values, like
- bytes read/written.
-
-
-What: /sys/class/pktcdvd/pktcdvd[0-7]/write_queue/size
-What: /sys/class/pktcdvd/pktcdvd[0-7]/write_queue/congestion_off
-What: /sys/class/pktcdvd/pktcdvd[0-7]/write_queue/congestion_on
-Date: Oct. 2006
-KernelVersion: 2.6.20
-Contact: Thomas Maier <balagi@justmail.de>
-Description:
- ============== ================================================
- size (RO) Contains the size of the bio write queue.
-
- congestion_off (RW) If bio write queue size is below this mark,
- accept new bio requests from the block layer.
-
- congestion_on (RW) If bio write queue size is higher as this
- mark, do no longer accept bio write requests
- from the block layer and wait till the pktcdvd
- device has processed enough bio's so that bio
- write queue size is below congestion off mark.
- A value of <= 0 disables congestion control.
- ============== ================================================
-
-
-Example:
---------
-To use the pktcdvd sysfs interface directly, you can do::
-
- # create a new pktcdvd device mapped to /dev/hdc
- echo "22:0" >/sys/class/pktcdvd/add
- cat /sys/class/pktcdvd/device_map
- # assuming device pktcdvd0 was created, look at stat's
- cat /sys/class/pktcdvd/pktcdvd0/stat/kb_written
- # print the device id of the mapped block device
- fgrep pktcdvd0 /sys/class/pktcdvd/device_map
- # remove device, using pktcdvd0 device id 253:0
- echo "253:0" >/sys/class/pktcdvd/remove
diff --git a/Documentation/ABI/testing/sysfs-fs-erofs b/Documentation/ABI/testing/sysfs-fs-erofs
index bf3b6299c15e..76d9808ed581 100644
--- a/Documentation/ABI/testing/sysfs-fs-erofs
+++ b/Documentation/ABI/testing/sysfs-fs-erofs
@@ -5,7 +5,7 @@ Description: Shows all enabled kernel features.
Supported features:
zero_padding, compr_cfgs, big_pcluster, chunked_file,
device_table, compr_head2, sb_chksum, ztailpacking,
- dedupe, fragments.
+ dedupe, fragments, 48bit, metabox.
What: /sys/fs/erofs/<disk>/sync_decompress
Date: November 2021
@@ -35,3 +35,11 @@ Description: Used to set or show hardware accelerators in effect
and multiple accelerators are separated by '\n'.
Supported accelerator(s): qat_deflate.
Disable all accelerators with an empty string (echo > accel).
+
+What: /sys/fs/erofs/<disk>/dir_ra_bytes
+Date: July 2025
+Contact: "Chao Yu" <chao@kernel.org>
+Description: Used to set or show readahead bytes during readdir(), by
+ default the value is 16384.
+
+ - 0: disable readahead.
diff --git a/Documentation/admin-guide/LSM/SELinux.rst b/Documentation/admin-guide/LSM/SELinux.rst
index 520a1c2c6fd2..cdd65164ca96 100644
--- a/Documentation/admin-guide/LSM/SELinux.rst
+++ b/Documentation/admin-guide/LSM/SELinux.rst
@@ -2,6 +2,17 @@
SELinux
=======
+Information about the SELinux kernel subsystem can be found at the
+following links:
+
+ https://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux.git/tree/README.md
+
+ https://github.com/selinuxproject/selinux-kernel/wiki
+
+Information about the SELinux userspace can be found at:
+
+ https://github.com/SELinuxProject/selinux/wiki
+
If you want to use SELinux, chances are you will want
to use the distro-provided policies, or install the
latest reference policy release from
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 07e22ba5bfe3..343377538fe9 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -7214,6 +7214,14 @@
causing a major performance hit, and the space where
machines are deployed is by other means guarded.
+ tpm_crb_ffa.busy_timeout_ms= [ARM64,TPM]
+ Maximum time in milliseconds to retry sending a message
+ to the TPM service before giving up. This parameter controls
+ how long the system will continue retrying when the TPM
+ service is busy.
+ Format: <unsigned int>
+ Default: 2000 (2 seconds)
+
tpm_suspend_pcr=[HW,TPM]
Format: integer pcr id
Specify that at suspend time, the tpm driver
diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
index dd49a89a62d3..19224eeac1c2 100644
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -1465,7 +1465,7 @@ stack_erasing
=============
This parameter can be used to control kernel stack erasing at the end
-of syscalls for kernels built with ``CONFIG_GCC_PLUGIN_STACKLEAK``.
+of syscalls for kernels built with ``CONFIG_KSTACK_ERASE``.
That erasing reduces the information which kernel stack leak bugs
can reveal and blocks some uninitialized stack variable attacks.
@@ -1473,7 +1473,7 @@ The tradeoff is the performance impact: on a single CPU system kernel
compilation sees a 1% slowdown, other systems and workloads may vary.
= ====================================================================
-0 Kernel stack erasing is disabled, STACKLEAK_METRICS are not updated.
+0 Kernel stack erasing is disabled, KSTACK_ERASE_METRICS are not updated.
1 Kernel stack erasing is enabled (default), it is performed before
returning to the userspace at the end of syscalls.
= ====================================================================
diff --git a/Documentation/arch/x86/x86_64/mm.rst b/Documentation/arch/x86/x86_64/mm.rst
index f2db178b353f..a6cf05d51bd8 100644
--- a/Documentation/arch/x86/x86_64/mm.rst
+++ b/Documentation/arch/x86/x86_64/mm.rst
@@ -176,5 +176,5 @@ Be very careful vs. KASLR when changing anything here. The KASLR address
range must not overlap with anything except the KASAN shadow area, which is
correct as KASAN disables KASLR.
-For both 4- and 5-level layouts, the STACKLEAK_POISON value in the last 2MB
+For both 4- and 5-level layouts, the KSTACK_ERASE_POISON value in the last 2MB
hole: ffffffffffff4111
diff --git a/Documentation/cdrom/cdrom-standard.rst b/Documentation/cdrom/cdrom-standard.rst
index 6c1303cff159..b97a4e9b9bd3 100644
--- a/Documentation/cdrom/cdrom-standard.rst
+++ b/Documentation/cdrom/cdrom-standard.rst
@@ -273,7 +273,6 @@ The drive-specific, minor-like information that is registered with
__u8 media_written; /* dirty flag, DVD+RW bookkeeping */
unsigned short mmc3_profile; /* current MMC3 profile */
int for_data; /* unknown:TBD */
- int (*exit)(struct cdrom_device_info *);/* unknown:TBD */
int mrw_mode_page; /* which MRW mode page is in use */
};
diff --git a/Documentation/cdrom/index.rst b/Documentation/cdrom/index.rst
index e9b022d70939..3ac4f716612f 100644
--- a/Documentation/cdrom/index.rst
+++ b/Documentation/cdrom/index.rst
@@ -8,7 +8,6 @@ CD-ROM
:maxdepth: 1
cdrom-standard
- packet-writing
.. only:: subproject and html
diff --git a/Documentation/cdrom/packet-writing.rst b/Documentation/cdrom/packet-writing.rst
deleted file mode 100644
index 43db58c50d29..000000000000
--- a/Documentation/cdrom/packet-writing.rst
+++ /dev/null
@@ -1,139 +0,0 @@
-==============
-Packet writing
-==============
-
-Getting started quick
----------------------
-
-- Select packet support in the block device section and UDF support in
- the file system section.
-
-- Compile and install kernel and modules, reboot.
-
-- You need the udftools package (pktsetup, mkudffs, cdrwtool).
- Download from https://github.com/pali/udftools
-
-- Grab a new CD-RW disc and format it (assuming CD-RW is hdc, substitute
- as appropriate)::
-
- # cdrwtool -d /dev/hdc -q
-
-- Setup your writer::
-
- # pktsetup dev_name /dev/hdc
-
-- Now you can mount /dev/pktcdvd/dev_name and copy files to it. Enjoy::
-
- # mount /dev/pktcdvd/dev_name /cdrom -t udf -o rw,noatime
-
-
-Packet writing for DVD-RW media
--------------------------------
-
-DVD-RW discs can be written to much like CD-RW discs if they are in
-the so called "restricted overwrite" mode. To put a disc in restricted
-overwrite mode, run::
-
- # dvd+rw-format /dev/hdc
-
-You can then use the disc the same way you would use a CD-RW disc::
-
- # pktsetup dev_name /dev/hdc
- # mount /dev/pktcdvd/dev_name /cdrom -t udf -o rw,noatime
-
-
-Packet writing for DVD+RW media
--------------------------------
-
-According to the DVD+RW specification, a drive supporting DVD+RW discs
-shall implement "true random writes with 2KB granularity", which means
-that it should be possible to put any filesystem with a block size >=
-2KB on such a disc. For example, it should be possible to do::
-
- # dvd+rw-format /dev/hdc (only needed if the disc has never
- been formatted)
- # mkudffs /dev/hdc
- # mount /dev/hdc /cdrom -t udf -o rw,noatime
-
-However, some drives don't follow the specification and expect the
-host to perform aligned writes at 32KB boundaries. Other drives do
-follow the specification, but suffer bad performance problems if the
-writes are not 32KB aligned.
-
-Both problems can be solved by using the pktcdvd driver, which always
-generates aligned writes::
-
- # dvd+rw-format /dev/hdc
- # pktsetup dev_name /dev/hdc
- # mkudffs /dev/pktcdvd/dev_name
- # mount /dev/pktcdvd/dev_name /cdrom -t udf -o rw,noatime
-
-
-Packet writing for DVD-RAM media
---------------------------------
-
-DVD-RAM discs are random writable, so using the pktcdvd driver is not
-necessary. However, using the pktcdvd driver can improve performance
-in the same way it does for DVD+RW media.
-
-
-Notes
------
-
-- CD-RW media can usually not be overwritten more than about 1000
- times, so to avoid unnecessary wear on the media, you should always
- use the noatime mount option.
-
-- Defect management (ie automatic remapping of bad sectors) has not
- been implemented yet, so you are likely to get at least some
- filesystem corruption if the disc wears out.
-
-- Since the pktcdvd driver makes the disc appear as a regular block
- device with a 2KB block size, you can put any filesystem you like on
- the disc. For example, run::
-
- # /sbin/mke2fs /dev/pktcdvd/dev_name
-
- to create an ext2 filesystem on the disc.
-
-
-Using the pktcdvd sysfs interface
----------------------------------
-
-Since Linux 2.6.20, the pktcdvd module has a sysfs interface
-and can be controlled by it. For example the "pktcdvd" tool uses
-this interface. (see http://tom.ist-im-web.de/linux/software/pktcdvd )
-
-"pktcdvd" works similar to "pktsetup", e.g.::
-
- # pktcdvd -a dev_name /dev/hdc
- # mkudffs /dev/pktcdvd/dev_name
- # mount -t udf -o rw,noatime /dev/pktcdvd/dev_name /dvdram
- # cp files /dvdram
- # umount /dvdram
- # pktcdvd -r dev_name
-
-
-For a description of the sysfs interface look into the file:
-
- Documentation/ABI/testing/sysfs-class-pktcdvd
-
-
-Using the pktcdvd debugfs interface
------------------------------------
-
-To read pktcdvd device infos in human readable form, do::
-
- # cat /sys/kernel/debug/pktcdvd/pktcdvd[0-7]/info
-
-For a description of the debugfs interface look into the file:
-
- Documentation/ABI/testing/debugfs-pktcdvd
-
-
-
-Links
------
-
-See http://fy.chalmers.se/~appro/linux/DVD+RW/ for more information
-about DVD writing.
diff --git a/Documentation/core-api/kernel-api.rst b/Documentation/core-api/kernel-api.rst
index ae92a2571388..111f6a595e48 100644
--- a/Documentation/core-api/kernel-api.rst
+++ b/Documentation/core-api/kernel-api.rst
@@ -136,26 +136,28 @@ Arithmetic Overflow Checking
CRC Functions
-------------
-.. kernel-doc:: lib/crc4.c
+.. kernel-doc:: lib/crc/crc4.c
:export:
-.. kernel-doc:: lib/crc7.c
+.. kernel-doc:: lib/crc/crc7.c
:export:
-.. kernel-doc:: lib/crc8.c
+.. kernel-doc:: lib/crc/crc8.c
:export:
-.. kernel-doc:: lib/crc16.c
+.. kernel-doc:: lib/crc/crc16.c
:export:
-.. kernel-doc:: lib/crc32.c
-
-.. kernel-doc:: lib/crc-ccitt.c
+.. kernel-doc:: lib/crc/crc-ccitt.c
:export:
-.. kernel-doc:: lib/crc-itu-t.c
+.. kernel-doc:: lib/crc/crc-itu-t.c
:export:
+.. kernel-doc:: include/linux/crc32.h
+
+.. kernel-doc:: include/linux/crc64.h
+
Base 2 log and power Functions
------------------------------
diff --git a/Documentation/driver-api/libata.rst b/Documentation/driver-api/libata.rst
index 5da27a749246..93d97fe78e3f 100644
--- a/Documentation/driver-api/libata.rst
+++ b/Documentation/driver-api/libata.rst
@@ -283,18 +283,25 @@ interrupts, start DMA engine, etc.
``->error_handler()`` is a driver's hook into probe, hotplug, and recovery
and other exceptional conditions. The primary responsibility of an
-implementation is to call :c:func:`ata_do_eh` or :c:func:`ata_bmdma_drive_eh`
-with a set of EH hooks as arguments:
+implementation is to call :c:func:`ata_std_error_handler`.
-'prereset' hook (may be NULL) is called during an EH reset, before any
-other actions are taken.
+:c:func:`ata_std_error_handler` will perform a standard error handling sequence
+to resurect failed devices, detach lost devices and add new devices (if any).
+This function will call the various reset operations for a port, as needed.
+These operations are as follows.
-'postreset' hook (may be NULL) is called after the EH reset is
-performed. Based on existing conditions, severity of the problem, and
-hardware capabilities,
+* The 'prereset' operation (which may be NULL) is called during an EH reset,
+ before any other action is taken.
-Either 'softreset' (may be NULL) or 'hardreset' (may be NULL) will be
-called to perform the low-level EH reset.
+* The 'postreset' hook (which may be NULL) is called after the EH reset is
+ performed. Based on existing conditions, severity of the problem, and hardware
+ capabilities,
+
+* Either the 'softreset' operation or the 'hardreset' operation will be called
+ to perform the low-level EH reset. If both operations are defined,
+ 'hardreset' is preferred and used. If both are not defined, no low-level reset
+ is performed and EH assumes that an ATA class device is connected through the
+ link.
::
diff --git a/Documentation/filesystems/fscrypt.rst b/Documentation/filesystems/fscrypt.rst
index 29e84d125e02..696a5844bfa3 100644
--- a/Documentation/filesystems/fscrypt.rst
+++ b/Documentation/filesystems/fscrypt.rst
@@ -147,9 +147,8 @@ However, these ioctls have some limitations:
were wiped. To partially solve this, you can add init_on_free=1 to
your kernel command line. However, this has a performance cost.
-- Secret keys might still exist in CPU registers, in crypto
- accelerator hardware (if used by the crypto API to implement any of
- the algorithms), or in other places not explicitly considered here.
+- Secret keys might still exist in CPU registers or in other places
+ not explicitly considered here.
Full system compromise
~~~~~~~~~~~~~~~~~~~~~~
@@ -406,9 +405,12 @@ the work is done by XChaCha12, which is much faster than AES when AES
acceleration is unavailable. For more information about Adiantum, see
`the Adiantum paper <https://eprint.iacr.org/2018/720.pdf>`_.
-The (AES-128-CBC-ESSIV, AES-128-CBC-CTS) pair exists only to support
-systems whose only form of AES acceleration is an off-CPU crypto
-accelerator such as CAAM or CESA that does not support XTS.
+The (AES-128-CBC-ESSIV, AES-128-CBC-CTS) pair was added to try to
+provide a more efficient option for systems that lack AES instructions
+in the CPU but do have a non-inline crypto engine such as CAAM or CESA
+that supports AES-CBC (and not AES-XTS). This is deprecated. It has
+been shown that just doing AES on the CPU is actually faster.
+Moreover, Adiantum is faster still and is recommended on such systems.
The remaining mode pairs are the "national pride ciphers":
@@ -468,14 +470,6 @@ API, but the filenames mode still does.
- Recommended:
- AES-CBC acceleration
-fscrypt also uses HMAC-SHA512 for key derivation, so enabling SHA-512
-acceleration is recommended:
-
-- SHA-512
- - Recommended:
- - arm64: CONFIG_CRYPTO_SHA512_ARM64_CE
- - x86: CONFIG_CRYPTO_SHA512_SSSE3
-
Contents encryption
-------------------
@@ -1326,22 +1320,13 @@ this by validating all top-level encryption policies prior to access.
Inline encryption support
=========================
-By default, fscrypt uses the kernel crypto API for all cryptographic
-operations (other than HKDF, which fscrypt partially implements
-itself). The kernel crypto API supports hardware crypto accelerators,
-but only ones that work in the traditional way where all inputs and
-outputs (e.g. plaintexts and ciphertexts) are in memory. fscrypt can
-take advantage of such hardware, but the traditional acceleration
-model isn't particularly efficient and fscrypt hasn't been optimized
-for it.
-
-Instead, many newer systems (especially mobile SoCs) have *inline
-encryption hardware* that can encrypt/decrypt data while it is on its
-way to/from the storage device. Linux supports inline encryption
-through a set of extensions to the block layer called *blk-crypto*.
-blk-crypto allows filesystems to attach encryption contexts to bios
-(I/O requests) to specify how the data will be encrypted or decrypted
-in-line. For more information about blk-crypto, see
+Many newer systems (especially mobile SoCs) have *inline encryption
+hardware* that can encrypt/decrypt data while it is on its way to/from
+the storage device. Linux supports inline encryption through a set of
+extensions to the block layer called *blk-crypto*. blk-crypto allows
+filesystems to attach encryption contexts to bios (I/O requests) to
+specify how the data will be encrypted or decrypted in-line. For more
+information about blk-crypto, see
:ref:`Documentation/block/inline-encryption.rst <inline_encryption>`.
On supported filesystems (currently ext4 and f2fs), fscrypt can use
diff --git a/Documentation/filesystems/fsverity.rst b/Documentation/filesystems/fsverity.rst
index dacdbc1149e6..412cf11e3298 100644
--- a/Documentation/filesystems/fsverity.rst
+++ b/Documentation/filesystems/fsverity.rst
@@ -185,8 +185,7 @@ FS_IOC_ENABLE_VERITY can fail with the following errors:
- ``ENOKEY``: the ".fs-verity" keyring doesn't contain the certificate
needed to verify the builtin signature
- ``ENOPKG``: fs-verity recognizes the hash algorithm, but it's not
- available in the kernel's crypto API as currently configured (e.g.
- for SHA-512, missing CONFIG_CRYPTO_SHA512).
+ available in the kernel as currently configured
- ``ENOTTY``: this type of filesystem does not implement fs-verity
- ``EOPNOTSUPP``: the kernel was not configured with fs-verity
support; or the filesystem superblock has not had the 'verity'
diff --git a/Documentation/filesystems/iomap/design.rst b/Documentation/filesystems/iomap/design.rst
index f2df9b6df988..0f7672676c0b 100644
--- a/Documentation/filesystems/iomap/design.rst
+++ b/Documentation/filesystems/iomap/design.rst
@@ -167,7 +167,6 @@ structure below:
struct dax_device *dax_dev;
void *inline_data;
void *private;
- const struct iomap_folio_ops *folio_ops;
u64 validity_cookie;
};
@@ -292,8 +291,6 @@ The fields are as follows:
<https://lore.kernel.org/all/20180619164137.13720-7-hch@lst.de/>`_.
This value will be passed unchanged to ``->iomap_end``.
- * ``folio_ops`` will be covered in the section on pagecache operations.
-
* ``validity_cookie`` is a magic freshness value set by the filesystem
that should be used to detect stale mappings.
For pagecache operations this is critical for correct operation
diff --git a/Documentation/filesystems/iomap/operations.rst b/Documentation/filesystems/iomap/operations.rst
index 3b628e370d88..067ed8e14ef3 100644
--- a/Documentation/filesystems/iomap/operations.rst
+++ b/Documentation/filesystems/iomap/operations.rst
@@ -57,21 +57,19 @@ The following address space operations can be wrapped easily:
* ``bmap``
* ``swap_activate``
-``struct iomap_folio_ops``
+``struct iomap_write_ops``
--------------------------
-The ``->iomap_begin`` function for pagecache operations may set the
-``struct iomap::folio_ops`` field to an ops structure to override
-default behaviors of iomap:
-
.. code-block:: c
- struct iomap_folio_ops {
+ struct iomap_write_ops {
struct folio *(*get_folio)(struct iomap_iter *iter, loff_t pos,
unsigned len);
void (*put_folio)(struct inode *inode, loff_t pos, unsigned copied,
struct folio *folio);
bool (*iomap_valid)(struct inode *inode, const struct iomap *iomap);
+ int (*read_folio_range)(const struct iomap_iter *iter,
+ struct folio *folio, loff_t pos, size_t len);
};
iomap calls these functions:
@@ -127,6 +125,10 @@ iomap calls these functions:
``->iomap_valid``, then the iomap should considered stale and the
validation failed.
+ - ``read_folio_range``: Called to synchronously read in the range that will
+ be written to. If this function is not provided, iomap will default to
+ submitting a bio read request.
+
These ``struct kiocb`` flags are significant for buffered I/O with iomap:
* ``IOCB_NOWAIT``: Turns on ``IOMAP_NOWAIT``.
@@ -271,7 +273,7 @@ writeback.
It does not lock ``i_rwsem`` or ``invalidate_lock``.
The dirty bit will be cleared for all folios run through the
-``->map_blocks`` machinery described below even if the writeback fails.
+``->writeback_range`` machinery described below even if the writeback fails.
This is to prevent dirty folio clots when storage devices fail; an
``-EIO`` is recorded for userspace to collect via ``fsync``.
@@ -283,15 +285,14 @@ The ``ops`` structure must be specified and is as follows:
.. code-block:: c
struct iomap_writeback_ops {
- int (*map_blocks)(struct iomap_writepage_ctx *wpc, struct inode *inode,
- loff_t offset, unsigned len);
- int (*submit_ioend)(struct iomap_writepage_ctx *wpc, int status);
- void (*discard_folio)(struct folio *folio, loff_t pos);
+ int (*writeback_range)(struct iomap_writepage_ctx *wpc,
+ struct folio *folio, u64 pos, unsigned int len, u64 end_pos);
+ int (*writeback_submit)(struct iomap_writepage_ctx *wpc, int error);
};
The fields are as follows:
- - ``map_blocks``: Sets ``wpc->iomap`` to the space mapping of the file
+ - ``writeback_range``: Sets ``wpc->iomap`` to the space mapping of the file
range (in bytes) given by ``offset`` and ``len``.
iomap calls this function for each dirty fs block in each dirty folio,
though it will `reuse mappings
@@ -306,27 +307,26 @@ The fields are as follows:
This revalidation must be open-coded by the filesystem; it is
unclear if ``iomap::validity_cookie`` can be reused for this
purpose.
- This function must be supplied by the filesystem.
-
- - ``submit_ioend``: Allows the file systems to hook into writeback bio
- submission.
- This might include pre-write space accounting updates, or installing
- a custom ``->bi_end_io`` function for internal purposes, such as
- deferring the ioend completion to a workqueue to run metadata update
- transactions from process context before submitting the bio.
- This function is optional.
- - ``discard_folio``: iomap calls this function after ``->map_blocks``
- fails to schedule I/O for any part of a dirty folio.
- The function should throw away any reservations that may have been
- made for the write.
+ If this methods fails to schedule I/O for any part of a dirty folio, it
+ should throw away any reservations that may have been made for the write.
The folio will be marked clean and an ``-EIO`` recorded in the
pagecache.
Filesystems can use this callback to `remove
<https://lore.kernel.org/all/20201029163313.1766967-1-bfoster@redhat.com/>`_
delalloc reservations to avoid having delalloc reservations for
clean pagecache.
- This function is optional.
+ This function must be supplied by the filesystem.
+
+ - ``writeback_submit``: Submit the previous built writeback context.
+ Block based file systems should use the iomap_ioend_writeback_submit
+ helper, other file system can implement their own.
+ File systems can optionall to hook into writeback bio submission.
+ This might include pre-write space accounting updates, or installing
+ a custom ``->bi_end_io`` function for internal purposes, such as
+ deferring the ioend completion to a workqueue to run metadata update
+ transactions from process context before submitting the bio.
+ This function must be supplied by the filesystem.
Pagecache Writeback Completion
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -340,10 +340,9 @@ If the write failed, it will also set the error bits on the folios and
the address space.
This can happen in interrupt or process context, depending on the
storage device.
-
Filesystems that need to update internal bookkeeping (e.g. unwritten
-extent conversions) should provide a ``->submit_ioend`` function to
-set ``struct iomap_end::bio::bi_end_io`` to its own function.
+extent conversions) should set their own bi_end_io on the bios
+submitted by ``->submit_writeback``
This function should call ``iomap_finish_ioends`` after finishing its
own work (e.g. unwritten extent conversion).
diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst
index 2e567e341c3b..aa287ccdac2f 100644
--- a/Documentation/filesystems/locking.rst
+++ b/Documentation/filesystems/locking.rst
@@ -87,8 +87,8 @@ prototypes::
int (*tmpfile) (struct mnt_idmap *, struct inode *,
struct file *, umode_t);
int (*fileattr_set)(struct mnt_idmap *idmap,
- struct dentry *dentry, struct fileattr *fa);
- int (*fileattr_get)(struct dentry *dentry, struct fileattr *fa);
+ struct dentry *dentry, struct file_kattr *fa);
+ int (*fileattr_get)(struct dentry *dentry, struct file_kattr *fa);
struct posix_acl * (*get_acl)(struct mnt_idmap *, struct dentry *, int);
struct offset_ctx *(*get_offset_ctx)(struct inode *inode);
@@ -253,10 +253,10 @@ prototypes::
int (*writepages)(struct address_space *, struct writeback_control *);
bool (*dirty_folio)(struct address_space *, struct folio *folio);
void (*readahead)(struct readahead_control *);
- int (*write_begin)(struct file *, struct address_space *mapping,
+ int (*write_begin)(const struct kiocb *, struct address_space *mapping,
loff_t pos, unsigned len,
struct folio **foliop, void **fsdata);
- int (*write_end)(struct file *, struct address_space *mapping,
+ int (*write_end)(const struct kiocb *, struct address_space *mapping,
loff_t pos, unsigned len, unsigned copied,
struct folio *folio, void *fsdata);
sector_t (*bmap)(struct address_space *, sector_t);
diff --git a/Documentation/filesystems/porting.rst b/Documentation/filesystems/porting.rst
index a5734bdd1cc7..85f590254f07 100644
--- a/Documentation/filesystems/porting.rst
+++ b/Documentation/filesystems/porting.rst
@@ -1224,9 +1224,6 @@ lookup_noperm_unlocked(), lookup_noperm_positive_unlocked(). They now
take a qstr instead of separate name and length. QSTR() can be used
when strlen() is needed for the length.
-For try_lookup_noperm() a reference to the qstr is passed in case the
-hash might subsequently be needed.
-
These function no longer do any permission checking - they previously
checked that the caller has 'X' permission on the parent. They must
ONLY be used internally by a filesystem on itself when it knows that
@@ -1258,3 +1255,33 @@ iterator needed. Instead of a cloned mount tree, the new interface returns
an array of struct path, one for each mount collect_mounts() would've
created. These struct path point to locations in the caller's namespace
that would be roots of the cloned mounts.
+
+---
+
+**mandatory**
+
+If your filesystem sets the default dentry_operations, use set_default_d_op()
+rather than manually setting sb->s_d_op.
+
+---
+
+**mandatory**
+
+d_set_d_op() is no longer exported (or public, for that matter); _if_
+your filesystem really needed that, make use of d_splice_alias_ops()
+to have them set. Better yet, think hard whether you need different
+->d_op for different dentries - if not, just use set_default_d_op()
+at mount time and be done with that. Currently procfs is the only
+thing that really needs ->d_op varying between dentries.
+
+---
+
+**highly recommended**
+
+The file operations mmap() callback is deprecated in favour of
+mmap_prepare(). This passes a pointer to a vm_area_desc to the callback
+rather than a VMA, as the VMA at this stage is not yet valid.
+
+The vm_area_desc provides the minimum required information for a filesystem
+to initialise state upon memory mapping of a file-backed region, and output
+parameters for the file system to set this state.
diff --git a/Documentation/filesystems/propagate_umount.txt b/Documentation/filesystems/propagate_umount.txt
new file mode 100644
index 000000000000..c90349e5b889
--- /dev/null
+++ b/Documentation/filesystems/propagate_umount.txt
@@ -0,0 +1,484 @@
+ Notes on propagate_umount()
+
+Umount propagation starts with a set of mounts we are already going to
+take out. Ideally, we would like to add all downstream cognates to
+that set - anything with the same mountpoint as one of the removed
+mounts and with parent that would receive events from the parent of that
+mount. However, there are some constraints the resulting set must
+satisfy.
+
+It is convenient to define several properties of sets of mounts:
+
+1) A set S of mounts is non-shifting if for any mount X belonging
+to S all subtrees mounted strictly inside of X (i.e. not overmounting
+the root of X) contain only elements of S.
+
+2) A set S is non-revealing if all locked mounts that belong to S have
+parents that also belong to S.
+
+3) A set S is closed if it contains all children of its elements.
+
+The set of mounts taken out by umount(2) must be non-shifting and
+non-revealing; the first constraint is what allows to reparent
+any remaining mounts and the second is what prevents the exposure
+of any concealed mountpoints.
+
+propagate_umount() takes the original set as an argument and tries to
+extend that set. The original set is a full subtree and its root is
+unlocked; what matters is that it's closed and non-revealing.
+Resulting set may not be closed; there might still be mounts outside
+of that set, but only on top of stacks of root-overmounting elements
+of set. They can be reparented to the place where the bottom of
+stack is attached to a mount that will survive. NOTE: doing that
+will violate a constraint on having no more than one mount with
+the same parent/mountpoint pair; however, the caller (umount_tree())
+will immediately remedy that - it may keep unmounted element attached
+to parent, but only if the parent itself is unmounted. Since all
+conflicts created by reparenting have common parent *not* in the
+set and one side of the conflict (bottom of the stack of overmounts)
+is in the set, it will be resolved. However, we rely upon umount_tree()
+doing that pretty much immediately after the call of propagate_umount().
+
+Algorithm is based on two statements:
+ 1) for any set S, there is a maximal non-shifting subset of S
+and it can be calculated in O(#S) time.
+ 2) for any non-shifting set S, there is a maximal non-revealing
+subset of S. That subset is also non-shifting and it can be calculated
+in O(#S) time.
+
+ Finding candidates.
+
+We are given a closed set U and we want to find all mounts that have
+the same mountpoint as some mount m in U *and* whose parent receives
+propagation from the parent of the same mount m. Naive implementation
+would be
+ S = {}
+ for each m in U
+ add m to S
+ p = parent(m)
+ for each q in Propagation(p) - {p}
+ child = look_up(q, mountpoint(m))
+ if child
+ add child to S
+but that can lead to excessive work - there might be propagation among the
+subtrees of U, in which case we'd end up examining the same candidates
+many times. Since propagation is transitive, the same will happen to
+everything downstream of that candidate and it's not hard to construct
+cases where the approach above leads to the time quadratic by the actual
+number of candidates.
+
+Note that if we run into a candidate we'd already seen, it must've been
+added on an earlier iteration of the outer loop - all additions made
+during one iteration of the outer loop have different parents. So
+if we find a child already added to the set, we know that everything
+in Propagation(parent(child)) with the same mountpoint has been already
+added.
+ S = {}
+ for each m in U
+ if m in S
+ continue
+ add m to S
+ p = parent(m)
+ q = propagation_next(p, p)
+ while q
+ child = look_up(q, mountpoint(m))
+ if child
+ if child in S
+ q = skip_them(q, p)
+ continue;
+ add child to S
+ q = propagation_next(q, p)
+where
+skip_them(q, p)
+ keep walking Propagation(p) from q until we find something
+ not in Propagation(q)
+
+would get rid of that problem, but we need a sane implementation of
+skip_them(). That's not hard to do - split propagation_next() into
+"down into mnt_slave_list" and "forward-and-up" parts, with the
+skip_them() being "repeat the forward-and-up part until we get NULL
+or something that isn't a peer of the one we are skipping".
+
+Note that there can be no absolute roots among the extra candidates -
+they all come from mount lookups. Absolute root among the original
+set is _currently_ impossible, but it might be worth protecting
+against.
+
+ Maximal non-shifting subsets.
+
+Let's call a mount m in a set S forbidden in that set if there is a
+subtree mounted strictly inside m and containing mounts that do not
+belong to S.
+
+The set is non-shifting when none of its elements are forbidden in it.
+
+If mount m is forbidden in a set S, it is forbidden in any subset S' it
+belongs to. In other words, it can't belong to any of the non-shifting
+subsets of S. If we had a way to find a forbidden mount or show that
+there's none, we could use it to find the maximal non-shifting subset
+simply by finding and removing them until none remain.
+
+Suppose mount m is forbidden in S; then any mounts forbidden in S - {m}
+must have been forbidden in S itself. Indeed, since m has descendents
+that do not belong to S, any subtree that fits into S will fit into
+S - {m} as well.
+
+So in principle we could go through elements of S, checking if they
+are forbidden in S and removing the ones that are. Removals will
+not invalidate the checks done for earlier mounts - if they were not
+forbidden at the time we checked, they won't become forbidden later.
+It's too costly to be practical, but there is a similar approach that
+is linear by size of S.
+
+Let's say that mount x in a set S is forbidden by mount y, if
+ * both x and y belong to S.
+ * there is a chain of mounts starting at x and leaving S
+ immediately after passing through y, with the first
+ mountpoint strictly inside x.
+Note 1: x may be equal to y - that's the case when something not
+belonging to S is mounted strictly inside x.
+Note 2: if y does not belong to S, it can't forbid anything in S.
+Note 3: if y has no children outside of S, it can't forbid anything in S.
+
+It's easy to show that mount x is forbidden in S if and only if x is
+forbidden in S by some mount y. And it's easy to find all mounts in S
+forbidden by a given mount.
+
+Consider the following operation:
+ Trim(S, m) = S - {x : x is forbidden by m in S}
+
+Note that if m does not belong to S or has no children outside of S we
+are guaranteed that Trim(S, m) is equal to S.
+
+The following is true: if x is forbidden by y in Trim(S, m), it was
+already forbidden by y in S.
+
+Proof: Suppose x is forbidden by y in Trim(S, m). Then there is a
+chain of mounts (x_0 = x, ..., x_k = y, x_{k+1} = r), such that x_{k+1}
+is the first element that doesn't belong to Trim(S, m) and the
+mountpoint of x_1 is strictly inside x. If mount r belongs to S, it must
+have been removed by Trim(S, m), i.e. it was forbidden in S by m.
+Then there was a mount chain from r to some child of m that stayed in
+S all the way until m, but that's impossible since x belongs to Trim(S, m)
+and prepending (x_0, ..., x_k) to that chain demonstrates that x is also
+forbidden in S by m, and thus can't belong to Trim(S, m).
+Therefore r can not belong to S and our chain demonstrates that
+x is forbidden by y in S. QED.
+
+Corollary: no mount is forbidden by m in Trim(S, m). Indeed, any
+such mount would have been forbidden by m in S and thus would have been
+in the part of S removed in Trim(S, m).
+
+Corollary: no mount is forbidden by m in Trim(Trim(S, m), n). Indeed,
+any such would have to have been forbidden by m in Trim(S, m), which
+is impossible.
+
+Corollary: after
+ S = Trim(S, x_1)
+ S = Trim(S, x_2)
+ ...
+ S = Trim(S, x_k)
+no mount remaining in S will be forbidden by either of x_1,...,x_k.
+
+The following will reduce S to its maximal non-shifting subset:
+ visited = {}
+ while S contains elements not belonging to visited
+ let m be an arbitrary such element of S
+ S = Trim(S, m)
+ add m to visited
+
+S never grows, so the number of elements of S not belonging to visited
+decreases at least by one on each iteration. When the loop terminates,
+all mounts remaining in S belong to visited. It's easy to see that at
+the beginning of each iteration no mount remaining in S will be forbidden
+by any element of visited. In other words, no mount remaining in S will
+be forbidden, i.e. final value of S will be non-shifting. It will be
+the maximal non-shifting subset, since we were removing only forbidden
+elements.
+
+ There are two difficulties in implementing the above in linear
+time, both due to the fact that Trim() might need to remove more than one
+element. Naive implementation of Trim() is vulnerable to running into a
+long chain of mounts, each mounted on top of parent's root. Nothing in
+that chain is forbidden, so nothing gets removed from it. We need to
+recognize such chains and avoid walking them again on subsequent calls of
+Trim(), otherwise we will end up with worst-case time being quadratic by
+the number of elements in S. Another difficulty is in implementing the
+outer loop - we need to iterate through all elements of a shrinking set.
+That would be trivial if we never removed more than one element at a time
+(linked list, with list_for_each_entry_safe for iterator), but we may
+need to remove more than one entry, possibly including the ones we have
+already visited.
+
+ Let's start with naive algorithm for Trim():
+
+Trim_one(m)
+ found = false
+ for each n in children(m)
+ if n not in S
+ found = true
+ if (mountpoint(n) != root(m))
+ remove m from S
+ break
+ if found
+ Trim_ancestors(m)
+
+Trim_ancestors(m)
+ for (; parent(m) in S; m = parent(m)) {
+ if (mountpoint(m) != root(parent(m)))
+ remove parent(m) from S
+ }
+
+If m belongs to S, Trim_one(m) will replace S with Trim(S, m).
+Proof:
+ Consider the chains excluding elements from Trim(S, m). The last
+two elements in such chain are m and some child of m that does not belong
+to S. If m has no such children, Trim(S, m) is equal to S.
+ m itself is removed if and only if the chain has exactly two
+elements, i.e. when the last element does not overmount the root of m.
+In other words, that happens when m has a child not in S that does not
+overmount the root of m.
+ All other elements to remove will be ancestors of m, such that
+the entire descent chain from them to m is contained in S. Let
+(x_0, x_1, ..., x_k = m) be the longest such chain. x_i needs to be
+removed if and only if x_{i+1} does not overmount its root. It's easy
+to see that Trim_ancestors(m) will iterate through that chain from
+x_k to x_1 and that it will remove exactly the elements that need to be
+removed.
+
+ Note that if the loop in Trim_ancestors() walks into an already
+visited element, we are guaranteed that remaining iterations will see
+only elements that had already been visited and remove none of them.
+That's the weakness that makes it vulnerable to long chains of full
+overmounts.
+
+ It's easy to deal with, if we can afford setting marks on
+elements of S; we would mark all elements already visited by
+Trim_ancestors() and have it bail out as soon as it sees an already
+marked element.
+
+ The problems with iterating through the set can be dealt with in
+several ways, depending upon the representation we choose for our set.
+One useful observation is that we are given a closed subset in S - the
+original set passed to propagate_umount(). Its elements can neither
+forbid anything nor be forbidden by anything - all their descendents
+belong to S, so they can not occur anywhere in any excluding chain.
+In other words, the elements of that subset will remain in S until
+the end and Trim_one(S, m) is a no-op for all m from that subset.
+
+ That suggests keeping S as a disjoint union of a closed set U
+('will be unmounted, no matter what') and the set of all elements of
+S that do not belong to U. That set ('candidates') is all we need
+to iterate through. Let's represent it as a subset in a cyclic list,
+consisting of all list elements that are marked as candidates (initially -
+all of them). Then we could have Trim_ancestors() only remove the mark,
+leaving the elements on the list. Then Trim_one() would never remove
+anything other than its argument from the containing list, allowing to
+use list_for_each_entry_safe() as iterator.
+
+ Assuming that representation we get the following:
+
+ list_for_each_entry_safe(m, ..., Candidates, ...)
+ Trim_one(m)
+where
+Trim_one(m)
+ if (m is not marked as a candidate)
+ strip the "seen by Trim_ancestors" mark from m
+ remove m from the Candidates list
+ return
+
+ remove_this = false
+ found = false
+ for each n in children(m)
+ if n not in S
+ found = true
+ if (mountpoint(n) != root(m))
+ remove_this = true
+ break
+ if found
+ Trim_ancestors(m)
+ if remove_this
+ strip the "seen by Trim_ancestors" mark from m
+ strip the "candidate" mark from m
+ remove m from the Candidate list
+
+Trim_ancestors(m)
+ for (p = parent(m); p is marked as candidate ; m = p, p = parent(p)) {
+ if m is marked as seen by Trim_ancestors
+ return
+ mark m as seen by Trim_ancestors
+ if (mountpoint(m) != root(p))
+ strip the "candidate" mark from p
+ }
+
+ Terminating condition in the loop in Trim_ancestors() is correct,
+since that that loop will never run into p belonging to U - p is always
+an ancestor of argument of Trim_one() and since U is closed, the argument
+of Trim_one() would also have to belong to U. But Trim_one() is never
+called for elements of U. In other words, p belongs to S if and only
+if it belongs to candidates.
+
+ Time complexity:
+* we get no more than O(#S) calls of Trim_one()
+* the loop over children in Trim_one() never looks at the same child
+twice through all the calls.
+* iterations of that loop for children in S are no more than O(#S)
+in the worst case
+* at most two children that are not elements of S are considered per
+call of Trim_one().
+* the loop in Trim_ancestors() sets its mark once per iteration and
+no element of S has is set more than once.
+
+ In the end we may have some elements excluded from S by
+Trim_ancestors() still stuck on the list. We could do a separate
+loop removing them from the list (also no worse than O(#S) time),
+but it's easier to leave that until the next phase - there we will
+iterate through the candidates anyway.
+
+ The caller has already removed all elements of U from their parents'
+lists of children, which means that checking if child belongs to S is
+equivalent to checking if it's marked as a candidate; we'll never see
+the elements of U in the loop over children in Trim_one().
+
+ What's more, if we see that children(m) is empty and m is not
+locked, we can immediately move m into the committed subset (remove
+from the parent's list of children, etc.). That's one fewer mount we'll
+have to look into when we check the list of children of its parent *and*
+when we get to building the non-revealing subset.
+
+ Maximal non-revealing subsets
+
+If S is not a non-revealing subset, there is a locked element x in S
+such that parent of x is not in S.
+
+Obviously, no non-revealing subset of S may contain x. Removing such
+elements one by one will obviously end with the maximal non-revealing
+subset (possibly empty one). Note that removal of an element will
+require removal of all its locked children, etc.
+
+If the set had been non-shifting, it will remain non-shifting after
+such removals.
+Proof: suppose S was non-shifting, x is a locked element of S, parent of x
+is not in S and S - {x} is not non-shifting. Then there is an element m
+in S - {x} and a subtree mounted strictly inside m, such that m contains
+an element not in in S - {x}. Since S is non-shifting, everything in
+that subtree must belong to S. But that means that this subtree must
+contain x somewhere *and* that parent of x either belongs that subtree
+or is equal to m. Either way it must belong to S. Contradiction.
+
+// same representation as for finding maximal non-shifting subsets:
+// S is a disjoint union of a non-revealing set U (the ones we are committed
+// to unmount) and a set of candidates, represented as a subset of list
+// elements that have "is a candidate" mark on them.
+// Elements of U are removed from their parents' lists of children.
+// In the end candidates becomes empty and maximal non-revealing non-shifting
+// subset of S is now in U
+ while (Candidates list is non-empty)
+ handle_locked(first(Candidates))
+
+handle_locked(m)
+ if m is not marked as a candidate
+ strip the "seen by Trim_ancestors" mark from m
+ remove m from the list
+ return
+ cutoff = m
+ for (p = m; p in candidates; p = parent(p)) {
+ strip the "seen by Trim_ancestors" mark from p
+ strip the "candidate" mark from p
+ remove p from the Candidates list
+ if (!locked(p))
+ cutoff = parent(p)
+ }
+ if p in U
+ cutoff = p
+ while m != cutoff
+ remove m from children(parent(m))
+ add m to U
+ m = parent(m)
+
+Let (x_0, ..., x_n = m) be the maximal chain of descent of m within S.
+* If it contains some elements of U, let x_k be the last one of those.
+Then union of U with {x_{k+1}, ..., x_n} is obviously non-revealing.
+* otherwise if all its elements are locked, then none of {x_0, ..., x_n}
+may be elements of a non-revealing subset of S.
+* otherwise let x_k be the first unlocked element of the chain. Then none
+of {x_0, ..., x_{k-1}} may be an element of a non-revealing subset of
+S and union of U and {x_k, ..., x_n} is non-revealing.
+
+handle_locked(m) finds which of these cases applies and adjusts Candidates
+and U accordingly. U remains non-revealing, union of Candidates and
+U still contains any non-revealing subset of S and after the call of
+handle_locked(m) m is guaranteed to be not in Candidates list. So having
+it called for each element of S would suffice to empty Candidates,
+leaving U the maximal non-revealing subset of S.
+
+However, handle_locked(m) is a no-op when m belongs to U, so it's enough
+to have it called for elements of Candidates list until none remain.
+
+Time complexity: number of calls of handle_locked() is limited by
+#Candidates, each iteration of the first loop in handle_locked() removes
+an element from the list, so their total number of executions is also
+limited by #Candidates; number of iterations in the second loop is no
+greater than the number of iterations of the first loop.
+
+
+ Reparenting
+
+After we'd calculated the final set, we still need to deal with
+reparenting - if an element of the final set has a child not in it,
+we need to reparent such child.
+
+Such children can only be root-overmounting (otherwise the set wouldn't
+be non-shifting) and their parents can not belong to the original set,
+since the original is guaranteed to be closed.
+
+
+ Putting all of that together
+
+The plan is to
+ * find all candidates
+ * trim down to maximal non-shifting subset
+ * trim down to maximal non-revealing subset
+ * reparent anything that needs to be reparented
+ * return the resulting set to the caller
+
+For the 2nd and 3rd steps we want to separate the set into growing
+non-revealing subset, initially containing the original set ("U" in
+terms of the pseudocode above) and everything we are still not sure about
+("candidates"). It means that for the output of the 1st step we'd like
+the extra candidates separated from the stuff already in the original set.
+For the 4th step we would like the additions to U separate from the
+original set.
+
+So let's go for
+ * original set ("set"). Linkage via mnt_list
+ * undecided candidates ("candidates"). Subset of a list,
+consisting of all its elements marked with a new flag (T_UMOUNT_CANDIDATE).
+Initially all elements of the list will be marked that way; in the
+end the list will become empty and no mounts will remain marked with
+that flag.
+ * Reuse T_MARKED for "has been already seen by trim_ancestors()".
+ * anything in U that hadn't been in the original set - elements of
+candidates will gradually be either discarded or moved there. In other
+words, it's the candidates we have already decided to unmount. Its role
+is reasonably close to the old "to_umount", so let's use that name.
+Linkage via mnt_list.
+
+For gather_candidates() we'll need to maintain both candidates (S -
+set) and intersection of S with set. Use T_UMOUNT_CANDIDATE for
+all elements we encounter, putting the ones not already in the original
+set into the list of candidates. When we are done, strip that flag from
+all elements of the original set. That gives a cheap way to check
+if element belongs to S (in gather_candidates) and to candidates
+itself (at later stages). Call that predicate is_candidate(); it would
+be m->mnt_t_flags & T_UMOUNT_CANDIDATE.
+
+All elements of the original set are marked with MNT_UMOUNT and we'll
+need the same for elements added when joining the contents of to_umount
+to set in the end. Let's set MNT_UMOUNT at the time we add an element
+to to_umount; that's close to what the old 'umount_one' is doing, so
+let's keep that name. It also gives us another predicate we need -
+"belongs to union of set and to_umount"; will_be_unmounted() for now.
+
+Removals from the candidates list should strip both T_MARKED and
+T_UMOUNT_CANDIDATE; call it remove_from_candidates_list().
diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
index fd32a9a17bfb..486a91633474 100644
--- a/Documentation/filesystems/vfs.rst
+++ b/Documentation/filesystems/vfs.rst
@@ -515,8 +515,8 @@ As of kernel 2.6.22, the following members are defined:
struct posix_acl * (*get_acl)(struct mnt_idmap *, struct dentry *, int);
int (*set_acl)(struct mnt_idmap *, struct dentry *, struct posix_acl *, int);
int (*fileattr_set)(struct mnt_idmap *idmap,
- struct dentry *dentry, struct fileattr *fa);
- int (*fileattr_get)(struct dentry *dentry, struct fileattr *fa);
+ struct dentry *dentry, struct file_kattr *fa);
+ int (*fileattr_get)(struct dentry *dentry, struct file_kattr *fa);
struct offset_ctx *(*get_offset_ctx)(struct inode *inode);
};
@@ -758,8 +758,9 @@ process is more complicated and uses write_begin/write_end or
dirty_folio to write data into the address_space, and
writepages to writeback data to storage.
-Adding and removing pages to/from an address_space is protected by the
-inode's i_mutex.
+Removing pages from an address_space requires holding the inode's i_rwsem
+exclusively, while adding pages to the address_space requires holding the
+inode's i_mapping->invalidate_lock exclusively.
When data is written to a page, the PG_Dirty flag should be set. It
typically remains set until writepages asks for it to be written. This
@@ -822,10 +823,10 @@ cache in your filesystem. The following members are defined:
int (*writepages)(struct address_space *, struct writeback_control *);
bool (*dirty_folio)(struct address_space *, struct folio *);
void (*readahead)(struct readahead_control *);
- int (*write_begin)(struct file *, struct address_space *mapping,
+ int (*write_begin)(const struct kiocb *, struct address_space *mapping,
loff_t pos, unsigned len,
- struct page **pagep, void **fsdata);
- int (*write_end)(struct file *, struct address_space *mapping,
+ struct page **pagep, void **fsdata);
+ int (*write_end)(const struct kiocb *, struct address_space *mapping,
loff_t pos, unsigned len, unsigned copied,
struct folio *folio, void *fsdata);
sector_t (*bmap)(struct address_space *, sector_t);
@@ -1071,12 +1072,14 @@ This describes how the VFS can manipulate an open file. As of kernel
struct file_operations {
struct module *owner;
+ fop_flags_t fop_flags;
loff_t (*llseek) (struct file *, loff_t, int);
ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
- int (*iopoll)(struct kiocb *kiocb, bool spin);
+ int (*iopoll)(struct kiocb *kiocb, struct io_comp_batch *,
+ unsigned int flags);
int (*iterate_shared) (struct file *, struct dir_context *);
__poll_t (*poll) (struct file *, struct poll_table_struct *);
long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
@@ -1093,18 +1096,24 @@ This describes how the VFS can manipulate an open file. As of kernel
int (*flock) (struct file *, int, struct file_lock *);
ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int);
ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int);
- int (*setlease)(struct file *, long, struct file_lock **, void **);
+ void (*splice_eof)(struct file *file);
+ int (*setlease)(struct file *, int, struct file_lease **, void **);
long (*fallocate)(struct file *file, int mode, loff_t offset,
loff_t len);
void (*show_fdinfo)(struct seq_file *m, struct file *f);
#ifndef CONFIG_MMU
unsigned (*mmap_capabilities)(struct file *);
#endif
- ssize_t (*copy_file_range)(struct file *, loff_t, struct file *, loff_t, size_t, unsigned int);
+ ssize_t (*copy_file_range)(struct file *, loff_t, struct file *,
+ loff_t, size_t, unsigned int);
loff_t (*remap_file_range)(struct file *file_in, loff_t pos_in,
struct file *file_out, loff_t pos_out,
loff_t len, unsigned int remap_flags);
int (*fadvise)(struct file *, loff_t, loff_t, int);
+ int (*uring_cmd)(struct io_uring_cmd *ioucmd, unsigned int issue_flags);
+ int (*uring_cmd_iopoll)(struct io_uring_cmd *, struct io_comp_batch *,
+ unsigned int poll_flags);
+ int (*mmap_prepare)(struct vm_area_desc *);
};
Again, all methods are called without any locks being held, unless
@@ -1144,7 +1153,8 @@ otherwise noted.
used on 64 bit kernels.
``mmap``
- called by the mmap(2) system call
+ called by the mmap(2) system call. Deprecated in favour of
+ ``mmap_prepare``.
``open``
called by the VFS when an inode should be opened. When the VFS
@@ -1221,6 +1231,11 @@ otherwise noted.
``fadvise``
possibly called by the fadvise64() system call.
+``mmap_prepare``
+ Called by the mmap(2) system call. Allows a VFS to set up a
+ file-backed memory mapping, most notably establishing relevant
+ private state and VMA callbacks.
+
Note that the file operations are implemented by the specific
filesystem in which the inode resides. When opening a device node
(character or block special) most filesystems will call special
diff --git a/Documentation/nvme/nvme-pci-endpoint-target.rst b/Documentation/nvme/nvme-pci-endpoint-target.rst
index b699595d1762..69edf44f0d9a 100644
--- a/Documentation/nvme/nvme-pci-endpoint-target.rst
+++ b/Documentation/nvme/nvme-pci-endpoint-target.rst
@@ -6,21 +6,21 @@ NVMe PCI Endpoint Function Target
:Author: Damien Le Moal <dlemoal@kernel.org>
-The NVMe PCI endpoint function target driver implements a NVMe PCIe controller
-using a NVMe fabrics target controller configured with the PCI transport type.
+The NVMe PCI endpoint function target driver implements an NVMe PCIe controller
+using an NVMe fabrics target controller configured with the PCI transport type.
Overview
========
-The NVMe PCI endpoint function target driver allows exposing a NVMe target
+The NVMe PCI endpoint function target driver allows exposing an NVMe target
controller over a PCIe link, thus implementing an NVMe PCIe device similar to a
regular M.2 SSD. The target controller is created in the same manner as when
using NVMe over fabrics: the controller represents the interface to an NVMe
subsystem using a port. The port transfer type must be configured to be
"pci". The subsystem can be configured to have namespaces backed by regular
files or block devices, or can use NVMe passthrough to expose to the PCI host an
-existing physical NVMe device or a NVMe fabrics host controller (e.g. a NVMe TCP
-host controller).
+existing physical NVMe device or an NVMe fabrics host controller (e.g. a NVMe
+TCP host controller).
The NVMe PCI endpoint function target driver relies as much as possible on the
NVMe target core code to parse and execute NVMe commands submitted by the PCIe
@@ -181,10 +181,10 @@ Creating an NVMe endpoint device is a two step process. First, an NVMe target
subsystem and port must be defined. Second, the NVMe PCI endpoint device must
be setup and bound to the subsystem and port created.
-Creating a NVMe Subsystem and Port
-----------------------------------
+Creating an NVMe Subsystem and Port
+-----------------------------------
-Details about how to configure a NVMe target subsystem and port are outside the
+Details about how to configure an NVMe target subsystem and port are outside the
scope of this document. The following only provides a simple example of a port
and subsystem with a single namespace backed by a null_blk device.
@@ -234,8 +234,8 @@ Finally, create the target port and link it to the subsystem::
# ln -s /sys/kernel/config/nvmet/subsystems/nvmepf.0.nqn \
/sys/kernel/config/nvmet/ports/1/subsystems/nvmepf.0.nqn
-Creating a NVMe PCI Endpoint Device
------------------------------------
+Creating an NVMe PCI Endpoint Device
+------------------------------------
With the NVMe target subsystem and port ready for use, the NVMe PCI endpoint
device can now be created and enabled. The NVMe PCI endpoint target driver
@@ -303,7 +303,7 @@ device controller::
nvmet_pci_epf nvmet_pci_epf.0: Enabling controller
-On the host side, the NVMe PCI endpoint function target device will is
+On the host side, the NVMe PCI endpoint function target device is
discoverable as a PCI device, with the vendor ID and device ID as configured::
# lspci -n
diff --git a/Documentation/security/credentials.rst b/Documentation/security/credentials.rst
index 2aa0791bcefe..d0191c8b8060 100644
--- a/Documentation/security/credentials.rst
+++ b/Documentation/security/credentials.rst
@@ -555,5 +555,5 @@ the VFS, and that can be done by calling into such as ``vfs_mkdir()`` with a
different set of credentials. This is done in the following places:
* ``sys_faccessat()``.
- * ``do_coredump()``.
+ * ``vfs_coredump()``.
* nfs4recover.c.
diff --git a/Documentation/security/self-protection.rst b/Documentation/security/self-protection.rst
index 910668e665cb..a32ca23c21b0 100644
--- a/Documentation/security/self-protection.rst
+++ b/Documentation/security/self-protection.rst
@@ -303,7 +303,7 @@ Memory poisoning
When releasing memory, it is best to poison the contents, to avoid reuse
attacks that rely on the old contents of memory. E.g., clear stack on a
-syscall return (``CONFIG_GCC_PLUGIN_STACKLEAK``), wipe heap memory on a
+syscall return (``CONFIG_KSTACK_ERASE``), wipe heap memory on a
free. This frustrates many uninitialized variable attacks, stack content
exposures, heap content exposures, and use-after-free attacks.
diff --git a/Documentation/translations/zh_CN/security/credentials.rst b/Documentation/translations/zh_CN/security/credentials.rst
index 91c353dfb622..88fcd9152ffe 100644
--- a/Documentation/translations/zh_CN/security/credentials.rst
+++ b/Documentation/translations/zh_CN/security/credentials.rst
@@ -475,5 +475,5 @@ const指针上操作,因此不需要进行类型转换,但需要临时放弃
如 ``vfs_mkdir()`` 来实现。以下是一些进行此操作的位置:
* ``sys_faccessat()``.
- * ``do_coredump()``.
+ * ``vfs_coredump()``.
* nfs4recover.c.
diff --git a/Documentation/translations/zh_CN/security/self-protection.rst b/Documentation/translations/zh_CN/security/self-protection.rst
index 3c8a68b1e1be..93de9cee5c1a 100644
--- a/Documentation/translations/zh_CN/security/self-protection.rst
+++ b/Documentation/translations/zh_CN/security/self-protection.rst
@@ -259,7 +259,7 @@ KALLSYSM,则会直接打印原始地址。
--------
在释放内存时,最好对内存内容进行清除处理,以防止攻击者重用内存中以前
-的内容。例如,在系统调用返回时清除堆栈(CONFIG_GCC_PLUGIN_STACKLEAK),
+的内容。例如,在系统调用返回时清除堆栈(CONFIG_KSTACK_ERASE),
在释放堆内容是清除其内容。这有助于防止许多未初始化变量攻击、堆栈内容
泄露、堆内容泄露以及使用后释放攻击(user-after-free)。
diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
index bc91756bde73..4f1532a251d2 100644
--- a/Documentation/userspace-api/ioctl/ioctl-number.rst
+++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
@@ -220,7 +220,6 @@ Code Seq# Include File Comments
include/linux/falloc.h,
linux/fs.h,
'X' all fs/ocfs2/ocfs_fs.h conflict!
-'X' 01 linux/pktcdvd.h conflict!
'Z' 14-15 drivers/message/fusion/mptctl.h
'[' 00-3F linux/usb/tmc.h USB Test and Measurement Devices
<mailto:gregkh@linuxfoundation.org>