summaryrefslogtreecommitdiff
path: root/Documentation
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation')
-rw-r--r--Documentation/ABI/stable/sysfs-block3
-rw-r--r--Documentation/admin-guide/LSM/ipe.rst7
-rw-r--r--Documentation/admin-guide/cgroup-v2.rst11
-rw-r--r--Documentation/admin-guide/kernel-parameters.txt11
-rw-r--r--Documentation/admin-guide/mm/transhuge.rst2
-rw-r--r--Documentation/admin-guide/pm/cpufreq.rst20
-rw-r--r--Documentation/admin-guide/sysctl/fs.rst5
-rw-r--r--Documentation/core-api/protection-keys.rst38
-rw-r--r--Documentation/devicetree/bindings/display/elgin,jg10309-01.yaml54
-rw-r--r--Documentation/devicetree/bindings/display/mediatek/mediatek,dpi.yaml24
-rw-r--r--Documentation/devicetree/bindings/display/mediatek/mediatek,split.yaml19
-rw-r--r--Documentation/devicetree/bindings/firmware/arm,scmi.yaml2
-rw-r--r--Documentation/devicetree/bindings/iio/adc/adi,ad7380.yaml21
-rw-r--r--Documentation/devicetree/bindings/iio/dac/adi,ad5686.yaml53
-rw-r--r--Documentation/devicetree/bindings/iio/dac/adi,ad5696.yaml3
-rw-r--r--Documentation/devicetree/bindings/interrupt-controller/fsl,ls-extirq.yaml26
-rw-r--r--Documentation/devicetree/bindings/misc/fsl,qoriq-mc.yaml2
-rw-r--r--Documentation/devicetree/bindings/net/brcm,unimac-mdio.yaml1
-rw-r--r--Documentation/devicetree/bindings/net/xlnx,axi-ethernet.yaml2
-rw-r--r--Documentation/devicetree/bindings/phy/qcom,sc8280xp-qmp-pcie-phy.yaml5
-rw-r--r--Documentation/devicetree/bindings/sound/davinci-mcasp-audio.yaml18
-rw-r--r--Documentation/devicetree/bindings/sound/rockchip,rk3308-codec.yaml4
-rw-r--r--Documentation/devicetree/bindings/trivial-devices.yaml2
-rw-r--r--Documentation/filesystems/caching/cachefiles.rst2
-rw-r--r--Documentation/filesystems/index.rst1
-rw-r--r--Documentation/filesystems/iomap/operations.rst17
-rw-r--r--Documentation/filesystems/multigrain-ts.rst125
-rw-r--r--Documentation/filesystems/netfs_library.rst1
-rw-r--r--Documentation/filesystems/nfs/exporting.rst7
-rw-r--r--Documentation/filesystems/overlayfs.rst17
-rw-r--r--Documentation/filesystems/tmpfs.rst24
-rw-r--r--Documentation/iio/ad7380.rst13
-rw-r--r--Documentation/mm/damon/maintainer-profile.rst38
-rw-r--r--Documentation/netlink/specs/mptcp_pm.yaml1
-rw-r--r--Documentation/networking/devmem.rst9
-rw-r--r--Documentation/networking/j1939.rst2
-rw-r--r--Documentation/networking/packet_mmap.rst5
-rw-r--r--Documentation/networking/tcp_ao.rst20
-rw-r--r--Documentation/process/maintainer-netdev.rst17
-rw-r--r--Documentation/process/maintainer-soc.rst42
-rw-r--r--Documentation/rust/arch-support.rst2
-rw-r--r--Documentation/scheduler/sched-ext.rst2
-rw-r--r--Documentation/security/landlock.rst14
-rw-r--r--Documentation/userspace-api/landlock.rst90
-rw-r--r--Documentation/userspace-api/mseal.rst307
-rw-r--r--Documentation/virt/kvm/api.rst16
-rw-r--r--Documentation/virt/kvm/locking.rst2
47 files changed, 743 insertions, 364 deletions
diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block
index cea8856f798d..7a820a7d53aa 100644
--- a/Documentation/ABI/stable/sysfs-block
+++ b/Documentation/ABI/stable/sysfs-block
@@ -594,6 +594,9 @@ Description:
[RW] Maximum number of kilobytes to read-ahead for filesystems
on this block device.
+ For MADV_HUGEPAGE, the readahead size may exceed this setting
+ since its granularity is based on the hugepage size.
+
What: /sys/block/<disk>/queue/rotational
Date: January 2009
diff --git a/Documentation/admin-guide/LSM/ipe.rst b/Documentation/admin-guide/LSM/ipe.rst
index f38e641df0e9..f93a467db628 100644
--- a/Documentation/admin-guide/LSM/ipe.rst
+++ b/Documentation/admin-guide/LSM/ipe.rst
@@ -223,7 +223,10 @@ are signed through the PKCS#7 message format to enforce some level of
authorization of the policies (prohibiting an attacker from gaining
unconstrained root, and deploying an "allow all" policy). These
policies must be signed by a certificate that chains to the
-``SYSTEM_TRUSTED_KEYRING``. With openssl, the policy can be signed by::
+``SYSTEM_TRUSTED_KEYRING``, or to the secondary and/or platform keyrings if
+``CONFIG_IPE_POLICY_SIG_SECONDARY_KEYRING`` and/or
+``CONFIG_IPE_POLICY_SIG_PLATFORM_KEYRING`` are enabled, respectively.
+With openssl, the policy can be signed by::
openssl smime -sign \
-in "$MY_POLICY" \
@@ -266,7 +269,7 @@ in the kernel. This file is write-only and accepts a PKCS#7 signed
policy. Two checks will always be performed on this policy: First, the
``policy_names`` must match with the updated version and the existing
version. Second the updated policy must have a policy version greater than
-or equal to the currently-running version. This is to prevent rollback attacks.
+the currently-running version. This is to prevent rollback attacks.
The ``delete`` file is used to remove a policy that is no longer needed.
This file is write-only and accepts a value of ``1`` to delete the policy.
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 69af2173555f..2cb58daf3089 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1599,6 +1599,15 @@ The following nested keys are defined.
pglazyfreed (npn)
Amount of reclaimed lazyfree pages
+ swpin_zero
+ Number of pages swapped into memory and filled with zero, where I/O
+ was optimized out because the page content was detected to be zero
+ during swapout.
+
+ swpout_zero
+ Number of zero-filled pages swapped out with I/O skipped due to the
+ content being detected as zero.
+
zswpin
Number of pages moved in to memory from zswap.
@@ -2945,7 +2954,7 @@ following two functions.
a queue (device) has been associated with the bio and
before submission.
- wbc_account_cgroup_owner(@wbc, @page, @bytes)
+ wbc_account_cgroup_owner(@wbc, @folio, @bytes)
Should be called for each data segment being written out.
While this function doesn't care exactly when it's called
during the writeback session, it's the easiest and most
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 1518343bbe22..d401577b5a6a 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -6688,7 +6688,7 @@
0: no polling (default)
thp_anon= [KNL]
- Format: <size>,<size>[KMG]:<state>;<size>-<size>[KMG]:<state>
+ Format: <size>[KMG],<size>[KMG]:<state>;<size>[KMG]-<size>[KMG]:<state>
state is one of "always", "madvise", "never" or "inherit".
Control the default behavior of the system with respect
to anonymous transparent hugepages.
@@ -6727,6 +6727,15 @@
torture.verbose_sleep_duration= [KNL]
Duration of each verbose-printk() sleep in jiffies.
+ tpm.disable_pcr_integrity= [HW,TPM]
+ Do not protect PCR registers from unintended physical
+ access, or interposers in the bus by the means of
+ having an integrity protected session wrapped around
+ TPM2_PCR_Extend command. Consider this in a situation
+ where TPM is heavily utilized by IMA, thus protection
+ causing a major performance hit, and the space where
+ machines are deployed is by other means guarded.
+
tpm_suspend_pcr=[HW,TPM]
Format: integer pcr id
Specify that at suspend time, the tpm driver
diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index cfdd16a52e39..a1bb495eab59 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -303,7 +303,7 @@ control by passing the parameter ``transparent_hugepage=always`` or
kernel command line.
Alternatively, each supported anonymous THP size can be controlled by
-passing ``thp_anon=<size>,<size>[KMG]:<state>;<size>-<size>[KMG]:<state>``,
+passing ``thp_anon=<size>[KMG],<size>[KMG]:<state>;<size>[KMG]-<size>[KMG]:<state>``,
where ``<size>`` is the THP size (must be a power of 2 of PAGE_SIZE and
supported anonymous THP) and ``<state>`` is one of ``always``, ``madvise``,
``never`` or ``inherit``.
diff --git a/Documentation/admin-guide/pm/cpufreq.rst b/Documentation/admin-guide/pm/cpufreq.rst
index fe1be4ad88cb..a21369eba034 100644
--- a/Documentation/admin-guide/pm/cpufreq.rst
+++ b/Documentation/admin-guide/pm/cpufreq.rst
@@ -425,8 +425,8 @@ This governor exposes only one tunable:
``rate_limit_us``
Minimum time (in microseconds) that has to pass between two consecutive
- runs of governor computations (default: 1000 times the scaling driver's
- transition latency).
+ runs of governor computations (default: 1.5 times the scaling driver's
+ transition latency or the maximum 2ms).
The purpose of this tunable is to reduce the scheduler context overhead
of the governor which might be excessive without it.
@@ -474,17 +474,17 @@ This governor exposes the following tunables:
This is how often the governor's worker routine should run, in
microseconds.
- Typically, it is set to values of the order of 10000 (10 ms). Its
- default value is equal to the value of ``cpuinfo_transition_latency``
- for each policy this governor is attached to (but since the unit here
- is greater by 1000, this means that the time represented by
- ``sampling_rate`` is 1000 times greater than the transition latency by
- default).
+ Typically, it is set to values of the order of 2000 (2 ms). Its
+ default value is to add a 50% breathing room
+ to ``cpuinfo_transition_latency`` on each policy this governor is
+ attached to. The minimum is typically the length of two scheduler
+ ticks.
If this tunable is per-policy, the following shell command sets the time
- represented by it to be 750 times as high as the transition latency::
+ represented by it to be 1.5 times as high as the transition latency
+ (the default)::
- # echo `$(($(cat cpuinfo_transition_latency) * 750 / 1000)) > ondemand/sampling_rate
+ # echo `$(($(cat cpuinfo_transition_latency) * 3 / 2)) > ondemand/sampling_rate
``up_threshold``
If the estimated CPU load is above this value (in percent), the governor
diff --git a/Documentation/admin-guide/sysctl/fs.rst b/Documentation/admin-guide/sysctl/fs.rst
index 47499a1742bd..30c61474dec5 100644
--- a/Documentation/admin-guide/sysctl/fs.rst
+++ b/Documentation/admin-guide/sysctl/fs.rst
@@ -38,6 +38,11 @@ requests. ``aio-max-nr`` allows you to change the maximum value
``aio-max-nr`` does not result in the
pre-allocation or re-sizing of any kernel data structures.
+dentry-negative
+----------------------------
+
+Policy for negative dentries. Set to 1 to to always delete the dentry when a
+file is removed, and 0 to disable it. By default, this behavior is disabled.
dentry-state
------------
diff --git a/Documentation/core-api/protection-keys.rst b/Documentation/core-api/protection-keys.rst
index bf28ac0401f3..7eb7c6023e09 100644
--- a/Documentation/core-api/protection-keys.rst
+++ b/Documentation/core-api/protection-keys.rst
@@ -12,7 +12,10 @@ Pkeys Userspace (PKU) is a feature which can be found on:
* Intel server CPUs, Skylake and later
* Intel client CPUs, Tiger Lake (11th Gen Core) and later
* Future AMD CPUs
+ * arm64 CPUs implementing the Permission Overlay Extension (FEAT_S1POE)
+x86_64
+======
Pkeys work by dedicating 4 previously Reserved bits in each page table entry to
a "protection key", giving 16 possible keys.
@@ -28,6 +31,22 @@ register. The feature is only available in 64-bit mode, even though there is
theoretically space in the PAE PTEs. These permissions are enforced on data
access only and have no effect on instruction fetches.
+arm64
+=====
+
+Pkeys use 3 bits in each page table entry, to encode a "protection key index",
+giving 8 possible keys.
+
+Protections for each key are defined with a per-CPU user-writable system
+register (POR_EL0). This is a 64-bit register encoding read, write and execute
+overlay permissions for each protection key index.
+
+Being a CPU register, POR_EL0 is inherently thread-local, potentially giving
+each thread a different set of protections from every other thread.
+
+Unlike x86_64, the protection key permissions also apply to instruction
+fetches.
+
Syscalls
========
@@ -38,11 +57,10 @@ There are 3 system calls which directly interact with pkeys::
int pkey_mprotect(unsigned long start, size_t len,
unsigned long prot, int pkey);
-Before a pkey can be used, it must first be allocated with
-pkey_alloc(). An application calls the WRPKRU instruction
-directly in order to change access permissions to memory covered
-with a key. In this example WRPKRU is wrapped by a C function
-called pkey_set().
+Before a pkey can be used, it must first be allocated with pkey_alloc(). An
+application writes to the architecture specific CPU register directly in order
+to change access permissions to memory covered with a key. In this example
+this is wrapped by a C function called pkey_set().
::
int real_prot = PROT_READ|PROT_WRITE;
@@ -64,9 +82,9 @@ is no longer in use::
munmap(ptr, PAGE_SIZE);
pkey_free(pkey);
-.. note:: pkey_set() is a wrapper for the RDPKRU and WRPKRU instructions.
- An example implementation can be found in
- tools/testing/selftests/x86/protection_keys.c.
+.. note:: pkey_set() is a wrapper around writing to the CPU register.
+ Example implementations can be found in
+ tools/testing/selftests/mm/pkey-{arm64,powerpc,x86}.h
Behavior
========
@@ -96,3 +114,7 @@ with a read()::
The kernel will send a SIGSEGV in both cases, but si_code will be set
to SEGV_PKERR when violating protection keys versus SEGV_ACCERR when
the plain mprotect() permissions are violated.
+
+Note that kernel accesses from a kthread (such as io_uring) will use a default
+value for the protection key register and so will not be consistent with
+userspace's value of the register or mprotect().
diff --git a/Documentation/devicetree/bindings/display/elgin,jg10309-01.yaml b/Documentation/devicetree/bindings/display/elgin,jg10309-01.yaml
new file mode 100644
index 000000000000..faca0cb3f154
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/elgin,jg10309-01.yaml
@@ -0,0 +1,54 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/elgin,jg10309-01.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Elgin JG10309-01 SPI-controlled display
+
+maintainers:
+ - Fabio Estevam <festevam@gmail.com>
+
+description: |
+ The Elgin JG10309-01 SPI-controlled display is used on the RV1108-Elgin-r1
+ board and is a custom display.
+
+allOf:
+ - $ref: /schemas/spi/spi-peripheral-props.yaml#
+
+properties:
+ compatible:
+ const: elgin,jg10309-01
+
+ reg:
+ maxItems: 1
+
+ spi-max-frequency:
+ maximum: 24000000
+
+ spi-cpha: true
+
+ spi-cpol: true
+
+required:
+ - compatible
+ - reg
+ - spi-cpha
+ - spi-cpol
+
+additionalProperties: false
+
+examples:
+ - |
+ spi {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ display@0 {
+ compatible = "elgin,jg10309-01";
+ reg = <0>;
+ spi-max-frequency = <24000000>;
+ spi-cpha;
+ spi-cpol;
+ };
+ };
diff --git a/Documentation/devicetree/bindings/display/mediatek/mediatek,dpi.yaml b/Documentation/devicetree/bindings/display/mediatek/mediatek,dpi.yaml
index 3a82aec9021c..497c0eb4ed0b 100644
--- a/Documentation/devicetree/bindings/display/mediatek/mediatek,dpi.yaml
+++ b/Documentation/devicetree/bindings/display/mediatek/mediatek,dpi.yaml
@@ -63,6 +63,16 @@ properties:
- const: sleep
power-domains:
+ description: |
+ The MediaTek DPI module is typically associated with one of the
+ following multimedia power domains:
+ POWER_DOMAIN_DISPLAY
+ POWER_DOMAIN_VDOSYS
+ POWER_DOMAIN_MM
+ The specific power domain used varies depending on the SoC design.
+
+ It is recommended to explicitly add the appropriate power domain
+ property to the DPI node in the device tree.
maxItems: 1
port:
@@ -79,20 +89,6 @@ required:
- clock-names
- port
-allOf:
- - if:
- not:
- properties:
- compatible:
- contains:
- enum:
- - mediatek,mt6795-dpi
- - mediatek,mt8173-dpi
- - mediatek,mt8186-dpi
- then:
- properties:
- power-domains: false
-
additionalProperties: false
examples:
diff --git a/Documentation/devicetree/bindings/display/mediatek/mediatek,split.yaml b/Documentation/devicetree/bindings/display/mediatek/mediatek,split.yaml
index e4affc854f3d..4b6ff546757e 100644
--- a/Documentation/devicetree/bindings/display/mediatek/mediatek,split.yaml
+++ b/Documentation/devicetree/bindings/display/mediatek/mediatek,split.yaml
@@ -38,6 +38,7 @@ properties:
description: A phandle and PM domain specifier as defined by bindings of
the power controller specified by phandle. See
Documentation/devicetree/bindings/power/power-domain.yaml for details.
+ maxItems: 1
mediatek,gce-client-reg:
description:
@@ -57,6 +58,9 @@ properties:
clocks:
items:
- description: SPLIT Clock
+ - description: Used for interfacing with the HDMI RX signal source.
+ - description: Paired with receiving HDMI RX metadata.
+ minItems: 1
required:
- compatible
@@ -72,9 +76,24 @@ allOf:
const: mediatek,mt8195-mdp3-split
then:
+ properties:
+ clocks:
+ minItems: 3
+
required:
- mediatek,gce-client-reg
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: mediatek,mt8173-disp-split
+
+ then:
+ properties:
+ clocks:
+ maxItems: 1
+
additionalProperties: false
examples:
diff --git a/Documentation/devicetree/bindings/firmware/arm,scmi.yaml b/Documentation/devicetree/bindings/firmware/arm,scmi.yaml
index 54d7d11bfed4..ff7a6f12cd00 100644
--- a/Documentation/devicetree/bindings/firmware/arm,scmi.yaml
+++ b/Documentation/devicetree/bindings/firmware/arm,scmi.yaml
@@ -124,7 +124,7 @@ properties:
atomic mode of operation, even if requested.
default: 0
- max-rx-timeout-ms:
+ arm,max-rx-timeout-ms:
description:
An optional time value, expressed in milliseconds, representing the
transport maximum timeout value for the receive channel. The value should
diff --git a/Documentation/devicetree/bindings/iio/adc/adi,ad7380.yaml b/Documentation/devicetree/bindings/iio/adc/adi,ad7380.yaml
index bd19abb867d9..0065d6508824 100644
--- a/Documentation/devicetree/bindings/iio/adc/adi,ad7380.yaml
+++ b/Documentation/devicetree/bindings/iio/adc/adi,ad7380.yaml
@@ -67,6 +67,10 @@ properties:
A 2.5V to 3.3V supply for the external reference voltage. When omitted,
the internal 2.5V reference is used.
+ refin-supply:
+ description:
+ A 2.5V to 3.3V supply for external reference voltage, for ad7380-4 only.
+
aina-supply:
description:
The common mode voltage supply for the AINA- pin on pseudo-differential
@@ -135,6 +139,23 @@ allOf:
ainc-supply: false
aind-supply: false
+ # ad7380-4 uses refin-supply as external reference.
+ # All other chips from ad738x family use refio as optional external reference.
+ # When refio-supply is omitted, internal reference is used.
+ - if:
+ properties:
+ compatible:
+ enum:
+ - adi,ad7380-4
+ then:
+ properties:
+ refio-supply: false
+ required:
+ - refin-supply
+ else:
+ properties:
+ refin-supply: false
+
examples:
- |
#include <dt-bindings/interrupt-controller/irq.h>
diff --git a/Documentation/devicetree/bindings/iio/dac/adi,ad5686.yaml b/Documentation/devicetree/bindings/iio/dac/adi,ad5686.yaml
index b4400c52bec3..713f535bb33a 100644
--- a/Documentation/devicetree/bindings/iio/dac/adi,ad5686.yaml
+++ b/Documentation/devicetree/bindings/iio/dac/adi,ad5686.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/iio/dac/adi,ad5686.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Analog Devices AD5360 and similar DACs
+title: Analog Devices AD5360 and similar SPI DACs
maintainers:
- Michael Hennerich <michael.hennerich@analog.com>
@@ -12,41 +12,22 @@ maintainers:
properties:
compatible:
- oneOf:
- - description: SPI devices
- enum:
- - adi,ad5310r
- - adi,ad5672r
- - adi,ad5674r
- - adi,ad5676
- - adi,ad5676r
- - adi,ad5679r
- - adi,ad5681r
- - adi,ad5682r
- - adi,ad5683
- - adi,ad5683r
- - adi,ad5684
- - adi,ad5684r
- - adi,ad5685r
- - adi,ad5686
- - adi,ad5686r
- - description: I2C devices
- enum:
- - adi,ad5311r
- - adi,ad5337r
- - adi,ad5338r
- - adi,ad5671r
- - adi,ad5675r
- - adi,ad5691r
- - adi,ad5692r
- - adi,ad5693
- - adi,ad5693r
- - adi,ad5694
- - adi,ad5694r
- - adi,ad5695r
- - adi,ad5696
- - adi,ad5696r
-
+ enum:
+ - adi,ad5310r
+ - adi,ad5672r
+ - adi,ad5674r
+ - adi,ad5676
+ - adi,ad5676r
+ - adi,ad5679r
+ - adi,ad5681r
+ - adi,ad5682r
+ - adi,ad5683
+ - adi,ad5683r
+ - adi,ad5684
+ - adi,ad5684r
+ - adi,ad5685r
+ - adi,ad5686
+ - adi,ad5686r
reg:
maxItems: 1
diff --git a/Documentation/devicetree/bindings/iio/dac/adi,ad5696.yaml b/Documentation/devicetree/bindings/iio/dac/adi,ad5696.yaml
index 56b0cda0f30a..b5a88b03dc2f 100644
--- a/Documentation/devicetree/bindings/iio/dac/adi,ad5696.yaml
+++ b/Documentation/devicetree/bindings/iio/dac/adi,ad5696.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/iio/dac/adi,ad5696.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Analog Devices AD5696 and similar multi-channel DACs
+title: Analog Devices AD5696 and similar I2C multi-channel DACs
maintainers:
- Michael Auchter <michael.auchter@ni.com>
@@ -16,6 +16,7 @@ properties:
compatible:
enum:
- adi,ad5311r
+ - adi,ad5337r
- adi,ad5338r
- adi,ad5671r
- adi,ad5675r
diff --git a/Documentation/devicetree/bindings/interrupt-controller/fsl,ls-extirq.yaml b/Documentation/devicetree/bindings/interrupt-controller/fsl,ls-extirq.yaml
index 199b34fdbefc..7ff4efc4758a 100644
--- a/Documentation/devicetree/bindings/interrupt-controller/fsl,ls-extirq.yaml
+++ b/Documentation/devicetree/bindings/interrupt-controller/fsl,ls-extirq.yaml
@@ -82,9 +82,6 @@ allOf:
enum:
- fsl,ls1043a-extirq
- fsl,ls1046a-extirq
- - fsl,ls1088a-extirq
- - fsl,ls2080a-extirq
- - fsl,lx2160a-extirq
then:
properties:
interrupt-map:
@@ -95,6 +92,29 @@ allOf:
- const: 0xf
- const: 0
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - fsl,ls1088a-extirq
+ - fsl,ls2080a-extirq
+ - fsl,lx2160a-extirq
+# The driver(drivers/irqchip/irq-ls-extirq.c) have not use standard DT
+# function to parser interrupt-map. So it doesn't consider '#address-size'
+# in parent interrupt controller, such as GIC.
+#
+# When dt-binding verify interrupt-map, item data matrix is spitted at
+# incorrect position. Remove interrupt-map restriction because it always
+# wrong.
+
+ then:
+ properties:
+ interrupt-map-mask:
+ items:
+ - const: 0xf
+ - const: 0
+
additionalProperties: false
examples:
diff --git a/Documentation/devicetree/bindings/misc/fsl,qoriq-mc.yaml b/Documentation/devicetree/bindings/misc/fsl,qoriq-mc.yaml
index 01b00d89a992..df45ff56d444 100644
--- a/Documentation/devicetree/bindings/misc/fsl,qoriq-mc.yaml
+++ b/Documentation/devicetree/bindings/misc/fsl,qoriq-mc.yaml
@@ -113,7 +113,7 @@ properties:
msi-parent:
deprecated: true
- $ref: /schemas/types.yaml#/definitions/phandle
+ maxItems: 1
description:
Describes the MSI controller node handling message
interrupts for the MC. When there is no translation
diff --git a/Documentation/devicetree/bindings/net/brcm,unimac-mdio.yaml b/Documentation/devicetree/bindings/net/brcm,unimac-mdio.yaml
index 23dfe0838dca..63bee5b542f5 100644
--- a/Documentation/devicetree/bindings/net/brcm,unimac-mdio.yaml
+++ b/Documentation/devicetree/bindings/net/brcm,unimac-mdio.yaml
@@ -26,6 +26,7 @@ properties:
- brcm,asp-v2.1-mdio
- brcm,asp-v2.2-mdio
- brcm,unimac-mdio
+ - brcm,bcm6846-mdio
reg:
minItems: 1
diff --git a/Documentation/devicetree/bindings/net/xlnx,axi-ethernet.yaml b/Documentation/devicetree/bindings/net/xlnx,axi-ethernet.yaml
index e95c21628281..fb02e579463c 100644
--- a/Documentation/devicetree/bindings/net/xlnx,axi-ethernet.yaml
+++ b/Documentation/devicetree/bindings/net/xlnx,axi-ethernet.yaml
@@ -61,7 +61,7 @@ properties:
- gmii
- rgmii
- sgmii
- - 1000BaseX
+ - 1000base-x
xlnx,phy-type:
description:
diff --git a/Documentation/devicetree/bindings/phy/qcom,sc8280xp-qmp-pcie-phy.yaml b/Documentation/devicetree/bindings/phy/qcom,sc8280xp-qmp-pcie-phy.yaml
index dcf4fa55fbba..380a9222a51d 100644
--- a/Documentation/devicetree/bindings/phy/qcom,sc8280xp-qmp-pcie-phy.yaml
+++ b/Documentation/devicetree/bindings/phy/qcom,sc8280xp-qmp-pcie-phy.yaml
@@ -154,8 +154,6 @@ allOf:
- qcom,sm8550-qmp-gen4x2-pcie-phy
- qcom,sm8650-qmp-gen3x2-pcie-phy
- qcom,sm8650-qmp-gen4x2-pcie-phy
- - qcom,x1e80100-qmp-gen3x2-pcie-phy
- - qcom,x1e80100-qmp-gen4x2-pcie-phy
then:
properties:
clocks:
@@ -171,6 +169,8 @@ allOf:
- qcom,sc8280xp-qmp-gen3x1-pcie-phy
- qcom,sc8280xp-qmp-gen3x2-pcie-phy
- qcom,sc8280xp-qmp-gen3x4-pcie-phy
+ - qcom,x1e80100-qmp-gen3x2-pcie-phy
+ - qcom,x1e80100-qmp-gen4x2-pcie-phy
- qcom,x1e80100-qmp-gen4x4-pcie-phy
then:
properties:
@@ -201,6 +201,7 @@ allOf:
- qcom,sm8550-qmp-gen4x2-pcie-phy
- qcom,sm8650-qmp-gen4x2-pcie-phy
- qcom,x1e80100-qmp-gen4x2-pcie-phy
+ - qcom,x1e80100-qmp-gen4x4-pcie-phy
then:
properties:
resets:
diff --git a/Documentation/devicetree/bindings/sound/davinci-mcasp-audio.yaml b/Documentation/devicetree/bindings/sound/davinci-mcasp-audio.yaml
index ab3206ffa4af..beef193aaaeb 100644
--- a/Documentation/devicetree/bindings/sound/davinci-mcasp-audio.yaml
+++ b/Documentation/devicetree/bindings/sound/davinci-mcasp-audio.yaml
@@ -102,21 +102,21 @@ properties:
default: 2
interrupts:
- oneOf:
- - minItems: 1
- items:
- - description: TX interrupt
- - description: RX interrupt
- - items:
- - description: common/combined interrupt
+ minItems: 1
+ maxItems: 2
interrupt-names:
oneOf:
- - minItems: 1
+ - description: TX interrupt
+ const: tx
+ - description: RX interrupt
+ const: rx
+ - description: TX and RX interrupts
items:
- const: tx
- const: rx
- - const: common
+ - description: Common/combined interrupt
+ const: common
fck_parent:
$ref: /schemas/types.yaml#/definitions/string
diff --git a/Documentation/devicetree/bindings/sound/rockchip,rk3308-codec.yaml b/Documentation/devicetree/bindings/sound/rockchip,rk3308-codec.yaml
index ecf3d7d968c8..2cf229a076f0 100644
--- a/Documentation/devicetree/bindings/sound/rockchip,rk3308-codec.yaml
+++ b/Documentation/devicetree/bindings/sound/rockchip,rk3308-codec.yaml
@@ -48,6 +48,10 @@ properties:
- const: mclk_rx
- const: hclk
+ port:
+ $ref: audio-graph-port.yaml#
+ unevaluatedProperties: false
+
resets:
maxItems: 1
diff --git a/Documentation/devicetree/bindings/trivial-devices.yaml b/Documentation/devicetree/bindings/trivial-devices.yaml
index 0108d7507215..9bf0fb17a05e 100644
--- a/Documentation/devicetree/bindings/trivial-devices.yaml
+++ b/Documentation/devicetree/bindings/trivial-devices.yaml
@@ -101,8 +101,6 @@ properties:
- domintech,dmard09
# DMARD10: 3-axis Accelerometer
- domintech,dmard10
- # Elgin SPI-controlled LCD
- - elgin,jg10309-01
# MMA7660FC: 3-Axis Orientation/Motion Detection Sensor
- fsl,mma7660
# MMA8450Q: Xtrinsic Low-power, 3-axis Xtrinsic Accelerometer
diff --git a/Documentation/filesystems/caching/cachefiles.rst b/Documentation/filesystems/caching/cachefiles.rst
index e04a27bdbe19..b3ccc782cb3b 100644
--- a/Documentation/filesystems/caching/cachefiles.rst
+++ b/Documentation/filesystems/caching/cachefiles.rst
@@ -115,7 +115,7 @@ set up cache ready for use. The following script commands are available:
This mask can also be set through sysfs, eg::
- echo 5 >/sys/modules/cachefiles/parameters/debug
+ echo 5 > /sys/module/cachefiles/parameters/debug
Starting the Cache
diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst
index e8e496d23e1d..44e9e77ffe0d 100644
--- a/Documentation/filesystems/index.rst
+++ b/Documentation/filesystems/index.rst
@@ -29,6 +29,7 @@ algorithms work.
fiemap
files
locks
+ multigrain-ts
mount_api
quota
seq_file
diff --git a/Documentation/filesystems/iomap/operations.rst b/Documentation/filesystems/iomap/operations.rst
index 8e6c721d2330..ef082e5a4e0c 100644
--- a/Documentation/filesystems/iomap/operations.rst
+++ b/Documentation/filesystems/iomap/operations.rst
@@ -208,7 +208,7 @@ The filesystem must arrange to `cancel
such `reservations
<https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@dread.disaster.area/>`_
because writeback will not consume the reservation.
-The ``iomap_file_buffered_write_punch_delalloc`` can be called from a
+The ``iomap_write_delalloc_release`` can be called from a
``->iomap_end`` function to find all the clean areas of the folios
caching a fresh (``IOMAP_F_NEW``) delalloc mapping.
It takes the ``invalidate_lock``.
@@ -513,6 +513,21 @@ IOMAP_WRITE`` with any combination of the following enhancements:
if the mapping is unwritten and the filesystem cannot handle zeroing
the unaligned regions without exposing stale contents.
+ * ``IOMAP_ATOMIC``: This write is being issued with torn-write
+ protection.
+ Only a single bio can be created for the write, and the write must
+ not be split into multiple I/O requests, i.e. flag REQ_ATOMIC must be
+ set.
+ The file range to write must be aligned to satisfy the requirements
+ of both the filesystem and the underlying block device's atomic
+ commit capabilities.
+ If filesystem metadata updates are required (e.g. unwritten extent
+ conversion or copy on write), all updates for the entire file range
+ must be committed atomically as well.
+ Only one space mapping is allowed per untorn write.
+ Untorn writes must be aligned to, and must not be longer than, a
+ single file block.
+
Callers commonly hold ``i_rwsem`` in shared or exclusive mode before
calling this function.
diff --git a/Documentation/filesystems/multigrain-ts.rst b/Documentation/filesystems/multigrain-ts.rst
new file mode 100644
index 000000000000..c779e47284e8
--- /dev/null
+++ b/Documentation/filesystems/multigrain-ts.rst
@@ -0,0 +1,125 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+Multigrain Timestamps
+=====================
+
+Introduction
+============
+Historically, the kernel has always used coarse time values to stamp inodes.
+This value is updated every jiffy, so any change that happens within that jiffy
+will end up with the same timestamp.
+
+When the kernel goes to stamp an inode (due to a read or write), it first gets
+the current time and then compares it to the existing timestamp(s) to see
+whether anything will change. If nothing changed, then it can avoid updating
+the inode's metadata.
+
+Coarse timestamps are therefore good from a performance standpoint, since they
+reduce the need for metadata updates, but bad from the standpoint of
+determining whether anything has changed, since a lot of things can happen in a
+jiffy.
+
+They are particularly troublesome with NFSv3, where unchanging timestamps can
+make it difficult to tell whether to invalidate caches. NFSv4 provides a
+dedicated change attribute that should always show a visible change, but not
+all filesystems implement this properly, causing the NFS server to substitute
+the ctime in many cases.
+
+Multigrain timestamps aim to remedy this by selectively using fine-grained
+timestamps when a file has had its timestamps queried recently, and the current
+coarse-grained time does not cause a change.
+
+Inode Timestamps
+================
+There are currently 3 timestamps in the inode that are updated to the current
+wallclock time on different activity:
+
+ctime:
+ The inode change time. This is stamped with the current time whenever
+ the inode's metadata is changed. Note that this value is not settable
+ from userland.
+
+mtime:
+ The inode modification time. This is stamped with the current time
+ any time a file's contents change.
+
+atime:
+ The inode access time. This is stamped whenever an inode's contents are
+ read. Widely considered to be a terrible mistake. Usually avoided with
+ options like noatime or relatime.
+
+Updating the mtime always implies a change to the ctime, but updating the
+atime due to a read request does not.
+
+Multigrain timestamps are only tracked for the ctime and the mtime. atimes are
+not affected and always use the coarse-grained value (subject to the floor).
+
+Inode Timestamp Ordering
+========================
+
+In addition to just providing info about changes to individual files, file
+timestamps also serve an important purpose in applications like "make". These
+programs measure timestamps in order to determine whether source files might be
+newer than cached objects.
+
+Userland applications like make can only determine ordering based on
+operational boundaries. For a syscall those are the syscall entry and exit
+points. For io_uring or nfsd operations, that's the request submission and
+response. In the case of concurrent operations, userland can make no
+determination about the order in which things will occur.
+
+For instance, if a single thread modifies one file, and then another file in
+sequence, the second file must show an equal or later mtime than the first. The
+same is true if two threads are issuing similar operations that do not overlap
+in time.
+
+If however, two threads have racing syscalls that overlap in time, then there
+is no such guarantee, and the second file may appear to have been modified
+before, after or at the same time as the first, regardless of which one was
+submitted first.
+
+Note that the above assumes that the system doesn't experience a backward jump
+of the realtime clock. If that occurs at an inopportune time, then timestamps
+can appear to go backward, even on a properly functioning system.
+
+Multigrain Timestamp Implementation
+===================================
+Multigrain timestamps are aimed at ensuring that changes to a single file are
+always recognizable, without violating the ordering guarantees when multiple
+different files are modified. This affects the mtime and the ctime, but the
+atime will always use coarse-grained timestamps.
+
+It uses an unused bit in the i_ctime_nsec field to indicate whether the mtime
+or ctime has been queried. If either or both have, then the kernel takes
+special care to ensure the next timestamp update will display a visible change.
+This ensures tight cache coherency for use-cases like NFS, without sacrificing
+the benefits of reduced metadata updates when files aren't being watched.
+
+The Ctime Floor Value
+=====================
+It's not sufficient to simply use fine or coarse-grained timestamps based on
+whether the mtime or ctime has been queried. A file could get a fine grained
+timestamp, and then a second file modified later could get a coarse-grained one
+that appears earlier than the first, which would break the kernel's timestamp
+ordering guarantees.
+
+To mitigate this problem, maintain a global floor value that ensures that
+this can't happen. The two files in the above example may appear to have been
+modified at the same time in such a case, but they will never show the reverse
+order. To avoid problems with realtime clock jumps, the floor is managed as a
+monotonic ktime_t, and the values are converted to realtime clock values as
+needed.
+
+Implementation Notes
+====================
+Multigrain timestamps are intended for use by local filesystems that get
+ctime values from the local clock. This is in contrast to network filesystems
+and the like that just mirror timestamp values from a server.
+
+For most filesystems, it's sufficient to just set the FS_MGTIME flag in the
+fstype->fs_flags in order to opt-in, providing the ctime is only ever set via
+inode_set_ctime_current(). If the filesystem has a ->getattr routine that
+doesn't call generic_fillattr, then it should call fill_mg_cmtime() to
+fill those values. For setattr, it should use setattr_copy() to update the
+timestamps, or otherwise mimic its behavior.
diff --git a/Documentation/filesystems/netfs_library.rst b/Documentation/filesystems/netfs_library.rst
index f0d2cb257bb8..73f0bfd7e903 100644
--- a/Documentation/filesystems/netfs_library.rst
+++ b/Documentation/filesystems/netfs_library.rst
@@ -592,4 +592,3 @@ API Function Reference
.. kernel-doc:: include/linux/netfs.h
.. kernel-doc:: fs/netfs/buffered_read.c
-.. kernel-doc:: fs/netfs/io.c
diff --git a/Documentation/filesystems/nfs/exporting.rst b/Documentation/filesystems/nfs/exporting.rst
index f04ce1215a03..de64d2d002a2 100644
--- a/Documentation/filesystems/nfs/exporting.rst
+++ b/Documentation/filesystems/nfs/exporting.rst
@@ -238,10 +238,3 @@ following flags are defined:
all of an inode's dirty data on last close. Exports that behave this
way should set EXPORT_OP_FLUSH_ON_CLOSE so that NFSD knows to skip
waiting for writeback when closing such files.
-
- EXPORT_OP_ASYNC_LOCK - Indicates a capable filesystem to do async lock
- requests from lockd. Only set EXPORT_OP_ASYNC_LOCK if the filesystem has
- it's own ->lock() functionality as core posix_lock_file() implementation
- has no async lock request handling yet. For more information about how to
- indicate an async lock request from a ->lock() file_operations struct, see
- fs/locks.c and comment for the function vfs_lock_file().
diff --git a/Documentation/filesystems/overlayfs.rst b/Documentation/filesystems/overlayfs.rst
index 343644712340..4c8387e1c880 100644
--- a/Documentation/filesystems/overlayfs.rst
+++ b/Documentation/filesystems/overlayfs.rst
@@ -440,6 +440,23 @@ For example::
fsconfig(fs_fd, FSCONFIG_SET_STRING, "datadir+", "/do2", 0);
+Specifying layers via file descriptors
+--------------------------------------
+
+Since kernel v6.13, overlayfs supports specifying layers via file descriptors in
+addition to specifying them as paths. This feature is available for the
+"datadir+", "lowerdir+", "upperdir", and "workdir+" mount options with the
+fsconfig syscall from the new mount api::
+
+ fsconfig(fs_fd, FSCONFIG_SET_FD, "lowerdir+", NULL, fd_lower1);
+ fsconfig(fs_fd, FSCONFIG_SET_FD, "lowerdir+", NULL, fd_lower2);
+ fsconfig(fs_fd, FSCONFIG_SET_FD, "lowerdir+", NULL, fd_lower3);
+ fsconfig(fs_fd, FSCONFIG_SET_FD, "datadir+", NULL, fd_data1);
+ fsconfig(fs_fd, FSCONFIG_SET_FD, "datadir+", NULL, fd_data2);
+ fsconfig(fs_fd, FSCONFIG_SET_FD, "workdir", NULL, fd_work);
+ fsconfig(fs_fd, FSCONFIG_SET_FD, "upperdir", NULL, fd_upper);
+
+
fs-verity support
-----------------
diff --git a/Documentation/filesystems/tmpfs.rst b/Documentation/filesystems/tmpfs.rst
index 56a26c843dbe..d677e0428c3f 100644
--- a/Documentation/filesystems/tmpfs.rst
+++ b/Documentation/filesystems/tmpfs.rst
@@ -241,6 +241,28 @@ So 'mount -t tmpfs -o size=10G,nr_inodes=10k,mode=700 tmpfs /mytmpfs'
will give you tmpfs instance on /mytmpfs which can allocate 10GB
RAM/SWAP in 10240 inodes and it is only accessible by root.
+tmpfs has the following mounting options for case-insensitive lookup support:
+
+================= ==============================================================
+casefold Enable casefold support at this mount point using the given
+ argument as the encoding standard. Currently only UTF-8
+ encodings are supported. If no argument is used, it will load
+ the latest UTF-8 encoding available.
+strict_encoding Enable strict encoding at this mount point (disabled by
+ default). In this mode, the filesystem refuses to create file
+ and directory with names containing invalid UTF-8 characters.
+================= ==============================================================
+
+This option doesn't render the entire filesystem case-insensitive. One needs to
+still set the casefold flag per directory, by flipping the +F attribute in an
+empty directory. Nevertheless, new directories will inherit the attribute. The
+mountpoint itself cannot be made case-insensitive.
+
+Example::
+
+ $ mount -t tmpfs -o casefold=utf8-12.1.0,strict_encoding fs_name /mytmpfs
+ $ mount -t tmpfs -o casefold fs_name /mytmpfs
+
:Author:
Christoph Rohland <cr@sap.com>, 1.12.01
@@ -250,3 +272,5 @@ RAM/SWAP in 10240 inodes and it is only accessible by root.
KOSAKI Motohiro, 16 Mar 2010
:Updated:
Chris Down, 13 July 2020
+:Updated:
+ André Almeida, 23 Aug 2024
diff --git a/Documentation/iio/ad7380.rst b/Documentation/iio/ad7380.rst
index 9c784c1e652e..6f70b49b9ef2 100644
--- a/Documentation/iio/ad7380.rst
+++ b/Documentation/iio/ad7380.rst
@@ -41,13 +41,22 @@ supports only 1 SDO line.
Reference voltage
-----------------
-2 possible reference voltage sources are supported:
+ad7380-4
+~~~~~~~~
+
+ad7380-4 supports only an external reference voltage (2.5V to 3.3V). It must be
+declared in the device tree as ``refin-supply``.
+
+All other devices from ad738x family
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+All other devices from ad738x support 2 possible reference voltage sources:
- Internal reference (2.5V)
- External reference (2.5V to 3.3V)
The source is determined by the device tree. If ``refio-supply`` is present,
-then the external reference is used, else the internal reference is used.
+then it is used as external reference, else the internal reference is used.
Oversampling and resolution boost
---------------------------------
diff --git a/Documentation/mm/damon/maintainer-profile.rst b/Documentation/mm/damon/maintainer-profile.rst
index 2365c9a3c1f0..ce3e98458339 100644
--- a/Documentation/mm/damon/maintainer-profile.rst
+++ b/Documentation/mm/damon/maintainer-profile.rst
@@ -7,26 +7,26 @@ The DAMON subsystem covers the files that are listed in 'DATA ACCESS MONITOR'
section of 'MAINTAINERS' file.
The mailing lists for the subsystem are damon@lists.linux.dev and
-linux-mm@kvack.org. Patches should be made against the mm-unstable `tree
-<https://git.kernel.org/akpm/mm/h/mm-unstable>` whenever possible and posted to
-the mailing lists.
+linux-mm@kvack.org. Patches should be made against the `mm-unstable tree
+<https://git.kernel.org/akpm/mm/h/mm-unstable>`_ whenever possible and posted
+to the mailing lists.
SCM Trees
---------
There are multiple Linux trees for DAMON development. Patches under
development or testing are queued in `damon/next
-<https://git.kernel.org/sj/h/damon/next>` by the DAMON maintainer.
+<https://git.kernel.org/sj/h/damon/next>`_ by the DAMON maintainer.
Sufficiently reviewed patches will be queued in `mm-unstable
-<https://git.kernel.org/akpm/mm/h/mm-unstable>` by the memory management
+<https://git.kernel.org/akpm/mm/h/mm-unstable>`_ by the memory management
subsystem maintainer. After more sufficient tests, the patches will be queued
-in `mm-stable <https://git.kernel.org/akpm/mm/h/mm-stable>` , and finally
+in `mm-stable <https://git.kernel.org/akpm/mm/h/mm-stable>`_, and finally
pull-requested to the mainline by the memory management subsystem maintainer.
-Note again the patches for mm-unstable `tree
-<https://git.kernel.org/akpm/mm/h/mm-unstable>` are queued by the memory
+Note again the patches for `mm-unstable tree
+<https://git.kernel.org/akpm/mm/h/mm-unstable>`_ are queued by the memory
management subsystem maintainer. If the patches requires some patches in
-damon/next `tree <https://git.kernel.org/sj/h/damon/next>` which not yet merged
+`damon/next tree <https://git.kernel.org/sj/h/damon/next>`_ which not yet merged
in mm-unstable, please make sure the requirement is clearly specified.
Submit checklist addendum
@@ -37,25 +37,25 @@ When making DAMON changes, you should do below.
- Build changes related outputs including kernel and documents.
- Ensure the builds introduce no new errors or warnings.
- Run and ensure no new failures for DAMON `selftests
- <https://github.com/awslabs/damon-tests/blob/master/corr/run.sh#L49>` and
+ <https://github.com/damonitor/damon-tests/blob/master/corr/run.sh#L49>`_ and
`kunittests
- <https://github.com/awslabs/damon-tests/blob/master/corr/tests/kunit.sh>`.
+ <https://github.com/damonitor/damon-tests/blob/master/corr/tests/kunit.sh>`_.
Further doing below and putting the results will be helpful.
- Run `damon-tests/corr
- <https://github.com/awslabs/damon-tests/tree/master/corr>` for normal
+ <https://github.com/damonitor/damon-tests/tree/master/corr>`_ for normal
changes.
- Run `damon-tests/perf
- <https://github.com/awslabs/damon-tests/tree/master/perf>` for performance
+ <https://github.com/damonitor/damon-tests/tree/master/perf>`_ for performance
changes.
Key cycle dates
---------------
Patches can be sent anytime. Key cycle dates of the `mm-unstable
-<https://git.kernel.org/akpm/mm/h/mm-unstable>` and `mm-stable
-<https://git.kernel.org/akpm/mm/h/mm-stable>` trees depend on the memory
+<https://git.kernel.org/akpm/mm/h/mm-unstable>`_ and `mm-stable
+<https://git.kernel.org/akpm/mm/h/mm-stable>`_ trees depend on the memory
management subsystem maintainer.
Review cadence
@@ -72,13 +72,13 @@ Mailing tool
Like many other Linux kernel subsystems, DAMON uses the mailing lists
(damon@lists.linux.dev and linux-mm@kvack.org) as the major communication
channel. There is a simple tool called `HacKerMaiL
-<https://github.com/damonitor/hackermail>` (``hkml``), which is for people who
+<https://github.com/damonitor/hackermail>`_ (``hkml``), which is for people who
are not very familiar with the mailing lists based communication. The tool
could be particularly helpful for DAMON community members since it is developed
and maintained by DAMON maintainer. The tool is also officially announced to
support DAMON and general Linux kernel development workflow.
-In other words, `hkml <https://github.com/damonitor/hackermail>` is a mailing
+In other words, `hkml <https://github.com/damonitor/hackermail>`_ is a mailing
tool for DAMON community, which DAMON maintainer is committed to support.
Please feel free to try and report issues or feature requests for the tool to
the maintainer.
@@ -98,8 +98,8 @@ slots, and attendees should reserve one of those at least 24 hours before the
time slot, by reaching out to the maintainer.
Schedules and available reservation time slots are available at the Google `doc
-<https://docs.google.com/document/d/1v43Kcj3ly4CYqmAkMaZzLiM2GEnWfgdGbZAH3mi2vpM/edit?usp=sharing>`.
+<https://docs.google.com/document/d/1v43Kcj3ly4CYqmAkMaZzLiM2GEnWfgdGbZAH3mi2vpM/edit?usp=sharing>`_.
There is also a public Google `calendar
-<https://calendar.google.com/calendar/u/0?cid=ZDIwOTA4YTMxNjc2MDQ3NTIyMmUzYTM5ZmQyM2U4NDA0ZGIwZjBiYmJlZGQxNDM0MmY4ZTRjOTE0NjdhZDRiY0Bncm91cC5jYWxlbmRhci5nb29nbGUuY29t>`
+<https://calendar.google.com/calendar/u/0?cid=ZDIwOTA4YTMxNjc2MDQ3NTIyMmUzYTM5ZmQyM2U4NDA0ZGIwZjBiYmJlZGQxNDM0MmY4ZTRjOTE0NjdhZDRiY0Bncm91cC5jYWxlbmRhci5nb29nbGUuY29t>`_
that has the events. Anyone can subscribe it. DAMON maintainer will also
provide periodic reminder to the mailing list (damon@lists.linux.dev).
diff --git a/Documentation/netlink/specs/mptcp_pm.yaml b/Documentation/netlink/specs/mptcp_pm.yaml
index 30d8342cacc8..dc190bf838fe 100644
--- a/Documentation/netlink/specs/mptcp_pm.yaml
+++ b/Documentation/netlink/specs/mptcp_pm.yaml
@@ -293,7 +293,6 @@ operations:
doc: Get endpoint information
attribute-set: attr
dont-validate: [ strict ]
- flags: [ uns-admin-perm ]
do: &get-addr-attrs
request:
attributes:
diff --git a/Documentation/networking/devmem.rst b/Documentation/networking/devmem.rst
index a55bf21f671c..d95363645331 100644
--- a/Documentation/networking/devmem.rst
+++ b/Documentation/networking/devmem.rst
@@ -225,6 +225,15 @@ The user must ensure the tokens are returned to the kernel in a timely manner.
Failure to do so will exhaust the limited dmabuf that is bound to the RX queue
and will lead to packet drops.
+The user must pass no more than 128 tokens, with no more than 1024 total frags
+among the token->token_count across all the tokens. If the user provides more
+than 1024 frags, the kernel will free up to 1024 frags and return early.
+
+The kernel returns the number of actual frags freed. The number of frags freed
+can be less than the tokens provided by the user in case of:
+
+(a) an internal kernel leak bug.
+(b) the user passed more than 1024 frags.
Implementation & Caveats
========================
diff --git a/Documentation/networking/j1939.rst b/Documentation/networking/j1939.rst
index e4bd7aa1f5aa..544bad175aae 100644
--- a/Documentation/networking/j1939.rst
+++ b/Documentation/networking/j1939.rst
@@ -121,7 +121,7 @@ format, the Group Extension is set in the PS-field.
On the other hand, when using PDU1 format, the PS-field contains a so-called
Destination Address, which is _not_ part of the PGN. When communicating a PGN
-from user space to kernel (or vice versa) and PDU2 format is used, the PS-field
+from user space to kernel (or vice versa) and PDU1 format is used, the PS-field
of the PGN shall be set to zero. The Destination Address shall be set
elsewhere.
diff --git a/Documentation/networking/packet_mmap.rst b/Documentation/networking/packet_mmap.rst
index dca15d15feaf..02370786e77b 100644
--- a/Documentation/networking/packet_mmap.rst
+++ b/Documentation/networking/packet_mmap.rst
@@ -16,7 +16,7 @@ ii) transmit network traffic, or any other that needs raw
Howto can be found at:
- https://sites.google.com/site/packetmmap/
+ https://web.archive.org/web/20220404160947/https://sites.google.com/site/packetmmap/
Please send your comments to
- Ulisses Alonso Camaró <uaca@i.hate.spam.alumni.uv.es>
@@ -166,7 +166,8 @@ As capture, each frame contains two parts::
/* bind socket to eth0 */
bind(this->socket, (struct sockaddr *)&my_addr, sizeof(struct sockaddr_ll));
- A complete tutorial is available at: https://sites.google.com/site/packetmmap/
+ A complete tutorial is available at:
+ https://web.archive.org/web/20220404160947/https://sites.google.com/site/packetmmap/
By default, the user should put data at::
diff --git a/Documentation/networking/tcp_ao.rst b/Documentation/networking/tcp_ao.rst
index e96e62d1dab3..d5b6d0df63c3 100644
--- a/Documentation/networking/tcp_ao.rst
+++ b/Documentation/networking/tcp_ao.rst
@@ -9,7 +9,7 @@ segments between trusted peers. It adds a new TCP header option with
a Message Authentication Code (MAC). MACs are produced from the content
of a TCP segment using a hashing function with a password known to both peers.
The intent of TCP-AO is to deprecate TCP-MD5 providing better security,
-key rotation and support for variety of hashing algorithms.
+key rotation and support for a variety of hashing algorithms.
1. Introduction
===============
@@ -164,9 +164,9 @@ A: It should not, no action needs to be performed [7.5.2.e]::
is not available, no action is required (RNextKeyID of a received
segment needs to match the MKT’s SendID).
-Q: How current_key is set and when does it change? It is a user-triggered
-change, or is it by a request from the remote peer? Is it set by the user
-explicitly, or by a matching rule?
+Q: How is current_key set, and when does it change? Is it a user-triggered
+change, or is it triggered by a request from the remote peer? Is it set by the
+user explicitly, or by a matching rule?
A: current_key is set by RNextKeyID [6.1]::
@@ -233,8 +233,8 @@ always have one current_key [3.3]::
Q: Can a non-TCP-AO connection become a TCP-AO-enabled one?
-A: No: for already established non-TCP-AO connection it would be impossible
-to switch using TCP-AO as the traffic key generation requires the initial
+A: No: for an already established non-TCP-AO connection it would be impossible
+to switch to using TCP-AO, as the traffic key generation requires the initial
sequence numbers. Paraphrasing, starting using TCP-AO would require
re-establishing the TCP connection.
@@ -292,7 +292,7 @@ no transparency is really needed and modern BGP daemons already have
Linux provides a set of ``setsockopt()s`` and ``getsockopt()s`` that let
userspace manage TCP-AO on a per-socket basis. In order to add/delete MKTs
-``TCP_AO_ADD_KEY`` and ``TCP_AO_DEL_KEY`` TCP socket options must be used
+``TCP_AO_ADD_KEY`` and ``TCP_AO_DEL_KEY`` TCP socket options must be used.
It is not allowed to add a key on an established non-TCP-AO connection
as well as to remove the last key from TCP-AO connection.
@@ -361,7 +361,7 @@ not implemented.
4. ``setsockopt()`` vs ``accept()`` race
========================================
-In contrast with TCP-MD5 established connection which has just one key,
+In contrast with an established TCP-MD5 connection which has just one key,
TCP-AO connections may have many keys, which means that accepted connections
on a listen socket may have any amount of keys as well. As copying all those
keys on a first properly signed SYN would make the request socket bigger, that
@@ -374,7 +374,7 @@ keys from sockets that were already established, but not yet ``accept()``'ed,
hanging in the accept queue.
The reverse is valid as well: if userspace adds a new key for a peer on
-a listener socket, the established sockets in accept queue won't
+a listener socket, the established sockets in the accept queue won't
have the new keys.
At this moment, the resolution for the two races:
@@ -382,7 +382,7 @@ At this moment, the resolution for the two races:
and ``setsockopt(TCP_AO_DEL_KEY)`` vs ``accept()`` is delegated to userspace.
This means that it's expected that userspace would check the MKTs on the socket
that was returned by ``accept()`` to verify that any key rotation that
-happened on listen socket is reflected on the newly established connection.
+happened on the listen socket is reflected on the newly established connection.
This is a similar "do-nothing" approach to TCP-MD5 from the kernel side and
may be changed later by introducing new flags to ``tcp_ao_add``
diff --git a/Documentation/process/maintainer-netdev.rst b/Documentation/process/maintainer-netdev.rst
index c9edf9e7362d..1ae71e31591c 100644
--- a/Documentation/process/maintainer-netdev.rst
+++ b/Documentation/process/maintainer-netdev.rst
@@ -355,6 +355,8 @@ just do it. As a result, a sequence of smaller series gets merged quicker and
with better review coverage. Re-posting large series also increases the mailing
list traffic.
+.. _rcs:
+
Local variable ordering ("reverse xmas tree", "RCS")
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -391,6 +393,21 @@ APIs and helpers, especially scoped iterators. However, direct use of
``__free()`` within networking core and drivers is discouraged.
Similar guidance applies to declaring variables mid-function.
+Clean-up patches
+~~~~~~~~~~~~~~~~
+
+Netdev discourages patches which perform simple clean-ups, which are not in
+the context of other work. For example:
+
+* Addressing ``checkpatch.pl`` warnings
+* Addressing :ref:`Local variable ordering<rcs>` issues
+* Conversions to device-managed APIs (``devm_`` helpers)
+
+This is because it is felt that the churn that such changes produce comes
+at a greater cost than the value of such clean-ups.
+
+Conversely, spelling and grammar fixes are not discouraged.
+
Resending after review
~~~~~~~~~~~~~~~~~~~~~~
diff --git a/Documentation/process/maintainer-soc.rst b/Documentation/process/maintainer-soc.rst
index 12637530d68f..fe9d8bcfbd2b 100644
--- a/Documentation/process/maintainer-soc.rst
+++ b/Documentation/process/maintainer-soc.rst
@@ -30,10 +30,13 @@ tree as a dedicated branch covering multiple subsystems.
The main SoC tree is housed on git.kernel.org:
https://git.kernel.org/pub/scm/linux/kernel/git/soc/soc.git/
+Maintainers
+-----------
+
Clearly this is quite a wide range of topics, which no one person, or even
small group of people are capable of maintaining. Instead, the SoC subsystem
-is comprised of many submaintainers, each taking care of individual platforms
-and driver subdirectories.
+is comprised of many submaintainers (platform maintainers), each taking care of
+individual platforms and driver subdirectories.
In this regard, "platform" usually refers to a series of SoCs from a given
vendor, for example, Nvidia's series of Tegra SoCs. Many submaintainers operate
on a vendor level, responsible for multiple product lines. For several reasons,
@@ -43,14 +46,43 @@ MAINTAINERS file.
Most of these submaintainers have their own trees where they stage patches,
sending pull requests to the main SoC tree. These trees are usually, but not
-always, listed in MAINTAINERS. The main SoC maintainers can be reached via the
-alias soc@kernel.org if there is no platform-specific maintainer, or if they
-are unresponsive.
+always, listed in MAINTAINERS.
What the SoC tree is not, however, is a location for architecture-specific code
changes. Each architecture has its own maintainers that are responsible for
architectural details, CPU errata and the like.
+Submitting Patches for Given SoC
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+All typical platform related patches should be sent via SoC submaintainers
+(platform-specific maintainers). This includes also changes to per-platform or
+shared defconfigs (scripts/get_maintainer.pl might not provide correct
+addresses in such case).
+
+Submitting Patches to the Main SoC Maintainers
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The main SoC maintainers can be reached via the alias soc@kernel.org only in
+following cases:
+
+1. There are no platform-specific maintainers.
+
+2. Platform-specific maintainers are unresponsive.
+
+3. Introducing a completely new SoC platform. Such new SoC work should be sent
+ first to common mailing lists, pointed out by scripts/get_maintainer.pl, for
+ community review. After positive community review, work should be sent to
+ soc@kernel.org in one patchset containing new arch/foo/Kconfig entry, DTS
+ files, MAINTAINERS file entry and optionally initial drivers with their
+ Devicetree bindings. The MAINTAINERS file entry should list new
+ platform-specific maintainers, who are going to be responsible for handling
+ patches for the platform from now on.
+
+Note that the soc@kernel.org is usually not the place to discuss the patches,
+thus work sent to this address should be already considered as acceptable by
+the community.
+
Information for (new) Submaintainers
------------------------------------
diff --git a/Documentation/rust/arch-support.rst b/Documentation/rust/arch-support.rst
index 750ff371570a..54be7ddf3e57 100644
--- a/Documentation/rust/arch-support.rst
+++ b/Documentation/rust/arch-support.rst
@@ -17,7 +17,7 @@ Architecture Level of support Constraints
============= ================ ==============================================
``arm64`` Maintained Little Endian only.
``loongarch`` Maintained \-
-``riscv`` Maintained ``riscv64`` only.
+``riscv`` Maintained ``riscv64`` and LLVM/Clang only.
``um`` Maintained \-
``x86`` Maintained ``x86_64`` only.
============= ================ ==============================================
diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst
index 6c0d70e2e27d..7b59bbd2e564 100644
--- a/Documentation/scheduler/sched-ext.rst
+++ b/Documentation/scheduler/sched-ext.rst
@@ -66,7 +66,7 @@ BPF scheduler and reverts all tasks back to CFS.
.. code-block:: none
# make -j16 -C tools/sched_ext
- # tools/sched_ext/scx_simple
+ # tools/sched_ext/build/bin/scx_simple
local=0 global=3
local=5 global=24
local=9 global=44
diff --git a/Documentation/security/landlock.rst b/Documentation/security/landlock.rst
index 36f26501fd15..59ecdb1c0d4d 100644
--- a/Documentation/security/landlock.rst
+++ b/Documentation/security/landlock.rst
@@ -11,18 +11,18 @@ Landlock LSM: kernel documentation
Landlock's goal is to create scoped access-control (i.e. sandboxing). To
harden a whole system, this feature should be available to any process,
-including unprivileged ones. Because such process may be compromised or
+including unprivileged ones. Because such a process may be compromised or
backdoored (i.e. untrusted), Landlock's features must be safe to use from the
kernel and other processes point of view. Landlock's interface must therefore
expose a minimal attack surface.
Landlock is designed to be usable by unprivileged processes while following the
system security policy enforced by other access control mechanisms (e.g. DAC,
-LSM). Indeed, a Landlock rule shall not interfere with other access-controls
-enforced on the system, only add more restrictions.
+LSM). A Landlock rule shall not interfere with other access-controls enforced
+on the system, only add more restrictions.
Any user can enforce Landlock rulesets on their processes. They are merged and
-evaluated according to the inherited ones in a way that ensures that only more
+evaluated against inherited rulesets in a way that ensures that only more
constraints can be added.
User space documentation can be found here:
@@ -43,7 +43,7 @@ Guiding principles for safe access controls
only impact the processes requesting them.
* Resources (e.g. file descriptors) directly obtained from the kernel by a
sandboxed process shall retain their scoped accesses (at the time of resource
- acquisition) whatever process use them.
+ acquisition) whatever process uses them.
Cf. `File descriptor access rights`_.
Design choices
@@ -71,7 +71,7 @@ the same results, when they are executed under the same Landlock domain.
Taking the ``LANDLOCK_ACCESS_FS_TRUNCATE`` right as an example, it may be
allowed to open a file for writing without being allowed to
:manpage:`ftruncate` the resulting file descriptor if the related file
-hierarchy doesn't grant such access right. The following sequences of
+hierarchy doesn't grant that access right. The following sequences of
operations have the same semantic and should then have the same result:
* ``truncate(path);``
@@ -81,7 +81,7 @@ Similarly to file access modes (e.g. ``O_RDWR``), Landlock access rights
attached to file descriptors are retained even if they are passed between
processes (e.g. through a Unix domain socket). Such access rights will then be
enforced even if the receiving process is not sandboxed by Landlock. Indeed,
-this is required to keep a consistent access control over the whole system, and
+this is required to keep access controls consistent over the whole system, and
this avoids unattended bypasses through file descriptor passing (i.e. confused
deputy attack).
diff --git a/Documentation/userspace-api/landlock.rst b/Documentation/userspace-api/landlock.rst
index c8d3e46badc5..d639c61cb472 100644
--- a/Documentation/userspace-api/landlock.rst
+++ b/Documentation/userspace-api/landlock.rst
@@ -8,13 +8,13 @@ Landlock: unprivileged access control
=====================================
:Author: Mickaël Salaün
-:Date: September 2024
+:Date: October 2024
-The goal of Landlock is to enable to restrict ambient rights (e.g. global
+The goal of Landlock is to enable restriction of ambient rights (e.g. global
filesystem or network access) for a set of processes. Because Landlock
-is a stackable LSM, it makes possible to create safe security sandboxes as new
-security layers in addition to the existing system-wide access-controls. This
-kind of sandbox is expected to help mitigate the security impact of bugs or
+is a stackable LSM, it makes it possible to create safe security sandboxes as
+new security layers in addition to the existing system-wide access-controls.
+This kind of sandbox is expected to help mitigate the security impact of bugs or
unexpected/malicious behaviors in user space applications. Landlock empowers
any process, including unprivileged ones, to securely restrict themselves.
@@ -86,8 +86,8 @@ to be explicit about the denied-by-default access rights.
LANDLOCK_SCOPE_SIGNAL,
};
-Because we may not know on which kernel version an application will be
-executed, it is safer to follow a best-effort security approach. Indeed, we
+Because we may not know which kernel version an application will be executed
+on, it is safer to follow a best-effort security approach. Indeed, we
should try to protect users as much as possible whatever the kernel they are
using.
@@ -129,7 +129,7 @@ version, and only use the available subset of access rights:
LANDLOCK_SCOPE_SIGNAL);
}
-This enables to create an inclusive ruleset that will contain our rules.
+This enables the creation of an inclusive ruleset that will contain our rules.
.. code-block:: c
@@ -219,42 +219,41 @@ If the ``landlock_restrict_self`` system call succeeds, the current thread is
now restricted and this policy will be enforced on all its subsequently created
children as well. Once a thread is landlocked, there is no way to remove its
security policy; only adding more restrictions is allowed. These threads are
-now in a new Landlock domain, merge of their parent one (if any) with the new
-ruleset.
+now in a new Landlock domain, which is a merger of their parent one (if any)
+with the new ruleset.
Full working code can be found in `samples/landlock/sandboxer.c`_.
Good practices
--------------
-It is recommended setting access rights to file hierarchy leaves as much as
+It is recommended to set access rights to file hierarchy leaves as much as
possible. For instance, it is better to be able to have ``~/doc/`` as a
read-only hierarchy and ``~/tmp/`` as a read-write hierarchy, compared to
``~/`` as a read-only hierarchy and ``~/tmp/`` as a read-write hierarchy.
Following this good practice leads to self-sufficient hierarchies that do not
depend on their location (i.e. parent directories). This is particularly
relevant when we want to allow linking or renaming. Indeed, having consistent
-access rights per directory enables to change the location of such directory
+access rights per directory enables changing the location of such directories
without relying on the destination directory access rights (except those that
are required for this operation, see ``LANDLOCK_ACCESS_FS_REFER``
documentation).
Having self-sufficient hierarchies also helps to tighten the required access
rights to the minimal set of data. This also helps avoid sinkhole directories,
-i.e. directories where data can be linked to but not linked from. However,
+i.e. directories where data can be linked to but not linked from. However,
this depends on data organization, which might not be controlled by developers.
In this case, granting read-write access to ``~/tmp/``, instead of write-only
-access, would potentially allow to move ``~/tmp/`` to a non-readable directory
+access, would potentially allow moving ``~/tmp/`` to a non-readable directory
and still keep the ability to list the content of ``~/tmp/``.
Layers of file path access rights
---------------------------------
Each time a thread enforces a ruleset on itself, it updates its Landlock domain
-with a new layer of policy. Indeed, this complementary policy is stacked with
-the potentially other rulesets already restricting this thread. A sandboxed
-thread can then safely add more constraints to itself with a new enforced
-ruleset.
+with a new layer of policy. This complementary policy is stacked with any
+other rulesets potentially already restricting this thread. A sandboxed thread
+can then safely add more constraints to itself with a new enforced ruleset.
One policy layer grants access to a file path if at least one of its rules
encountered on the path grants the access. A sandboxed thread can only access
@@ -265,7 +264,7 @@ etc.).
Bind mounts and OverlayFS
-------------------------
-Landlock enables to restrict access to file hierarchies, which means that these
+Landlock enables restricting access to file hierarchies, which means that these
access rights can be propagated with bind mounts (cf.
Documentation/filesystems/sharedsubtree.rst) but not with
Documentation/filesystems/overlayfs.rst.
@@ -278,21 +277,21 @@ access to multiple file hierarchies at the same time, whether these hierarchies
are the result of bind mounts or not.
An OverlayFS mount point consists of upper and lower layers. These layers are
-combined in a merge directory, result of the mount point. This merge hierarchy
-may include files from the upper and lower layers, but modifications performed
-on the merge hierarchy only reflects on the upper layer. From a Landlock
-policy point of view, each OverlayFS layers and merge hierarchies are
-standalone and contains their own set of files and directories, which is
-different from bind mounts. A policy restricting an OverlayFS layer will not
-restrict the resulted merged hierarchy, and vice versa. Landlock users should
-then only think about file hierarchies they want to allow access to, regardless
-of the underlying filesystem.
+combined in a merge directory, and that merged directory becomes available at
+the mount point. This merge hierarchy may include files from the upper and
+lower layers, but modifications performed on the merge hierarchy only reflect
+on the upper layer. From a Landlock policy point of view, all OverlayFS layers
+and merge hierarchies are standalone and each contains their own set of files
+and directories, which is different from bind mounts. A policy restricting an
+OverlayFS layer will not restrict the resulted merged hierarchy, and vice versa.
+Landlock users should then only think about file hierarchies they want to allow
+access to, regardless of the underlying filesystem.
Inheritance
-----------
Every new thread resulting from a :manpage:`clone(2)` inherits Landlock domain
-restrictions from its parent. This is similar to the seccomp inheritance (cf.
+restrictions from its parent. This is similar to seccomp inheritance (cf.
Documentation/userspace-api/seccomp_filter.rst) or any other LSM dealing with
task's :manpage:`credentials(7)`. For instance, one process's thread may apply
Landlock rules to itself, but they will not be automatically applied to other
@@ -311,8 +310,8 @@ Ptrace restrictions
A sandboxed process has less privileges than a non-sandboxed process and must
then be subject to additional restrictions when manipulating another process.
To be allowed to use :manpage:`ptrace(2)` and related syscalls on a target
-process, a sandboxed process should have a subset of the target process rules,
-which means the tracee must be in a sub-domain of the tracer.
+process, a sandboxed process should have a superset of the target process's
+access rights, which means the tracee must be in a sub-domain of the tracer.
IPC scoping
-----------
@@ -322,7 +321,7 @@ interactions between sandboxes. Each Landlock domain can be explicitly scoped
for a set of actions by specifying it on a ruleset. For example, if a
sandboxed process should not be able to :manpage:`connect(2)` to a
non-sandboxed process through abstract :manpage:`unix(7)` sockets, we can
-specify such restriction with ``LANDLOCK_SCOPE_ABSTRACT_UNIX_SOCKET``.
+specify such a restriction with ``LANDLOCK_SCOPE_ABSTRACT_UNIX_SOCKET``.
Moreover, if a sandboxed process should not be able to send a signal to a
non-sandboxed process, we can specify this restriction with
``LANDLOCK_SCOPE_SIGNAL``.
@@ -394,7 +393,7 @@ Backward and forward compatibility
Landlock is designed to be compatible with past and future versions of the
kernel. This is achieved thanks to the system call attributes and the
associated bitflags, particularly the ruleset's ``handled_access_fs``. Making
-handled access right explicit enables the kernel and user space to have a clear
+handled access rights explicit enables the kernel and user space to have a clear
contract with each other. This is required to make sure sandboxing will not
get stricter with a system update, which could break applications.
@@ -563,33 +562,34 @@ always allowed when using a kernel that only supports the first or second ABI.
Starting with the Landlock ABI version 3, it is now possible to securely control
truncation thanks to the new ``LANDLOCK_ACCESS_FS_TRUNCATE`` access right.
-Network support (ABI < 4)
--------------------------
+TCP bind and connect (ABI < 4)
+------------------------------
Starting with the Landlock ABI version 4, it is now possible to restrict TCP
bind and connect actions to only a set of allowed ports thanks to the new
``LANDLOCK_ACCESS_NET_BIND_TCP`` and ``LANDLOCK_ACCESS_NET_CONNECT_TCP``
access rights.
-IOCTL (ABI < 5)
----------------
+Device IOCTL (ABI < 5)
+----------------------
IOCTL operations could not be denied before the fifth Landlock ABI, so
:manpage:`ioctl(2)` is always allowed when using a kernel that only supports an
earlier ABI.
Starting with the Landlock ABI version 5, it is possible to restrict the use of
-:manpage:`ioctl(2)` using the new ``LANDLOCK_ACCESS_FS_IOCTL_DEV`` right.
+:manpage:`ioctl(2)` on character and block devices using the new
+``LANDLOCK_ACCESS_FS_IOCTL_DEV`` right.
-Abstract UNIX socket scoping (ABI < 6)
---------------------------------------
+Abstract UNIX socket (ABI < 6)
+------------------------------
Starting with the Landlock ABI version 6, it is possible to restrict
connections to an abstract :manpage:`unix(7)` socket by setting
``LANDLOCK_SCOPE_ABSTRACT_UNIX_SOCKET`` to the ``scoped`` ruleset attribute.
-Signal scoping (ABI < 6)
-------------------------
+Signal (ABI < 6)
+----------------
Starting with the Landlock ABI version 6, it is possible to restrict
:manpage:`signal(7)` sending by setting ``LANDLOCK_SCOPE_SIGNAL`` to the
@@ -605,9 +605,9 @@ Build time configuration
Landlock was first introduced in Linux 5.13 but it must be configured at build
time with ``CONFIG_SECURITY_LANDLOCK=y``. Landlock must also be enabled at boot
-time as the other security modules. The list of security modules enabled by
+time like other security modules. The list of security modules enabled by
default is set with ``CONFIG_LSM``. The kernel configuration should then
-contains ``CONFIG_LSM=landlock,[...]`` with ``[...]`` as the list of other
+contain ``CONFIG_LSM=landlock,[...]`` with ``[...]`` as the list of other
potentially useful security modules for the running system (see the
``CONFIG_LSM`` help).
@@ -669,7 +669,7 @@ Questions and answers
What about user space sandbox managers?
---------------------------------------
-Using user space process to enforce restrictions on kernel resources can lead
+Using user space processes to enforce restrictions on kernel resources can lead
to race conditions or inconsistent evaluations (i.e. `Incorrect mirroring of
the OS code and state
<https://www.ndss-symposium.org/ndss2003/traps-and-pitfalls-practical-problems-system-call-interposition-based-security-tools/>`_).
diff --git a/Documentation/userspace-api/mseal.rst b/Documentation/userspace-api/mseal.rst
index 4132eec995a3..41102f74c5e2 100644
--- a/Documentation/userspace-api/mseal.rst
+++ b/Documentation/userspace-api/mseal.rst
@@ -23,177 +23,166 @@ applications can additionally seal security critical data at runtime.
A similar feature already exists in the XNU kernel with the
VM_FLAGS_PERMANENT flag [1] and on OpenBSD with the mimmutable syscall [2].
-User API
-========
-mseal()
------------
-The mseal() syscall has the following signature:
-
-``int mseal(void addr, size_t len, unsigned long flags)``
-
-**addr/len**: virtual memory address range.
-
-The address range set by ``addr``/``len`` must meet:
- - The start address must be in an allocated VMA.
- - The start address must be page aligned.
- - The end address (``addr`` + ``len``) must be in an allocated VMA.
- - no gap (unallocated memory) between start and end address.
-
-The ``len`` will be paged aligned implicitly by the kernel.
-
-**flags**: reserved for future use.
-
-**return values**:
-
-- ``0``: Success.
-
-- ``-EINVAL``:
- - Invalid input ``flags``.
- - The start address (``addr``) is not page aligned.
- - Address range (``addr`` + ``len``) overflow.
-
-- ``-ENOMEM``:
- - The start address (``addr``) is not allocated.
- - The end address (``addr`` + ``len``) is not allocated.
- - A gap (unallocated memory) between start and end address.
-
-- ``-EPERM``:
- - sealing is supported only on 64-bit CPUs, 32-bit is not supported.
-
-- For above error cases, users can expect the given memory range is
- unmodified, i.e. no partial update.
-
-- There might be other internal errors/cases not listed here, e.g.
- error during merging/splitting VMAs, or the process reaching the max
- number of supported VMAs. In those cases, partial updates to the given
- memory range could happen. However, those cases should be rare.
-
-**Blocked operations after sealing**:
- Unmapping, moving to another location, and shrinking the size,
- via munmap() and mremap(), can leave an empty space, therefore
- can be replaced with a VMA with a new set of attributes.
-
- Moving or expanding a different VMA into the current location,
- via mremap().
-
- Modifying a VMA via mmap(MAP_FIXED).
-
- Size expansion, via mremap(), does not appear to pose any
- specific risks to sealed VMAs. It is included anyway because
- the use case is unclear. In any case, users can rely on
- merging to expand a sealed VMA.
-
- mprotect() and pkey_mprotect().
-
- Some destructive madvice() behaviors (e.g. MADV_DONTNEED)
- for anonymous memory, when users don't have write permission to the
- memory. Those behaviors can alter region contents by discarding pages,
- effectively a memset(0) for anonymous memory.
-
- Kernel will return -EPERM for blocked operations.
-
- For blocked operations, one can expect the given address is unmodified,
- i.e. no partial update. Note, this is different from existing mm
- system call behaviors, where partial updates are made till an error is
- found and returned to userspace. To give an example:
-
- Assume following code sequence:
-
- - ptr = mmap(null, 8192, PROT_NONE);
- - munmap(ptr + 4096, 4096);
- - ret1 = mprotect(ptr, 8192, PROT_READ);
- - mseal(ptr, 4096);
- - ret2 = mprotect(ptr, 8192, PROT_NONE);
-
- ret1 will be -ENOMEM, the page from ptr is updated to PROT_READ.
-
- ret2 will be -EPERM, the page remains to be PROT_READ.
-
-**Note**:
-
-- mseal() only works on 64-bit CPUs, not 32-bit CPU.
-
-- users can call mseal() multiple times, mseal() on an already sealed memory
- is a no-action (not error).
-
-- munseal() is not supported.
-
-Use cases:
-==========
+SYSCALL
+=======
+mseal syscall signature
+-----------------------
+ ``int mseal(void \* addr, size_t len, unsigned long flags)``
+
+ **addr**/**len**: virtual memory address range.
+ The address range set by **addr**/**len** must meet:
+ - The start address must be in an allocated VMA.
+ - The start address must be page aligned.
+ - The end address (**addr** + **len**) must be in an allocated VMA.
+ - no gap (unallocated memory) between start and end address.
+
+ The ``len`` will be paged aligned implicitly by the kernel.
+
+ **flags**: reserved for future use.
+
+ **Return values**:
+ - **0**: Success.
+ - **-EINVAL**:
+ * Invalid input ``flags``.
+ * The start address (``addr``) is not page aligned.
+ * Address range (``addr`` + ``len``) overflow.
+ - **-ENOMEM**:
+ * The start address (``addr``) is not allocated.
+ * The end address (``addr`` + ``len``) is not allocated.
+ * A gap (unallocated memory) between start and end address.
+ - **-EPERM**:
+ * sealing is supported only on 64-bit CPUs, 32-bit is not supported.
+
+ **Note about error return**:
+ - For above error cases, users can expect the given memory range is
+ unmodified, i.e. no partial update.
+ - There might be other internal errors/cases not listed here, e.g.
+ error during merging/splitting VMAs, or the process reaching the maximum
+ number of supported VMAs. In those cases, partial updates to the given
+ memory range could happen. However, those cases should be rare.
+
+ **Architecture support**:
+ mseal only works on 64-bit CPUs, not 32-bit CPUs.
+
+ **Idempotent**:
+ users can call mseal multiple times. mseal on an already sealed memory
+ is a no-action (not error).
+
+ **no munseal**
+ Once mapping is sealed, it can't be unsealed. The kernel should never
+ have munseal, this is consistent with other sealing feature, e.g.
+ F_SEAL_SEAL for file.
+
+Blocked mm syscall for sealed mapping
+-------------------------------------
+ It might be important to note: **once the mapping is sealed, it will
+ stay in the process's memory until the process terminates**.
+
+ Example::
+
+ *ptr = mmap(0, 4096, PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE, 0, 0);
+ rc = mseal(ptr, 4096, 0);
+ /* munmap will fail */
+ rc = munmap(ptr, 4096);
+ assert(rc < 0);
+
+ Blocked mm syscall:
+ - munmap
+ - mmap
+ - mremap
+ - mprotect and pkey_mprotect
+ - some destructive madvise behaviors: MADV_DONTNEED, MADV_FREE,
+ MADV_DONTNEED_LOCKED, MADV_FREE, MADV_DONTFORK, MADV_WIPEONFORK
+
+ The first set of syscalls to block is munmap, mremap, mmap. They can
+ either leave an empty space in the address space, therefore allowing
+ replacement with a new mapping with new set of attributes, or can
+ overwrite the existing mapping with another mapping.
+
+ mprotect and pkey_mprotect are blocked because they changes the
+ protection bits (RWX) of the mapping.
+
+ Certain destructive madvise behaviors, specifically MADV_DONTNEED,
+ MADV_FREE, MADV_DONTNEED_LOCKED, and MADV_WIPEONFORK, can introduce
+ risks when applied to anonymous memory by threads lacking write
+ permissions. Consequently, these operations are prohibited under such
+ conditions. The aforementioned behaviors have the potential to modify
+ region contents by discarding pages, effectively performing a memset(0)
+ operation on the anonymous memory.
+
+ Kernel will return -EPERM for blocked syscalls.
+
+ When blocked syscall return -EPERM due to sealing, the memory regions may
+ or may not be changed, depends on the syscall being blocked:
+
+ - munmap: munmap is atomic. If one of VMAs in the given range is
+ sealed, none of VMAs are updated.
+ - mprotect, pkey_mprotect, madvise: partial update might happen, e.g.
+ when mprotect over multiple VMAs, mprotect might update the beginning
+ VMAs before reaching the sealed VMA and return -EPERM.
+ - mmap and mremap: undefined behavior.
+
+Use cases
+=========
- glibc:
The dynamic linker, during loading ELF executables, can apply sealing to
- non-writable memory segments.
-
-- Chrome browser: protect some security sensitive data-structures.
+ mapping segments.
-Notes on which memory to seal:
-==============================
+- Chrome browser: protect some security sensitive data structures.
-It might be important to note that sealing changes the lifetime of a mapping,
-i.e. the sealed mapping won’t be unmapped till the process terminates or the
-exec system call is invoked. Applications can apply sealing to any virtual
-memory region from userspace, but it is crucial to thoroughly analyze the
-mapping's lifetime prior to apply the sealing.
+When not to use mseal
+=====================
+Applications can apply sealing to any virtual memory region from userspace,
+but it is *crucial to thoroughly analyze the mapping's lifetime* prior to
+apply the sealing. This is because the sealed mapping *won’t be unmapped*
+until the process terminates or the exec system call is invoked.
For example:
+ - aio/shm
+ aio/shm can call mmap and munmap on behalf of userspace, e.g.
+ ksys_shmdt() in shm.c. The lifetimes of those mapping are not tied to
+ the lifetime of the process. If those memories are sealed from userspace,
+ then munmap will fail, causing leaks in VMA address space during the
+ lifetime of the process.
+
+ - ptr allocated by malloc (heap)
+ Don't use mseal on the memory ptr return from malloc().
+ malloc() is implemented by allocator, e.g. by glibc. Heap manager might
+ allocate a ptr from brk or mapping created by mmap.
+ If an app calls mseal on a ptr returned from malloc(), this can affect
+ the heap manager's ability to manage the mappings; the outcome is
+ non-deterministic.
+
+ Example::
+
+ ptr = malloc(size);
+ /* don't call mseal on ptr return from malloc. */
+ mseal(ptr, size);
+ /* free will success, allocator can't shrink heap lower than ptr */
+ free(ptr);
+
+mseal doesn't block
+===================
+In a nutshell, mseal blocks certain mm syscall from modifying some of VMA's
+attributes, such as protection bits (RWX). Sealed mappings doesn't mean the
+memory is immutable.
-- aio/shm
-
- aio/shm can call mmap()/munmap() on behalf of userspace, e.g. ksys_shmdt() in
- shm.c. The lifetime of those mapping are not tied to the lifetime of the
- process. If those memories are sealed from userspace, then munmap() will fail,
- causing leaks in VMA address space during the lifetime of the process.
-
-- Brk (heap)
-
- Currently, userspace applications can seal parts of the heap by calling
- malloc() and mseal().
- let's assume following calls from user space:
-
- - ptr = malloc(size);
- - mprotect(ptr, size, RO);
- - mseal(ptr, size);
- - free(ptr);
-
- Technically, before mseal() is added, the user can change the protection of
- the heap by calling mprotect(RO). As long as the user changes the protection
- back to RW before free(), the memory range can be reused.
-
- Adding mseal() into the picture, however, the heap is then sealed partially,
- the user can still free it, but the memory remains to be RO. If the address
- is re-used by the heap manager for another malloc, the process might crash
- soon after. Therefore, it is important not to apply sealing to any memory
- that might get recycled.
-
- Furthermore, even if the application never calls the free() for the ptr,
- the heap manager may invoke the brk system call to shrink the size of the
- heap. In the kernel, the brk-shrink will call munmap(). Consequently,
- depending on the location of the ptr, the outcome of brk-shrink is
- nondeterministic.
-
-
-Additional notes:
-=================
As Jann Horn pointed out in [3], there are still a few ways to write
-to RO memory, which is, in a way, by design. Those cases are not covered
-by mseal(). If applications want to block such cases, sandbox tools (such as
-seccomp, LSM, etc) might be considered.
+to RO memory, which is, in a way, by design. And those could be blocked
+by different security measures.
Those cases are:
-- Write to read-only memory through /proc/self/mem interface.
-- Write to read-only memory through ptrace (such as PTRACE_POKETEXT).
-- userfaultfd.
+ - Write to read-only memory through /proc/self/mem interface (FOLL_FORCE).
+ - Write to read-only memory through ptrace (such as PTRACE_POKETEXT).
+ - userfaultfd.
The idea that inspired this patch comes from Stephen Röttger’s work in V8
CFI [4]. Chrome browser in ChromeOS will be the first user of this API.
-Reference:
-==========
-[1] https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b9f69dbd3c8c3fd30a/osfmk/mach/vm_statistics.h#L274
-
-[2] https://man.openbsd.org/mimmutable.2
-
-[3] https://lore.kernel.org/lkml/CAG48ez3ShUYey+ZAFsU2i1RpQn0a5eOs2hzQ426FkcgnfUGLvA@mail.gmail.com
-
-[4] https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXgeaRHo/edit#heading=h.bvaojj9fu6hc
+Reference
+=========
+- [1] https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b9f69dbd3c8c3fd30a/osfmk/mach/vm_statistics.h#L274
+- [2] https://man.openbsd.org/mimmutable.2
+- [3] https://lore.kernel.org/lkml/CAG48ez3ShUYey+ZAFsU2i1RpQn0a5eOs2hzQ426FkcgnfUGLvA@mail.gmail.com
+- [4] https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXgeaRHo/edit#heading=h.bvaojj9fu6hc
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index e32471977d0a..edc070c6e19b 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -8098,13 +8098,15 @@ KVM_X86_QUIRK_MWAIT_NEVER_UD_FAULTS By default, KVM emulates MONITOR/MWAIT (if
KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT is
disabled.
-KVM_X86_QUIRK_SLOT_ZAP_ALL By default, KVM invalidates all SPTEs in
- fast way for memslot deletion when VM type
- is KVM_X86_DEFAULT_VM.
- When this quirk is disabled or when VM type
- is other than KVM_X86_DEFAULT_VM, KVM zaps
- only leaf SPTEs that are within the range of
- the memslot being deleted.
+KVM_X86_QUIRK_SLOT_ZAP_ALL By default, for KVM_X86_DEFAULT_VM VMs, KVM
+ invalidates all SPTEs in all memslots and
+ address spaces when a memslot is deleted or
+ moved. When this quirk is disabled (or the
+ VM type isn't KVM_X86_DEFAULT_VM), KVM only
+ ensures the backing memory of the deleted
+ or moved memslot isn't reachable, i.e KVM
+ _may_ invalidate only SPTEs related to the
+ memslot.
=================================== ============================================
7.32 KVM_CAP_MAX_VCPU_ID
diff --git a/Documentation/virt/kvm/locking.rst b/Documentation/virt/kvm/locking.rst
index 20a9a37d1cdd..1bedd56e2fe3 100644
--- a/Documentation/virt/kvm/locking.rst
+++ b/Documentation/virt/kvm/locking.rst
@@ -136,7 +136,7 @@ For direct sp, we can easily avoid it since the spte of direct sp is fixed
to gfn. For indirect sp, we disabled fast page fault for simplicity.
A solution for indirect sp could be to pin the gfn, for example via
-kvm_vcpu_gfn_to_pfn_atomic, before the cmpxchg. After the pinning:
+gfn_to_pfn_memslot_atomic, before the cmpxchg. After the pinning:
- We have held the refcount of pfn; that means the pfn can not be freed and
be reused for another gfn.