Diffstat (limited to 'Documentation')
-rw-r--r--Documentation/PCI/pci.rst14
-rw-r--r--Documentation/admin-guide/kernel-parameters.txt6
-rw-r--r--Documentation/core-api/xarray.rst14
-rw-r--r--Documentation/dev-tools/kunit/architecture.rst13
-rw-r--r--Documentation/devicetree/bindings/arm/msm/qcom,idle-state.txt2
-rw-r--r--Documentation/devicetree/bindings/arm/psci.yaml2
-rw-r--r--Documentation/devicetree/bindings/cpu/idle-states.yaml (renamed from Documentation/devicetree/bindings/arm/idle-states.yaml)228
-rw-r--r--Documentation/devicetree/bindings/input/mediatek,mt6779-keypad.yaml77
-rw-r--r--Documentation/devicetree/bindings/input/mtk-pmic-keys.txt5
-rw-r--r--Documentation/devicetree/bindings/input/touchscreen/imagis,ist3038c.yaml74
-rw-r--r--Documentation/devicetree/bindings/riscv/cpus.yaml8
-rw-r--r--Documentation/devicetree/bindings/rtc/allwinner,sun6i-a31-rtc.yaml84
-rw-r--r--Documentation/devicetree/bindings/rtc/atmel,at91sam9-rtc.txt25
-rw-r--r--Documentation/devicetree/bindings/rtc/atmel,at91sam9260-rtt.yaml69
-rw-r--r--Documentation/devicetree/bindings/vendor-prefixes.yaml2
-rw-r--r--Documentation/devicetree/bindings/watchdog/renesas,wdt.yaml5
-rw-r--r--Documentation/filesystems/cifs/ksmbd.rst4
-rw-r--r--Documentation/filesystems/fsverity.rst6
-rw-r--r--Documentation/filesystems/locking.rst6
-rw-r--r--Documentation/filesystems/netfs_library.rst140
-rw-r--r--Documentation/filesystems/vfs.rst11
-rw-r--r--Documentation/kbuild/kbuild.rst11
-rw-r--r--Documentation/kbuild/llvm.rst31
-rw-r--r--Documentation/kbuild/makefiles.rst2
-rw-r--r--Documentation/locking/locktypes.rst3
-rw-r--r--Documentation/maintainer/index.rst1
-rw-r--r--Documentation/maintainer/messy-diffstat.rst96
-rw-r--r--Documentation/riscv/index.rst1
-rw-r--r--Documentation/sphinx/kernel_abi.py6
-rw-r--r--Documentation/sphinx/kernel_feat.py20
-rwxr-xr-xDocumentation/sphinx/kernel_include.py3
-rw-r--r--Documentation/sphinx/kerneldoc.py2
-rw-r--r--Documentation/sphinx/kfigure.py8
-rw-r--r--Documentation/sphinx/requirements.txt2
-rw-r--r--Documentation/trace/user_events.rst14
-rw-r--r--Documentation/virt/kvm/api.rst61
-rw-r--r--Documentation/virt/kvm/index.rst26
-rw-r--r--Documentation/virt/kvm/locking.rst43
-rw-r--r--Documentation/virt/kvm/s390/index.rst12
-rw-r--r--Documentation/virt/kvm/s390/s390-diag.rst (renamed from Documentation/virt/kvm/s390-diag.rst)0
-rw-r--r--Documentation/virt/kvm/s390/s390-pv-boot.rst (renamed from Documentation/virt/kvm/s390-pv-boot.rst)0
-rw-r--r--Documentation/virt/kvm/s390/s390-pv.rst (renamed from Documentation/virt/kvm/s390-pv.rst)0
-rw-r--r--Documentation/virt/kvm/vcpu-requests.rst10
-rw-r--r--Documentation/virt/kvm/x86/amd-memory-encryption.rst (renamed from Documentation/virt/kvm/amd-memory-encryption.rst)0
-rw-r--r--Documentation/virt/kvm/x86/cpuid.rst (renamed from Documentation/virt/kvm/cpuid.rst)0
-rw-r--r--Documentation/virt/kvm/x86/errata.rst39
-rw-r--r--Documentation/virt/kvm/x86/halt-polling.rst (renamed from Documentation/virt/kvm/halt-polling.rst)0
-rw-r--r--Documentation/virt/kvm/x86/hypercalls.rst (renamed from Documentation/virt/kvm/hypercalls.rst)0
-rw-r--r--Documentation/virt/kvm/x86/index.rst19
-rw-r--r--Documentation/virt/kvm/x86/mmu.rst (renamed from Documentation/virt/kvm/mmu.rst)0
-rw-r--r--Documentation/virt/kvm/x86/msr.rst (renamed from Documentation/virt/kvm/msr.rst)0
-rw-r--r--Documentation/virt/kvm/x86/nested-vmx.rst (renamed from Documentation/virt/kvm/nested-vmx.rst)0
-rw-r--r--Documentation/virt/kvm/x86/running-nested-guests.rst (renamed from Documentation/virt/kvm/running-nested-guests.rst)0
-rw-r--r--Documentation/virt/kvm/x86/timekeeping.rst (renamed from Documentation/virt/kvm/timekeeping.rst)0
-rw-r--r--Documentation/virt/uml/user_mode_linux_howto_v2.rst20
-rw-r--r--Documentation/vm/page_owner.rst1
-rw-r--r--Documentation/vm/unevictable-lru.rst471
57 files changed, 1223 insertions, 474 deletions
diff --git a/Documentation/PCI/pci.rst b/Documentation/PCI/pci.rst
index 87c6f4a6ca32..67a850b55617 100644
--- a/Documentation/PCI/pci.rst
+++ b/Documentation/PCI/pci.rst
@@ -278,20 +278,20 @@ appropriate parameters. In general this allows more efficient DMA
on systems where System RAM exists above 4G _physical_ address.
Drivers for all PCI-X and PCIe compliant devices must call
-pci_set_dma_mask() as they are 64-bit DMA devices.
+dma_set_mask() as they are 64-bit DMA devices.
Similarly, drivers must also "register" this capability if the device
-can directly address "consistent memory" in System RAM above 4G physical
-address by calling pci_set_consistent_dma_mask().
+can directly address "coherent memory" in System RAM above 4G physical
+address by calling dma_set_coherent_mask().
Again, this includes drivers for all PCI-X and PCIe compliant devices.
Many 64-bit "PCI" devices (before PCI-X) and some PCI-X devices are
64-bit DMA capable for payload ("streaming") data but not control
-("consistent") data.
+("coherent") data.
Setup shared control data
-------------------------
-Once the DMA masks are set, the driver can allocate "consistent" (a.k.a. shared)
+Once the DMA masks are set, the driver can allocate "coherent" (a.k.a. shared)
memory. See Documentation/core-api/dma-api.rst for a full description of
the DMA APIs. This section is just a reminder that it needs to be done
before enabling DMA on the device.
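As a minimal sketch (assuming a ``pdev`` pointer from the probe routine and a
hypothetical ``CTRL_SIZE``), the setup might look like::

    void *cpu_addr;
    dma_addr_t dma_handle;

    /* Claim 64-bit addressing for both streaming and coherent DMA. */
    if (dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64)))
            return -EIO;

    /* Allocate a "coherent" buffer to hold the shared control data. */
    cpu_addr = dma_alloc_coherent(&pdev->dev, CTRL_SIZE,
                                  &dma_handle, GFP_KERNEL);
    if (!cpu_addr)
            return -ENOMEM;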
@@ -367,7 +367,7 @@ steps need to be performed:
- Disable the device from generating IRQs
- Release the IRQ (free_irq())
- Stop all DMA activity
- - Release DMA buffers (both streaming and consistent)
+ - Release DMA buffers (both streaming and coherent)
- Unregister from other subsystems (e.g. scsi or netdev)
- Disable device from responding to MMIO/IO Port addresses
- Release MMIO/IO Port resource(s)
@@ -420,7 +420,7 @@ Once DMA is stopped, clean up streaming DMA first.
I.e. unmap data buffers and return buffers to "upstream"
owners if there is one.
-Then clean up "consistent" buffers which contain the control data.
+Then clean up "coherent" buffers which contain the control data.
See Documentation/core-api/dma-api.rst for details on unmapping interfaces.
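A teardown sketch in that order, reusing the hypothetical handles from the
setup example above, might be::

    /* Streaming buffers first... */
    dma_unmap_single(&pdev->dev, rx_dma, RX_BUF_SIZE, DMA_FROM_DEVICE);

    /* ...then the "coherent" buffer holding the control data. */
    dma_free_coherent(&pdev->dev, CTRL_SIZE, cpu_addr, dma_handle);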
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index b7ccaa2ea867..3f1cc5e317ed 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4427,6 +4427,12 @@
fully seed the kernel's CRNG. Default is controlled
by CONFIG_RANDOM_TRUST_CPU.
+ random.trust_bootloader={on,off}
+ [KNL] Enable or disable trusting the use of a
+ seed passed by the bootloader (if available) to
+ fully seed the kernel's CRNG. Default is controlled
+ by CONFIG_RANDOM_TRUST_BOOTLOADER.
+
randomize_kstack_offset=
[KNL] Enable or disable kernel stack offset
randomization, which provides roughly 5 bits of
diff --git a/Documentation/core-api/xarray.rst b/Documentation/core-api/xarray.rst
index a137a0e6d068..77e0ece2b1d6 100644
--- a/Documentation/core-api/xarray.rst
+++ b/Documentation/core-api/xarray.rst
@@ -315,11 +315,15 @@ indeed the normal API is implemented in terms of the advanced API. The
advanced API is only available to modules with a GPL-compatible license.
The advanced API is based around the xa_state. This is an opaque data
-structure which you declare on the stack using the XA_STATE()
-macro. This macro initialises the xa_state ready to start walking
-around the XArray. It is used as a cursor to maintain the position
-in the XArray and let you compose various operations together without
-having to restart from the top every time.
+structure which you declare on the stack using the XA_STATE() macro.
+This macro initialises the xa_state ready to start walking around the
+XArray. It is used as a cursor to maintain the position in the XArray
+and let you compose various operations together without having to restart
+from the top every time. The contents of the xa_state are protected by
+the rcu_read_lock() or the xas_lock(). If you need to drop whichever of
+those locks is protecting your state and tree, you must call xas_pause()
+so that future calls do not rely on the parts of the state which were
+left unprotected.
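As an illustration, a walk over a hypothetical ``my_xa`` array might drop and
retake the RCU lock like this (``process()`` stands in for per-entry work)::

    XA_STATE(xas, &my_xa, 0);
    void *entry;

    rcu_read_lock();
    xas_for_each(&xas, entry, ULONG_MAX) {
            process(entry);                 /* hypothetical per-entry work */
            if (need_resched()) {
                    xas_pause(&xas);        /* make the state safe to resume */
                    rcu_read_unlock();
                    cond_resched();
                    rcu_read_lock();
            }
    }
    rcu_read_unlock();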
The xa_state is also used to store errors. You can call
xas_error() to retrieve the error. All operations check whether
diff --git a/Documentation/dev-tools/kunit/architecture.rst b/Documentation/dev-tools/kunit/architecture.rst
index aa2cea821e25..ff9c85a0bff2 100644
--- a/Documentation/dev-tools/kunit/architecture.rst
+++ b/Documentation/dev-tools/kunit/architecture.rst
@@ -26,10 +26,7 @@ The fundamental unit in KUnit is the test case. The KUnit test cases are
grouped into KUnit suites. A KUnit test case is a function with type
signature ``void (*)(struct kunit *test)``.
These test case functions are wrapped in a struct called
-``struct kunit_case``. For code, see:
-
-.. kernel-doc:: include/kunit/test.h
- :identifiers: kunit_case
+struct kunit_case.
.. note:
``generate_params`` is optional for non-parameterized tests.
@@ -152,18 +149,12 @@ Parameterized Tests
Each KUnit parameterized test is associated with a collection of
parameters. The test is invoked multiple times, once for each parameter
value and the parameter is stored in the ``param_value`` field.
-The test case includes a ``KUNIT_CASE_PARAM()`` macro that accepts a
+The test case includes a KUNIT_CASE_PARAM() macro that accepts a
generator function.
The generator function is passed the previous parameter and returns the next
parameter. KUnit also provides a macro to generate common-case generators based
on arrays, as sketched below.
-For code, see:
-
-.. kernel-doc:: include/kunit/test.h
- :identifiers: KUNIT_ARRAY_PARAM
-
-
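As a hedged illustration, a parameterized test built with KUNIT_CASE_PARAM()
and the array-based KUNIT_ARRAY_PARAM() generator helper might look like this
(the test and array names are hypothetical)::

    #include <kunit/test.h>

    static const int example_values[] = { 1, 2, 3 };

    /* Defines example_gen_params() from the array above. */
    KUNIT_ARRAY_PARAM(example, example_values, NULL);

    static void example_param_test(struct kunit *test)
    {
            const int *value = test->param_value;

            KUNIT_EXPECT_GT(test, *value, 0);
    }

    static struct kunit_case example_test_cases[] = {
            KUNIT_CASE_PARAM(example_param_test, example_gen_params),
            {}
    };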
kunit_tool (Command Line Test Harness)
======================================
diff --git a/Documentation/devicetree/bindings/arm/msm/qcom,idle-state.txt b/Documentation/devicetree/bindings/arm/msm/qcom,idle-state.txt
index 6ce0b212ec6d..606b4b1b709d 100644
--- a/Documentation/devicetree/bindings/arm/msm/qcom,idle-state.txt
+++ b/Documentation/devicetree/bindings/arm/msm/qcom,idle-state.txt
@@ -81,4 +81,4 @@ Example:
};
};
-[1]. Documentation/devicetree/bindings/arm/idle-states.yaml
+[1]. Documentation/devicetree/bindings/cpu/idle-states.yaml
diff --git a/Documentation/devicetree/bindings/arm/psci.yaml b/Documentation/devicetree/bindings/arm/psci.yaml
index 8b77cf83a095..dd83ef278af0 100644
--- a/Documentation/devicetree/bindings/arm/psci.yaml
+++ b/Documentation/devicetree/bindings/arm/psci.yaml
@@ -101,7 +101,7 @@ properties:
bindings in [1]) must specify this property.
[1] Kernel documentation - ARM idle states bindings
- Documentation/devicetree/bindings/arm/idle-states.yaml
+ Documentation/devicetree/bindings/cpu/idle-states.yaml
patternProperties:
"^power-domain-":
diff --git a/Documentation/devicetree/bindings/arm/idle-states.yaml b/Documentation/devicetree/bindings/cpu/idle-states.yaml
index 4d381fa1ee57..fa4d4142ac93 100644
--- a/Documentation/devicetree/bindings/arm/idle-states.yaml
+++ b/Documentation/devicetree/bindings/cpu/idle-states.yaml
@@ -1,25 +1,30 @@
# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
%YAML 1.2
---
-$id: http://devicetree.org/schemas/arm/idle-states.yaml#
+$id: http://devicetree.org/schemas/cpu/idle-states.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: ARM idle states binding description
+title: Idle states binding description
maintainers:
- Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
+ - Anup Patel <anup@brainfault.org>
description: |+
==========================================
1 - Introduction
==========================================
- ARM systems contain HW capable of managing power consumption dynamically,
- where cores can be put in different low-power states (ranging from simple wfi
- to power gating) according to OS PM policies. The CPU states representing the
- range of dynamic idle states that a processor can enter at run-time, can be
- specified through device tree bindings representing the parameters required to
- enter/exit specific idle states on a given processor.
+ ARM and RISC-V systems contain HW capable of managing power consumption
+ dynamically, where cores can be put in different low-power states (ranging
+ from simple wfi to power gating) according to OS PM policies. The CPU states
+ representing the range of dynamic idle states that a processor can enter at
+ run-time can be specified through device tree bindings representing the
+ parameters required to enter/exit specific idle states on a given processor.
+
+ ==========================================
+ 2 - ARM idle states
+ ==========================================
According to the Server Base System Architecture document (SBSA, [3]), the
power states an ARM CPU can be put into are identified by the following list:
@@ -43,8 +48,23 @@ description: |+
The device tree binding definition for ARM idle states is the subject of this
document.
+ ==========================================
+ 3 - RISC-V idle states
+ ==========================================
+
+ On RISC-V systems, the HARTs (or CPUs) [6] can be put in platform specific
+ suspend (or idle) states (ranging from simple WFI to power gating). The
+ RISC-V SBI v0.3 (or higher) [7] hart state management extension provides a
+ standard mechanism for the OS to request HART state transitions.
+
+ The platform specific suspend (or idle) states of a hart can be either
+ retentive or non-retentive in nature. A retentive suspend state will
+ preserve HART registers and CSR values for all privilege modes whereas
+ a non-retentive suspend state will not preserve HART registers and CSR
+ values.
+
===========================================
- 2 - idle-states definitions
+ 4 - idle-states definitions
===========================================
Idle states are characterized for a specific system through a set of
@@ -211,10 +231,10 @@ description: |+
properties specification that is the subject of the following sections.
===========================================
- 3 - idle-states node
+ 5 - idle-states node
===========================================
- ARM processor idle states are defined within the idle-states node, which is
+ The processor idle states are defined within the idle-states node, which is
a direct child of the cpus node [1] and provides a container where the
processor idle states, defined as device tree nodes, are listed.
@@ -223,7 +243,7 @@ description: |+
just supports idle_standby, an idle-states node is not required.
===========================================
- 4 - References
+ 6 - References
===========================================
[1] ARM Linux Kernel documentation - CPUs bindings
@@ -238,9 +258,15 @@ description: |+
[4] ARM Architecture Reference Manuals
http://infocenter.arm.com/help/index.jsp
- [6] ARM Linux Kernel documentation - Booting AArch64 Linux
+ [5] ARM Linux Kernel documentation - Booting AArch64 Linux
Documentation/arm64/booting.rst
+ [6] RISC-V Linux Kernel documentation - CPUs bindings
+ Documentation/devicetree/bindings/riscv/cpus.yaml
+
+ [7] RISC-V Supervisor Binary Interface (SBI)
+ http://github.com/riscv/riscv-sbi-doc/riscv-sbi.adoc
+
properties:
$nodename:
const: idle-states
@@ -253,7 +279,7 @@ properties:
On ARM 32-bit systems this property is optional
This assumes that the "enable-method" property is set to "psci" in the cpu
- node[6] that is responsible for setting up CPU idle management in the OS
+ node[5] that is responsible for setting up CPU idle management in the OS
implementation.
const: psci
@@ -265,8 +291,8 @@ patternProperties:
as follows.
The idle state entered by executing the wfi instruction (idle_standby
- SBSA,[3][4]) is considered standard on all ARM platforms and therefore
- must not be listed.
+ SBSA,[3][4]) is considered standard on all ARM and RISC-V platforms and
+ therefore must not be listed.
In addition to the properties listed above, a state node may require
additional properties specific to the entry-method defined in the
@@ -275,7 +301,27 @@ patternProperties:
properties:
compatible:
- const: arm,idle-state
+ enum:
+ - arm,idle-state
+ - riscv,idle-state
+
+ arm,psci-suspend-param:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ description: |
+ power_state parameter to pass to the ARM PSCI suspend call.
+
+ Device tree nodes that require usage of PSCI CPU_SUSPEND function
+ (i.e. idle states node with entry-method property is set to "psci")
+ must specify this property.
+
+ riscv,sbi-suspend-param:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ description: |
+ suspend_type parameter to pass to the RISC-V SBI HSM suspend call.
+
+ This property is required in idle state nodes of device trees meant
+ for RISC-V systems. For more details on the suspend_type parameter,
+ refer to the SBI specification v0.3 (or higher) [7].
local-timer-stop:
description:
@@ -317,6 +363,8 @@ patternProperties:
description:
A string used as a descriptive name for the idle state.
+ additionalProperties: false
+
required:
- compatible
- entry-latency-us
@@ -658,4 +706,150 @@ examples:
};
};
+ - |
+ // Example 3 (RISC-V 64-bit, 4-cpu systems, two clusters):
+
+ cpus {
+ #size-cells = <0>;
+ #address-cells = <1>;
+
+ cpu@0 {
+ device_type = "cpu";
+ compatible = "riscv";
+ reg = <0x0>;
+ riscv,isa = "rv64imafdc";
+ mmu-type = "riscv,sv48";
+ cpu-idle-states = <&CPU_RET_0_0>, <&CPU_NONRET_0_0>,
+ <&CLUSTER_RET_0>, <&CLUSTER_NONRET_0>;
+
+ cpu_intc0: interrupt-controller {
+ #interrupt-cells = <1>;
+ compatible = "riscv,cpu-intc";
+ interrupt-controller;
+ };
+ };
+
+ cpu@1 {
+ device_type = "cpu";
+ compatible = "riscv";
+ reg = <0x1>;
+ riscv,isa = "rv64imafdc";
+ mmu-type = "riscv,sv48";
+ cpu-idle-states = <&CPU_RET_0_0>, <&CPU_NONRET_0_0>,
+ <&CLUSTER_RET_0>, <&CLUSTER_NONRET_0>;
+
+ cpu_intc1: interrupt-controller {
+ #interrupt-cells = <1>;
+ compatible = "riscv,cpu-intc";
+ interrupt-controller;
+ };
+ };
+
+ cpu@10 {
+ device_type = "cpu";
+ compatible = "riscv";
+ reg = <0x10>;
+ riscv,isa = "rv64imafdc";
+ mmu-type = "riscv,sv48";
+ cpu-idle-states = <&CPU_RET_1_0>, <&CPU_NONRET_1_0>,
+ <&CLUSTER_RET_1>, <&CLUSTER_NONRET_1>;
+
+ cpu_intc10: interrupt-controller {
+ #interrupt-cells = <1>;
+ compatible = "riscv,cpu-intc";
+ interrupt-controller;
+ };
+ };
+
+ cpu@11 {
+ device_type = "cpu";
+ compatible = "riscv";
+ reg = <0x11>;
+ riscv,isa = "rv64imafdc";
+ mmu-type = "riscv,sv48";
+ cpu-idle-states = <&CPU_RET_1_0>, <&CPU_NONRET_1_0>,
+ <&CLUSTER_RET_1>, <&CLUSTER_NONRET_1>;
+
+ cpu_intc11: interrupt-controller {
+ #interrupt-cells = <1>;
+ compatible = "riscv,cpu-intc";
+ interrupt-controller;
+ };
+ };
+
+ idle-states {
+ CPU_RET_0_0: cpu-retentive-0-0 {
+ compatible = "riscv,idle-state";
+ riscv,sbi-suspend-param = <0x10000000>;
+ entry-latency-us = <20>;
+ exit-latency-us = <40>;
+ min-residency-us = <80>;
+ };
+
+ CPU_NONRET_0_0: cpu-nonretentive-0-0 {
+ compatible = "riscv,idle-state";
+ riscv,sbi-suspend-param = <0x90000000>;
+ entry-latency-us = <250>;
+ exit-latency-us = <500>;
+ min-residency-us = <950>;
+ };
+
+ CLUSTER_RET_0: cluster-retentive-0 {
+ compatible = "riscv,idle-state";
+ riscv,sbi-suspend-param = <0x11000000>;
+ local-timer-stop;
+ entry-latency-us = <50>;
+ exit-latency-us = <100>;
+ min-residency-us = <250>;
+ wakeup-latency-us = <130>;
+ };
+
+ CLUSTER_NONRET_0: cluster-nonretentive-0 {
+ compatible = "riscv,idle-state";
+ riscv,sbi-suspend-param = <0x91000000>;
+ local-timer-stop;
+ entry-latency-us = <600>;
+ exit-latency-us = <1100>;
+ min-residency-us = <2700>;
+ wakeup-latency-us = <1500>;
+ };
+
+ CPU_RET_1_0: cpu-retentive-1-0 {
+ compatible = "riscv,idle-state";
+ riscv,sbi-suspend-param = <0x10000010>;
+ entry-latency-us = <20>;
+ exit-latency-us = <40>;
+ min-residency-us = <80>;
+ };
+
+ CPU_NONRET_1_0: cpu-nonretentive-1-0 {
+ compatible = "riscv,idle-state";
+ riscv,sbi-suspend-param = <0x90000010>;
+ entry-latency-us = <250>;
+ exit-latency-us = <500>;
+ min-residency-us = <950>;
+ };
+
+ CLUSTER_RET_1: cluster-retentive-1 {
+ compatible = "riscv,idle-state";
+ riscv,sbi-suspend-param = <0x11000010>;
+ local-timer-stop;
+ entry-latency-us = <50>;
+ exit-latency-us = <100>;
+ min-residency-us = <250>;
+ wakeup-latency-us = <130>;
+ };
+
+ CLUSTER_NONRET_1: cluster-nonretentive-1 {
+ compatible = "riscv,idle-state";
+ riscv,sbi-suspend-param = <0x91000010>;
+ local-timer-stop;
+ entry-latency-us = <600>;
+ exit-latency-us = <1100>;
+ min-residency-us = <2700>;
+ wakeup-latency-us = <1500>;
+ };
+ };
+ };
+
...
diff --git a/Documentation/devicetree/bindings/input/mediatek,mt6779-keypad.yaml b/Documentation/devicetree/bindings/input/mediatek,mt6779-keypad.yaml
new file mode 100644
index 000000000000..b1770640f94b
--- /dev/null
+++ b/Documentation/devicetree/bindings/input/mediatek,mt6779-keypad.yaml
@@ -0,0 +1,77 @@
+# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/input/mediatek,mt6779-keypad.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Mediatek's Keypad Controller device tree bindings
+
+maintainers:
+ - Fengping Yu <fengping.yu@mediatek.com>
+
+allOf:
+ - $ref: "/schemas/input/matrix-keymap.yaml#"
+
+description: |
+ Mediatek's Keypad controller is used to interface a SoC with a matrix-type
+ keypad device. The keypad controller supports multiple row and column lines.
+ A key can be placed at each intersection of a unique row and a unique column.
+ The keypad controller can sense a key-press and key-release and report the
+ event using an interrupt to the CPU.
+
+properties:
+ compatible:
+ oneOf:
+ - const: mediatek,mt6779-keypad
+ - items:
+ - enum:
+ - mediatek,mt6873-keypad
+ - const: mediatek,mt6779-keypad
+
+ reg:
+ maxItems: 1
+
+ interrupts:
+ maxItems: 1
+
+ clocks:
+ maxItems: 1
+
+ clock-names:
+ items:
+ - const: kpd
+
+ wakeup-source:
+ description: use any event on keypad as wakeup event
+ type: boolean
+
+ debounce-delay-ms:
+ maximum: 256
+ default: 16
+
+required:
+ - compatible
+ - reg
+ - interrupts
+ - clocks
+ - clock-names
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/input/input.h>
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+
+ soc {
+ #address-cells = <2>;
+ #size-cells = <2>;
+
+ keyboard@10010000 {
+ compatible = "mediatek,mt6779-keypad";
+ reg = <0 0x10010000 0 0x1000>;
+ interrupts = <GIC_SPI 75 IRQ_TYPE_EDGE_FALLING>;
+ clocks = <&clk26m>;
+ clock-names = "kpd";
+ };
+ };
diff --git a/Documentation/devicetree/bindings/input/mtk-pmic-keys.txt b/Documentation/devicetree/bindings/input/mtk-pmic-keys.txt
index 535d92885372..9d00f2a8e13a 100644
--- a/Documentation/devicetree/bindings/input/mtk-pmic-keys.txt
+++ b/Documentation/devicetree/bindings/input/mtk-pmic-keys.txt
@@ -9,7 +9,10 @@ For MT6397/MT6323 MFD bindings see:
Documentation/devicetree/bindings/mfd/mt6397.txt
Required properties:
-- compatible: "mediatek,mt6397-keys" or "mediatek,mt6323-keys"
+- compatible: Should be one of:
+ - "mediatek,mt6397-keys"
+ - "mediatek,mt6323-keys"
+ - "mediatek,mt6358-keys"
- linux,keycodes: See Documentation/devicetree/bindings/input/input.yaml
Optional Properties:
diff --git a/Documentation/devicetree/bindings/input/touchscreen/imagis,ist3038c.yaml b/Documentation/devicetree/bindings/input/touchscreen/imagis,ist3038c.yaml
new file mode 100644
index 000000000000..e3a2b871e50c
--- /dev/null
+++ b/Documentation/devicetree/bindings/input/touchscreen/imagis,ist3038c.yaml
@@ -0,0 +1,74 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/input/touchscreen/imagis,ist3038c.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Imagis IST30XXC family touchscreen controller bindings
+
+maintainers:
+ - Markuss Broks <markuss.broks@gmail.com>
+
+allOf:
+ - $ref: touchscreen.yaml#
+
+properties:
+ $nodename:
+ pattern: "^touchscreen@[0-9a-f]+$"
+
+ compatible:
+ enum:
+ - imagis,ist3038c
+
+ reg:
+ maxItems: 1
+
+ interrupts:
+ maxItems: 1
+
+ vdd-supply:
+ description: Power supply regulator for the chip
+
+ vddio-supply:
+ description: Power supply regulator for the I2C bus
+
+ touchscreen-size-x: true
+ touchscreen-size-y: true
+ touchscreen-fuzz-x: true
+ touchscreen-fuzz-y: true
+ touchscreen-inverted-x: true
+ touchscreen-inverted-y: true
+ touchscreen-swapped-x-y: true
+
+additionalProperties: false
+
+required:
+ - compatible
+ - reg
+ - interrupts
+ - touchscreen-size-x
+ - touchscreen-size-y
+
+examples:
+ - |
+ #include <dt-bindings/interrupt-controller/irq.h>
+ i2c {
+ #address-cells = <1>;
+ #size-cells = <0>;
+ touchscreen@50 {
+ compatible = "imagis,ist3038c";
+ reg = <0x50>;
+ interrupt-parent = <&gpio>;
+ interrupts = <13 IRQ_TYPE_EDGE_FALLING>;
+ vdd-supply = <&ldo1_reg>;
+ vddio-supply = <&ldo2_reg>;
+ touchscreen-size-x = <720>;
+ touchscreen-size-y = <1280>;
+ touchscreen-fuzz-x = <10>;
+ touchscreen-fuzz-y = <10>;
+ touchscreen-inverted-x;
+ touchscreen-inverted-y;
+ };
+ };
+
+...
diff --git a/Documentation/devicetree/bindings/riscv/cpus.yaml b/Documentation/devicetree/bindings/riscv/cpus.yaml
index aa5fb64d57eb..d632ac76532e 100644
--- a/Documentation/devicetree/bindings/riscv/cpus.yaml
+++ b/Documentation/devicetree/bindings/riscv/cpus.yaml
@@ -99,6 +99,14 @@ properties:
- compatible
- interrupt-controller
+ cpu-idle-states:
+ $ref: '/schemas/types.yaml#/definitions/phandle-array'
+ items:
+ maxItems: 1
+ description: |
+ List of phandles to idle state nodes supported
+ by this hart (see ./idle-states.yaml).
+
required:
- riscv,isa
- interrupt-controller
diff --git a/Documentation/devicetree/bindings/rtc/allwinner,sun6i-a31-rtc.yaml b/Documentation/devicetree/bindings/rtc/allwinner,sun6i-a31-rtc.yaml
index beeb90e55727..0b767fec39d8 100644
--- a/Documentation/devicetree/bindings/rtc/allwinner,sun6i-a31-rtc.yaml
+++ b/Documentation/devicetree/bindings/rtc/allwinner,sun6i-a31-rtc.yaml
@@ -16,16 +16,22 @@ properties:
compatible:
oneOf:
- - const: allwinner,sun6i-a31-rtc
- - const: allwinner,sun8i-a23-rtc
- - const: allwinner,sun8i-h3-rtc
- - const: allwinner,sun8i-r40-rtc
- - const: allwinner,sun8i-v3-rtc
- - const: allwinner,sun50i-h5-rtc
+ - enum:
+ - allwinner,sun6i-a31-rtc
+ - allwinner,sun8i-a23-rtc
+ - allwinner,sun8i-h3-rtc
+ - allwinner,sun8i-r40-rtc
+ - allwinner,sun8i-v3-rtc
+ - allwinner,sun50i-h5-rtc
+ - allwinner,sun50i-h6-rtc
+ - allwinner,sun50i-h616-rtc
+ - allwinner,sun50i-r329-rtc
- items:
- const: allwinner,sun50i-a64-rtc
- const: allwinner,sun8i-h3-rtc
- - const: allwinner,sun50i-h6-rtc
+ - items:
+ - const: allwinner,sun20i-d1-rtc
+ - const: allwinner,sun50i-r329-rtc
reg:
maxItems: 1
@@ -37,7 +43,12 @@ properties:
- description: RTC Alarm 1
clocks:
- maxItems: 1
+ minItems: 1
+ maxItems: 4
+
+ clock-names:
+ minItems: 1
+ maxItems: 4
clock-output-names:
minItems: 1
@@ -85,6 +96,7 @@ allOf:
enum:
- allwinner,sun8i-h3-rtc
- allwinner,sun50i-h5-rtc
+ - allwinner,sun50i-h6-rtc
then:
properties:
@@ -96,19 +108,68 @@ allOf:
properties:
compatible:
contains:
- const: allwinner,sun50i-h6-rtc
+ const: allwinner,sun50i-h616-rtc
then:
properties:
- clock-output-names:
+ clocks:
minItems: 3
maxItems: 3
+ items:
+ - description: Bus clock for register access
+ - description: 24 MHz oscillator
+ - description: 32 kHz clock from the CCU
+
+ clock-names:
+ minItems: 3
+ maxItems: 3
+ items:
+ - const: bus
+ - const: hosc
+ - const: pll-32k
+
+ required:
+ - clocks
+ - clock-names
- if:
properties:
compatible:
contains:
- const: allwinner,sun8i-r40-rtc
+ const: allwinner,sun50i-r329-rtc
+
+ then:
+ properties:
+ clocks:
+ minItems: 3
+ maxItems: 4
+ items:
+ - description: Bus clock for register access
+ - description: 24 MHz oscillator
+ - description: AHB parent for internal SPI clock
+ - description: External 32768 Hz oscillator
+
+ clock-names:
+ minItems: 3
+ maxItems: 4
+ items:
+ - const: bus
+ - const: hosc
+ - const: ahb
+ - const: ext-osc32k
+
+ required:
+ - clocks
+ - clock-names
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - allwinner,sun8i-r40-rtc
+ - allwinner,sun50i-h616-rtc
+ - allwinner,sun50i-r329-rtc
then:
properties:
@@ -127,7 +188,6 @@ required:
- compatible
- reg
- interrupts
- - clock-output-names
additionalProperties: false
diff --git a/Documentation/devicetree/bindings/rtc/atmel,at91sam9-rtc.txt b/Documentation/devicetree/bindings/rtc/atmel,at91sam9-rtc.txt
deleted file mode 100644
index 3f0e2a5950eb..000000000000
--- a/Documentation/devicetree/bindings/rtc/atmel,at91sam9-rtc.txt
+++ /dev/null
@@ -1,25 +0,0 @@
-Atmel AT91SAM9260 Real Time Timer
-
-Required properties:
-- compatible: should be one of the following:
- - "atmel,at91sam9260-rtt"
- - "microchip,sam9x60-rtt", "atmel,at91sam9260-rtt"
-- reg: should encode the memory region of the RTT controller
-- interrupts: rtt alarm/event interrupt
-- clocks: should contain the 32 KHz slow clk that will drive the RTT block.
-- atmel,rtt-rtc-time-reg: should encode the GPBR register used to store
- the time base when the RTT is used as an RTC.
- The first cell should point to the GPBR node and the second one
- encode the offset within the GPBR block (or in other words, the
- GPBR register used to store the time base).
-
-
-Example:
-
-rtt@fffffd20 {
- compatible = "atmel,at91sam9260-rtt";
- reg = <0xfffffd20 0x10>;
- interrupts = <1 4 7>;
- clocks = <&clk32k>;
- atmel,rtt-rtc-time-reg = <&gpbr 0x0>;
-};
diff --git a/Documentation/devicetree/bindings/rtc/atmel,at91sam9260-rtt.yaml b/Documentation/devicetree/bindings/rtc/atmel,at91sam9260-rtt.yaml
new file mode 100644
index 000000000000..0ef1b7ff4a77
--- /dev/null
+++ b/Documentation/devicetree/bindings/rtc/atmel,at91sam9260-rtt.yaml
@@ -0,0 +1,69 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+# Copyright (C) 2022 Microchip Technology, Inc. and its subsidiaries
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/rtc/atmel,at91sam9260-rtt.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Atmel AT91 RTT Device Tree Bindings
+
+allOf:
+ - $ref: "rtc.yaml#"
+
+maintainers:
+ - Alexandre Belloni <alexandre.belloni@bootlin.com>
+
+properties:
+ compatible:
+ oneOf:
+ - items:
+ - const: atmel,at91sam9260-rtt
+ - items:
+ - const: microchip,sam9x60-rtt
+ - const: atmel,at91sam9260-rtt
+ - items:
+ - const: microchip,sama7g5-rtt
+ - const: microchip,sam9x60-rtt
+ - const: atmel,at91sam9260-rtt
+
+ reg:
+ maxItems: 1
+
+ interrupts:
+ maxItems: 1
+
+ clocks:
+ maxItems: 1
+
+ atmel,rtt-rtc-time-reg:
+ $ref: /schemas/types.yaml#/definitions/phandle-array
+ items:
+ - items:
+ - description: Phandle to the GPBR node.
+ - description: Offset within the GPBR block.
+ description:
+ Should encode the GPBR register used to store the time base when the
+ RTT is used as an RTC. The first cell should point to the GPBR node
+ and the second one encodes the offset within the GPBR block (or in
+ other words, the GPBR register used to store the time base).
+
+required:
+ - compatible
+ - reg
+ - interrupts
+ - clocks
+ - atmel,rtt-rtc-time-reg
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/interrupt-controller/irq.h>
+
+ rtc@fffffd20 {
+ compatible = "atmel,at91sam9260-rtt";
+ reg = <0xfffffd20 0x10>;
+ interrupts = <1 IRQ_TYPE_LEVEL_HIGH 7>;
+ clocks = <&clk32k>;
+ atmel,rtt-rtc-time-reg = <&gpbr 0x0>;
+ };
diff --git a/Documentation/devicetree/bindings/vendor-prefixes.yaml b/Documentation/devicetree/bindings/vendor-prefixes.yaml
index 8fe2d934f949..01430973ecec 100644
--- a/Documentation/devicetree/bindings/vendor-prefixes.yaml
+++ b/Documentation/devicetree/bindings/vendor-prefixes.yaml
@@ -560,6 +560,8 @@ patternProperties:
description: Ingenieurburo Fur Ic-Technologie (I/F/I)
"^ilitek,.*":
description: ILI Technology Corporation (ILITEK)
+ "^imagis,.*":
+ description: Imagis Technologies Co., Ltd.
"^img,.*":
description: Imagination Technologies Ltd.
"^imi,.*":
diff --git a/Documentation/devicetree/bindings/watchdog/renesas,wdt.yaml b/Documentation/devicetree/bindings/watchdog/renesas,wdt.yaml
index 91a98ccd4226..d060438e1402 100644
--- a/Documentation/devicetree/bindings/watchdog/renesas,wdt.yaml
+++ b/Documentation/devicetree/bindings/watchdog/renesas,wdt.yaml
@@ -55,6 +55,11 @@ properties:
- renesas,r8a779a0-wdt # R-Car V3U
- const: renesas,rcar-gen3-wdt # R-Car Gen3 and RZ/G2
+ - items:
+ - enum:
+ - renesas,r8a779f0-wdt # R-Car S4-8
+ - const: renesas,rcar-gen4-wdt # R-Car Gen4
+
reg:
maxItems: 1
diff --git a/Documentation/filesystems/cifs/ksmbd.rst b/Documentation/filesystems/cifs/ksmbd.rst
index b0d354fd8066..1af600db2e70 100644
--- a/Documentation/filesystems/cifs/ksmbd.rst
+++ b/Documentation/filesystems/cifs/ksmbd.rst
@@ -82,10 +82,10 @@ Signing Update Supported.
Pre-authentication integrity Supported.
SMB3 encryption(CCM, GCM) Supported. (CCM and GCM128 supported, GCM256 in
progress)
-SMB direct(RDMA) Partially Supported. SMB3 Multi-channel is
- required to connect to Windows client.
+SMB direct(RDMA) Supported.
SMB3 Multi-channel Partially Supported. Planned to implement
replay/retry mechanisms in the future.
+Receive Side Scaling mode Supported.
SMB3.1.1 POSIX extension Supported.
ACLs Partially Supported. only DACLs available, SACLs
(auditing) is planned for the future. For
diff --git a/Documentation/filesystems/fsverity.rst b/Documentation/filesystems/fsverity.rst
index 1d831e3cbcb3..8cc536d08f51 100644
--- a/Documentation/filesystems/fsverity.rst
+++ b/Documentation/filesystems/fsverity.rst
@@ -549,7 +549,7 @@ Pagecache
~~~~~~~~~
For filesystems using Linux's pagecache, the ``->readpage()`` and
-``->readpages()`` methods must be modified to verify pages before they
+``->readahead()`` methods must be modified to verify pages before they
are marked Uptodate. Merely hooking ``->read_iter()`` would be
insufficient, since ``->read_iter()`` is not used for memory maps.
@@ -611,7 +611,7 @@ workqueue, and then the workqueue work does the decryption or
verification. Finally, pages where no decryption or verity error
occurred are marked Uptodate, and the pages are unlocked.
-Files on ext4 and f2fs may contain holes. Normally, ``->readpages()``
+Files on ext4 and f2fs may contain holes. Normally, ``->readahead()``
simply zeroes holes and sets the corresponding pages Uptodate; no bios
are issued. To prevent this case from bypassing fs-verity, these
filesystems use fsverity_verify_page() to verify hole pages.
@@ -778,7 +778,7 @@ weren't already directly answered in other parts of this document.
- To prevent bypassing verification, pages must not be marked
Uptodate until they've been verified. Currently, each
filesystem is responsible for marking pages Uptodate via
- ``->readpages()``. Therefore, currently it's not possible for
+ ``->readahead()``. Therefore, currently it's not possible for
the VFS to do the verification on its own. Changing this would
require significant changes to the VFS and all filesystems.
diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst
index 2998cec9af4b..c26d854275a0 100644
--- a/Documentation/filesystems/locking.rst
+++ b/Documentation/filesystems/locking.rst
@@ -241,8 +241,6 @@ prototypes::
int (*writepages)(struct address_space *, struct writeback_control *);
bool (*dirty_folio)(struct address_space *, struct folio *folio);
void (*readahead)(struct readahead_control *);
- int (*readpages)(struct file *filp, struct address_space *mapping,
- struct list_head *pages, unsigned nr_pages);
int (*write_begin)(struct file *, struct address_space *mapping,
loff_t pos, unsigned len, unsigned flags,
struct page **pagep, void **fsdata);
@@ -274,7 +272,6 @@ readpage: yes, unlocks shared
writepages:
dirty_folio maybe
readahead: yes, unlocks shared
-readpages: no shared
write_begin: locks the page exclusive
write_end: yes, unlocks exclusive
bmap:
@@ -300,9 +297,6 @@ completion.
->readahead() unlocks the pages that I/O is attempted on like ->readpage().
-->readpages() populates the pagecache with the passed pages and starts
-I/O against them. They come unlocked upon I/O completion.
-
->writepage() is used for two purposes: for "memory cleansing" and for
"sync". These are quite different operations and the behaviour may differ
depending upon the mode.
diff --git a/Documentation/filesystems/netfs_library.rst b/Documentation/filesystems/netfs_library.rst
index 4f373a8ec47b..69f00179fdfe 100644
--- a/Documentation/filesystems/netfs_library.rst
+++ b/Documentation/filesystems/netfs_library.rst
@@ -7,6 +7,8 @@ Network Filesystem Helper Library
.. Contents:
- Overview.
+ - Per-inode context.
+ - Inode context helper functions.
- Buffered read helpers.
- Read helper functions.
- Read helper structures.
@@ -28,6 +30,69 @@ Note that the library module doesn't link against local caching directly, so
access must be provided by the netfs.
+Per-Inode Context
+=================
+
+The network filesystem helper library needs a place to store a bit of state for
+its use on each netfs inode it is helping to manage. To this end, a context
+structure is defined::
+
+ struct netfs_i_context {
+ const struct netfs_request_ops *ops;
+ struct fscache_cookie *cache;
+ };
+
+A network filesystem that wants to use the netfs library must place one of these
+directly after the VFS ``struct inode`` it allocates, usually as part of its
+own struct. This can be done in a way similar to the following::
+
+ struct my_inode {
+ struct {
+ /* These must be contiguous */
+ struct inode vfs_inode;
+ struct netfs_i_context netfs_ctx;
+ };
+ ...
+ };
+
+This allows netfslib to find its state by a simple offset from the inode pointer,
+thereby allowing the netfslib helper functions to be pointed to directly by the
+VFS/VM operation tables.
+
+The structure contains the following fields:
+
+ * ``ops``
+
+ The set of operations provided by the network filesystem to netfslib.
+
+ * ``cache``
+
+ Local caching cookie, or NULL if no caching is enabled. This field does not
+ exist if fscache is disabled.
+
+
+Inode Context Helper Functions
+------------------------------
+
+To help deal with the per-inode context, a number of helper functions are
+provided. Firstly, a function to perform basic initialisation on a context and
+set the operations table pointer::
+
+ void netfs_i_context_init(struct inode *inode,
+ const struct netfs_request_ops *ops);
+
+then two functions to cast between the VFS inode structure and the netfs
+context::
+
+ struct netfs_i_context *netfs_i_context(struct inode *inode);
+ struct inode *netfs_inode(struct netfs_i_context *ctx);
+
+and finally, a function to get the cache cookie pointer from the context
+attached to an inode (or NULL if fscache is disabled)::
+
+ struct fscache_cookie *netfs_i_cookie(struct inode *inode);
+
+
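A filesystem would typically call the initialiser when it sets up an inode,
for instance (a minimal sketch; ``my_inode``, ``my_icache`` and
``my_netfs_ops`` are hypothetical)::

    static struct inode *my_alloc_inode(struct super_block *sb)
    {
            struct my_inode *mi = kmem_cache_alloc(my_icache, GFP_KERNEL);

            if (!mi)
                    return NULL;
            netfs_i_context_init(&mi->vfs_inode, &my_netfs_ops);
            return &mi->vfs_inode;
    }
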
Buffered Read Helpers
=====================
@@ -70,38 +135,22 @@ Read Helper Functions
Three read helpers are provided::
- void netfs_readahead(struct readahead_control *ractl,
- const struct netfs_read_request_ops *ops,
- void *netfs_priv);
+ void netfs_readahead(struct readahead_control *ractl);
int netfs_readpage(struct file *file,
- struct folio *folio,
- const struct netfs_read_request_ops *ops,
- void *netfs_priv);
+ struct page *page);
int netfs_write_begin(struct file *file,
struct address_space *mapping,
loff_t pos,
unsigned int len,
unsigned int flags,
struct folio **_folio,
- void **_fsdata,
- const struct netfs_read_request_ops *ops,
- void *netfs_priv);
-
-Each corresponds to a VM operation, with the addition of a couple of parameters
-for the use of the read helpers:
+ void **_fsdata);
- * ``ops``
-
- A table of operations through which the helpers can talk to the filesystem.
-
- * ``netfs_priv``
+Each corresponds to a VM address space operation. These operations use the
+state in the per-inode context.
- Filesystem private data (can be NULL).
-
-Both of these values will be stored into the read request structure.
-
-For ->readahead() and ->readpage(), the network filesystem should just jump
-into the corresponding read helper; whereas for ->write_begin(), it may be a
+For ->readahead() and ->readpage(), the network filesystem should just point
+directly at the corresponding read helper; whereas for ->write_begin(), it may be a
little more complicated as the network filesystem might want to flush
conflicting writes or track dirty data and needs to put the acquired folio if
an error occurs after calling the helper.
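With the state held in the per-inode context, a network filesystem can wire
its address_space operations straight to the helpers, along these lines (a
sketch; ``my_aops`` and the surrounding filesystem are hypothetical)::

    static const struct address_space_operations my_aops = {
            .readahead      = netfs_readahead,
            .readpage       = netfs_readpage,
            /* ... write_begin and the other methods ... */
    };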
@@ -116,7 +165,7 @@ occurs, the request will get partially completed if sufficient data is read.
Additionally, there is::
- * void netfs_subreq_terminated(struct netfs_read_subrequest *subreq,
+ * void netfs_subreq_terminated(struct netfs_io_subrequest *subreq,
ssize_t transferred_or_error,
bool was_async);
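As a sketch, an asynchronous completion routine might finish its subrequest
like so (the ``my_rpc_call`` structure and its fields are assumptions for
illustration)::

    static void my_read_done(struct my_rpc_call *call)
    {
            struct netfs_io_subrequest *subreq = call->subreq;

            /* Pass the byte count on success or a negative errno on failure. */
            if (call->error < 0)
                    netfs_subreq_terminated(subreq, call->error, true);
            else
                    netfs_subreq_terminated(subreq, call->bytes_received, true);
    }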
@@ -132,7 +181,7 @@ Read Helper Structures
The read helpers make use of a couple of structures to maintain the state of
the read. The first is a structure that manages a read request as a whole::
- struct netfs_read_request {
+ struct netfs_io_request {
struct inode *inode;
struct address_space *mapping;
struct netfs_cache_resources cache_resources;
@@ -140,7 +189,7 @@ the read. The first is a structure that manages a read request as a whole::
loff_t start;
size_t len;
loff_t i_size;
- const struct netfs_read_request_ops *netfs_ops;
+ const struct netfs_request_ops *netfs_ops;
unsigned int debug_id;
...
};
@@ -187,8 +236,8 @@ The above fields are the ones the netfs can use. They are:
The second structure is used to manage individual slices of the overall read
request::
- struct netfs_read_subrequest {
- struct netfs_read_request *rreq;
+ struct netfs_io_subrequest {
+ struct netfs_io_request *rreq;
loff_t start;
size_t len;
size_t transferred;
@@ -244,32 +293,26 @@ Read Helper Operations
The network filesystem must provide the read helpers with a table of operations
through which it can issue requests and negotiate::
- struct netfs_read_request_ops {
- void (*init_rreq)(struct netfs_read_request *rreq, struct file *file);
- bool (*is_cache_enabled)(struct inode *inode);
- int (*begin_cache_operation)(struct netfs_read_request *rreq);
- void (*expand_readahead)(struct netfs_read_request *rreq);
- bool (*clamp_length)(struct netfs_read_subrequest *subreq);
- void (*issue_op)(struct netfs_read_subrequest *subreq);
- bool (*is_still_valid)(struct netfs_read_request *rreq);
+ struct netfs_request_ops {
+ void (*init_request)(struct netfs_io_request *rreq, struct file *file);
+ int (*begin_cache_operation)(struct netfs_io_request *rreq);
+ void (*expand_readahead)(struct netfs_io_request *rreq);
+ bool (*clamp_length)(struct netfs_io_subrequest *subreq);
+ void (*issue_read)(struct netfs_io_subrequest *subreq);
+ bool (*is_still_valid)(struct netfs_io_request *rreq);
int (*check_write_begin)(struct file *file, loff_t pos, unsigned len,
struct folio *folio, void **_fsdata);
- void (*done)(struct netfs_read_request *rreq);
+ void (*done)(struct netfs_io_request *rreq);
void (*cleanup)(struct address_space *mapping, void *netfs_priv);
};
The operations are as follows:
- * ``init_rreq()``
+ * ``init_request()``
[Optional] This is called to initialise the request structure. It is given
the file for reference and can modify the ->netfs_priv value.
- * ``is_cache_enabled()``
-
- [Required] This is called by netfs_write_begin() to ask if the file is being
- cached. It should return true if it is being cached and false otherwise.
-
* ``begin_cache_operation()``
[Optional] This is called to ask the network filesystem to call into the
@@ -305,7 +348,7 @@ The operations are as follows:
This should return 0 on success and an error code on error.
- * ``issue_op()``
+ * ``issue_read()``
[Required] The helpers use this to dispatch a subrequest to the server for
reading. In the subrequest, ->start, ->len and ->transferred indicate what
@@ -420,12 +463,12 @@ The network filesystem's ->begin_cache_operation() method is called to set up a
cache and this must call into the cache to do the work. If using fscache, for
example, the cache would call::
- int fscache_begin_read_operation(struct netfs_read_request *rreq,
+ int fscache_begin_read_operation(struct netfs_io_request *rreq,
struct fscache_cookie *cookie);
passing in the request pointer and the cookie corresponding to the file.
-The netfs_read_request object contains a place for the cache to hang its
+The netfs_io_request object contains a place for the cache to hang its
state::
struct netfs_cache_resources {
@@ -443,7 +486,7 @@ operation table looks like the following::
void (*expand_readahead)(struct netfs_cache_resources *cres,
loff_t *_start, size_t *_len, loff_t i_size);
- enum netfs_read_source (*prepare_read)(struct netfs_read_subrequest *subreq,
+ enum netfs_io_source (*prepare_read)(struct netfs_io_subrequest *subreq,
loff_t i_size);
int (*read)(struct netfs_cache_resources *cres,
@@ -562,4 +605,5 @@ API Function Reference
======================
.. kernel-doc:: include/linux/netfs.h
-.. kernel-doc:: fs/netfs/read_helper.c
+.. kernel-doc:: fs/netfs/buffered_read.c
+.. kernel-doc:: fs/netfs/io.c
diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
index 4f14edf93941..794bd1a66bfb 100644
--- a/Documentation/filesystems/vfs.rst
+++ b/Documentation/filesystems/vfs.rst
@@ -726,8 +726,6 @@ cache in your filesystem. The following members are defined:
int (*writepages)(struct address_space *, struct writeback_control *);
bool (*dirty_folio)(struct address_space *, struct folio *);
void (*readahead)(struct readahead_control *);
- int (*readpages)(struct file *filp, struct address_space *mapping,
- struct list_head *pages, unsigned nr_pages);
int (*write_begin)(struct file *, struct address_space *mapping,
loff_t pos, unsigned len, unsigned flags,
struct page **pagep, void **fsdata);
@@ -817,15 +815,6 @@ cache in your filesystem. The following members are defined:
completes successfully. Setting PageError on any page will be
ignored; simply unlock the page if an I/O error occurs.
-``readpages``
- called by the VM to read pages associated with the address_space
- object. This is essentially just a vector version of readpage.
- Instead of just one page, several pages are requested.
- readpages is only used for read-ahead, so read errors are
- ignored. If anything goes wrong, feel free to give up.
- This interface is deprecated and will be removed by the end of
- 2020; implement readahead instead.
-
``write_begin``
Called by the generic buffered write code to ask the filesystem
to prepare to write len bytes at the given offset in the file.
diff --git a/Documentation/kbuild/kbuild.rst b/Documentation/kbuild/kbuild.rst
index 2d1fc03d346e..ef19b9c13523 100644
--- a/Documentation/kbuild/kbuild.rst
+++ b/Documentation/kbuild/kbuild.rst
@@ -77,6 +77,17 @@ HOSTLDLIBS
----------
Additional libraries to link against when building host programs.
+.. _userkbuildflags:
+
+USERCFLAGS
+----------
+Additional options used for $(CC) when compiling userprogs.
+
+USERLDFLAGS
+-----------
+Additional options used for $(LD) when linking userprogs. userprogs are linked
+with CC, so $(USERLDFLAGS) should include the "-Wl," prefix as applicable.
+
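For example, a hypothetical invocation might look like::

    $ make USERCFLAGS=-Wextra USERLDFLAGS=-Wl,--gc-sections
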
KBUILD_KCONFIG
--------------
Set the top-level Kconfig file to the value of this environment
diff --git a/Documentation/kbuild/llvm.rst b/Documentation/kbuild/llvm.rst
index d32616891dcf..b854bb413164 100644
--- a/Documentation/kbuild/llvm.rst
+++ b/Documentation/kbuild/llvm.rst
@@ -49,17 +49,36 @@ example: ::
LLVM Utilities
--------------
-LLVM has substitutes for GNU binutils utilities. Kbuild supports ``LLVM=1``
-to enable them. ::
-
- make LLVM=1
-
-They can be enabled individually. The full list of the parameters: ::
+LLVM has substitutes for GNU binutils utilities. They can be enabled individually.
+The full list of supported make variables::
make CC=clang LD=ld.lld AR=llvm-ar NM=llvm-nm STRIP=llvm-strip \
OBJCOPY=llvm-objcopy OBJDUMP=llvm-objdump READELF=llvm-readelf \
HOSTCC=clang HOSTCXX=clang++ HOSTAR=llvm-ar HOSTLD=ld.lld
+To simplify the above command, Kbuild supports the ``LLVM`` variable::
+
+ make LLVM=1
+
+If your LLVM tools are not available in your PATH, you can supply their
+location using the LLVM variable with a trailing slash::
+
+ make LLVM=/path/to/llvm/
+
+which will use ``/path/to/llvm/clang``, ``/path/to/llvm/ld.lld``, etc.
+
+If your LLVM tools have a version suffix and you want to test with that
+explicit version rather than the unsuffixed executables like ``LLVM=1``, you
+can pass the suffix using the ``LLVM`` variable::
+
+ make LLVM=-14
+
+which will use ``clang-14``, ``ld.lld-14``, etc.
+
+``LLVM=0`` is not the same as omitting ``LLVM`` altogether; it will behave like
+``LLVM=1``. If you only wish to use certain LLVM utilities, use their respective
+make variables.
+
The integrated assembler is enabled by default. You can pass ``LLVM_IAS=0`` to
disable it.
diff --git a/Documentation/kbuild/makefiles.rst b/Documentation/kbuild/makefiles.rst
index b008b90b92c9..11a296e52d68 100644
--- a/Documentation/kbuild/makefiles.rst
+++ b/Documentation/kbuild/makefiles.rst
@@ -982,6 +982,8 @@ The syntax is quite similar. The difference is to use "userprogs" instead of
When linking bpfilter_umh, it will be passed the extra option -static.
+ From command line, :ref:`USERCFLAGS and USERLDFLAGS <userkbuildflags>` will also be used.
+
5.4 When userspace programs are actually built
----------------------------------------------
diff --git a/Documentation/locking/locktypes.rst b/Documentation/locking/locktypes.rst
index bfa75ea1b66a..9933faad4771 100644
--- a/Documentation/locking/locktypes.rst
+++ b/Documentation/locking/locktypes.rst
@@ -211,9 +211,6 @@ raw_spinlock_t and spinlock_t
raw_spinlock_t
--------------
-raw_spinlock_t is a strict spinning lock implementation regardless of the
-kernel configuration including PREEMPT_RT enabled kernels.
-
raw_spinlock_t is a strict spinning lock implementation in all kernels,
including PREEMPT_RT kernels. Use raw_spinlock_t only in real critical
core code, low-level interrupt handling and places where disabling
diff --git a/Documentation/maintainer/index.rst b/Documentation/maintainer/index.rst
index f0a60435b124..3e03283c144e 100644
--- a/Documentation/maintainer/index.rst
+++ b/Documentation/maintainer/index.rst
@@ -12,6 +12,7 @@ additions to this manual.
configure-git
rebasing-and-merging
pull-requests
+ messy-diffstat
maintainer-entry-profile
modifying-patches
diff --git a/Documentation/maintainer/messy-diffstat.rst b/Documentation/maintainer/messy-diffstat.rst
new file mode 100644
index 000000000000..c015f66d7621
--- /dev/null
+++ b/Documentation/maintainer/messy-diffstat.rst
@@ -0,0 +1,96 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================================
+Handling messy pull-request diffstats
+=====================================
+
+Subsystem maintainers routinely use ``git request-pull`` as part of the
+process of sending work upstream. Normally, the result includes a nice
+diffstat that shows which files will be touched and how much of each will
+be changed. Occasionally, though, a repository with a relatively
+complicated development history will yield a massive diffstat containing a
+great deal of unrelated work. The result looks ugly and obscures what the
+pull request is actually doing. This document describes what is happening
+and how to fix things up; it is derived from The Wisdom of Linus Torvalds,
+found in Linus1_ and Linus2_.
+
+.. _Linus1: https://lore.kernel.org/lkml/CAHk-=wg3wXH2JNxkQi+eLZkpuxqV+wPiHhw_Jf7ViH33Sw7PHA@mail.gmail.com/
+.. _Linus2: https://lore.kernel.org/lkml/CAHk-=wgXbSa8yq8Dht8at+gxb_idnJ7X5qWZQWRBN4_CUPr=eQ@mail.gmail.com/
+
+A Git development history proceeds as a series of commits. In a simplified
+manner, mainline kernel development looks like this::
+
+ ... vM --- vN-rc1 --- vN-rc2 --- vN-rc3 --- ... --- vN-rc7 --- vN
+
+If one wants to see what has changed between two points, a command like
+this will do the job::
+
+ $ git diff --stat --summary vN-rc2..vN-rc3
+
+Here, there are two clear points in the history; Git will essentially
+"subtract" the beginning point from the end point and display the resulting
+differences. The requested operation is unambiguous and easy enough to
+understand.
+
+When a subsystem maintainer creates a branch and commits changes to it, the
+result in the simplest case is a history that looks like::
+
+ ... vM --- vN-rc1 --- vN-rc2 --- vN-rc3 --- ... --- vN-rc7 --- vN
+ |
+ +-- c1 --- c2 --- ... --- cN
+
+If that maintainer now uses ``git diff`` to see what has changed between
+the mainline branch (let's call it "linus") and cN, there are still two
+clear endpoints, and the result is as expected. So a pull request
+generated with ``git request-pull`` will also be as expected. But now
+consider a slightly more complex development history::
+
+ ... vM --- vN-rc1 --- vN-rc2 --- vN-rc3 --- ... --- vN-rc7 --- vN
+ | |
+ | +-- c1 --- c2 --- ... --- cN
+ | /
+ +-- x1 --- x2 --- x3
+
+Our maintainer has created one branch at vN-rc1 and another at vN-rc2; the
+two were then subsequently merged into c2. Now a pull request generated
+for cN may end up being messy indeed, and developers often end up wondering
+why.
+
+What is happening here is that there are no longer two clear end points for
+the ``git diff`` operation to use. The development culminating in cN
+started in two different places; to generate the diffstat, ``git diff``
+ends up having to pick one of them and hope for the best. If the diffstat
+starts at vN-rc1, it may end up including all of the changes between there
+and the second origin end point (vN-rc2), which is certainly not what our
+maintainer had in mind. With all of that extra junk in the diffstat, it
+may be impossible to tell what actually happened in the changes leading up
+to cN.
+
+Maintainers often try to resolve this problem by, for example, rebasing the
+branch or performing another merge with the linus branch, then recreating
+the pull request. This approach tends not to lead to joy at the receiving
+end of that pull request; rebasing and/or merging just before pushing
+upstream is a well-known way to get a grumpy response.
+
+So what is to be done? The best response when confronted with this
+situation is indeed to do a merge with the branch you intend your work
+to be pulled into, but to do it privately, as if it were a source of
+shame. Create a new, throwaway branch and do the merge there::
+
+ ... vM --- vN-rc1 --- vN-rc2 --- vN-rc3 --- ... --- vN-rc7 --- vN
+ | | |
+ | +-- c1 --- c2 --- ... --- cN |
+ | / | |
+ +-- x1 --- x2 --- x3 +------------+-- TEMP
+
+The merge operation resolves all of the complications resulting from the
+multiple beginning points, yielding a coherent result that contains only
+the differences from the mainline branch. Now it will be possible to
+generate a diffstat with the desired information::
+
+ $ git diff -C --stat --summary linus..TEMP
+
+Save the output from this command, then simply delete the TEMP branch;
+definitely do not expose it to the outside world. Take the saved diffstat
+output and edit it into the messy pull request, yielding a result that
+shows what is really going on. That request can then be sent upstream.
diff --git a/Documentation/riscv/index.rst b/Documentation/riscv/index.rst
index ea915c196048..e23b876ad6eb 100644
--- a/Documentation/riscv/index.rst
+++ b/Documentation/riscv/index.rst
@@ -7,7 +7,6 @@ RISC-V architecture
boot-image-header
vm-layout
- pmu
patch-acceptance
features
diff --git a/Documentation/sphinx/kernel_abi.py b/Documentation/sphinx/kernel_abi.py
index 4392b3cb4020..b5feb5b1d905 100644
--- a/Documentation/sphinx/kernel_abi.py
+++ b/Documentation/sphinx/kernel_abi.py
@@ -128,6 +128,7 @@ class KernelCmd(Directive):
return out
def nestedParse(self, lines, fname):
+ env = self.state.document.settings.env
content = ViewList()
node = nodes.section()
@@ -137,7 +138,7 @@ class KernelCmd(Directive):
code_block += "\n " + l
lines = code_block + "\n\n"
- line_regex = re.compile("^#define LINENO (\S+)\#([0-9]+)$")
+ line_regex = re.compile("^\.\. LINENO (\S+)\#([0-9]+)$")
ln = 0
n = 0
f = fname
@@ -154,6 +155,9 @@ class KernelCmd(Directive):
self.do_parse(content, node)
content = ViewList()
+ # Add the file to Sphinx build dependencies
+ env.note_dependency(os.path.abspath(f))
+
f = new_f
# sphinx counts lines from 0
diff --git a/Documentation/sphinx/kernel_feat.py b/Documentation/sphinx/kernel_feat.py
index 8138d69a6987..27b701ed3681 100644
--- a/Documentation/sphinx/kernel_feat.py
+++ b/Documentation/sphinx/kernel_feat.py
@@ -33,6 +33,7 @@ u"""
import codecs
import os
+import re
import subprocess
import sys
@@ -82,7 +83,7 @@ class KernelFeat(Directive):
env = doc.settings.env
cwd = path.dirname(doc.current_source)
- cmd = "get_feat.pl rest --dir "
+ cmd = "get_feat.pl rest --enable-fname --dir "
cmd += self.arguments[0]
if len(self.arguments) > 1:
@@ -102,7 +103,22 @@ class KernelFeat(Directive):
shell_env["srctree"] = srctree
lines = self.runCmd(cmd, shell=True, cwd=cwd, env=shell_env)
- nodeList = self.nestedParse(lines, fname)
+
+ line_regex = re.compile("^\.\. FILE (\S+)$")
+
+ out_lines = ""
+
+ for line in lines.split("\n"):
+ match = line_regex.search(line)
+ if match:
+ fname = match.group(1)
+
+ # Add the file to Sphinx build dependencies
+ env.note_dependency(os.path.abspath(fname))
+ else:
+ out_lines += line + "\n"
+
+ nodeList = self.nestedParse(out_lines, fname)
return nodeList
def runCmd(self, cmd, **kwargs):
diff --git a/Documentation/sphinx/kernel_include.py b/Documentation/sphinx/kernel_include.py
index f523aa68a36b..abe768088377 100755
--- a/Documentation/sphinx/kernel_include.py
+++ b/Documentation/sphinx/kernel_include.py
@@ -59,6 +59,7 @@ class KernelInclude(Include):
u"""KernelInclude (``kernel-include``) directive"""
def run(self):
+ env = self.state.document.settings.env
path = os.path.realpath(
os.path.expandvars(self.arguments[0]))
@@ -70,6 +71,8 @@ class KernelInclude(Include):
self.arguments[0] = path
+ env.note_dependency(os.path.abspath(path))
+
#return super(KernelInclude, self).run() # won't work, see HINTs in _run()
return self._run()
diff --git a/Documentation/sphinx/kerneldoc.py b/Documentation/sphinx/kerneldoc.py
index 8189c33b9dda..9395892c7ba3 100644
--- a/Documentation/sphinx/kerneldoc.py
+++ b/Documentation/sphinx/kerneldoc.py
@@ -130,7 +130,7 @@ class KernelDocDirective(Directive):
result = ViewList()
lineoffset = 0;
- line_regex = re.compile("^#define LINENO ([0-9]+)$")
+ line_regex = re.compile("^\.\. LINENO ([0-9]+)$")
for line in lines:
match = line_regex.search(line)
if match:
diff --git a/Documentation/sphinx/kfigure.py b/Documentation/sphinx/kfigure.py
index 24d2b2addcce..cefdbb7e7523 100644
--- a/Documentation/sphinx/kfigure.py
+++ b/Documentation/sphinx/kfigure.py
@@ -212,7 +212,7 @@ def setupTools(app):
if convert_cmd:
kernellog.verbose(app, "use convert(1) from: " + convert_cmd)
else:
- kernellog.warn(app,
+ kernellog.verbose(app,
"Neither inkscape(1) nor convert(1) found.\n"
"For SVG to PDF conversion, "
"install either Inkscape (https://inkscape.org/) (preferred) or\n"
@@ -296,8 +296,10 @@ def convert_image(img_node, translator, src_fname=None):
if translator.builder.format == 'latex':
if not inkscape_cmd and convert_cmd is None:
- kernellog.verbose(app,
- "no SVG to PDF conversion available / include SVG raw.")
+ kernellog.warn(app,
+ "no SVG to PDF conversion available / include SVG raw."
+ "\nIncluding large raw SVGs can cause xelatex error."
+ "\nInstall Inkscape (preferred) or ImageMagick.")
img_node.replace_self(file2literal(src_fname))
else:
dst_fname = path.join(translator.builder.outdir, fname + '.pdf')
diff --git a/Documentation/sphinx/requirements.txt b/Documentation/sphinx/requirements.txt
index 9a35f50798a6..2c573541ab71 100644
--- a/Documentation/sphinx/requirements.txt
+++ b/Documentation/sphinx/requirements.txt
@@ -1,2 +1,4 @@
+# jinja2>=3.1 is not compatible with Sphinx<4.0
+jinja2<3.1
sphinx_rtd_theme
Sphinx==2.4.4
diff --git a/Documentation/trace/user_events.rst b/Documentation/trace/user_events.rst
index bddedabaca80..c180936f49fc 100644
--- a/Documentation/trace/user_events.rst
+++ b/Documentation/trace/user_events.rst
@@ -7,7 +7,7 @@ user_events: User-based Event Tracing
Overview
--------
User based trace events allow user processes to create events and trace data
-that can be viewed via existing tools, such as ftrace, perf and eBPF.
+that can be viewed via existing tools, such as ftrace and perf.
To enable this feature, build your kernel with CONFIG_USER_EVENTS=y.
Programs can view status of the events via
@@ -67,8 +67,7 @@ The command string format is as follows::
Supported Flags
^^^^^^^^^^^^^^^
-**BPF_ITER** - EBPF programs attached to this event will get the raw iovec
-struct instead of any data copies for max performance.
+None yet
Field Format
^^^^^^^^^^^^
@@ -160,7 +159,7 @@ The following values are defined to aid in checking what has been attached:
**EVENT_STATUS_FTRACE** - Bit set if ftrace has been attached (Bit 0).
-**EVENT_STATUS_PERF** - Bit set if perf/eBPF has been attached (Bit 1).
+**EVENT_STATUS_PERF** - Bit set if perf has been attached (Bit 1).
Writing Data
------------
@@ -204,13 +203,6 @@ It's advised for user programs to do the following::
**NOTE:** *The write_index is not emitted out into the trace being recorded.*
-EBPF
-----
-EBPF programs that attach to a user-based event tracepoint are given a pointer
-to a struct user_bpf_context. The bpf context contains the data type (which can
-be a user or kernel buffer, or can be a pointer to the iovec) and the data
-length that was emitted (minus the write_index).
-
Example Code
------------
See sample code in samples/user_events.
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 07a45474abe9..d13fa6600467 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -151,12 +151,6 @@ In order to create user controlled virtual machines on S390, check
KVM_CAP_S390_UCONTROL and use the flag KVM_VM_S390_UCONTROL as
privileged user (CAP_SYS_ADMIN).
-To use hardware assisted virtualization on MIPS (VZ ASE) rather than
-the default trap & emulate implementation (which changes the virtual
-memory layout to fit in user mode), check KVM_CAP_MIPS_VZ and use the
-flag KVM_VM_MIPS_VZ.
-
-
On arm64, the physical address size for a VM (IPA Size limit) is limited
to 40bits by default. The limit can be configured if the host supports the
extension KVM_CAP_ARM_VM_IPA_SIZE. When supported, use
@@ -4081,6 +4075,11 @@ x2APIC MSRs are always allowed, independent of the ``default_allow`` setting,
and their behavior depends on the ``X2APIC_ENABLE`` bit of the APIC base
register.
+.. warning::
+ MSR accesses coming from nested vmentry/vmexit are not filtered.
+ This includes both writes to individual VMCS fields and reads/writes
+ through the MSR lists pointed to by the VMCS.
+
If a bit is within one of the defined ranges, read and write accesses are
guarded by the bitmap's value for the MSR index if the kind of access
is included in the ``struct kvm_msr_filter_range`` flags. If no range
@@ -5293,6 +5292,10 @@ type values:
KVM_XEN_VCPU_ATTR_TYPE_VCPU_INFO
Sets the guest physical address of the vcpu_info for a given vCPU.
+ As with the shared_info page for the VM, the corresponding page may be
+ dirtied at any time if event channel interrupt delivery is enabled, so
+ userspace should always assume that the page is dirty without relying
+ on dirty logging.
KVM_XEN_VCPU_ATTR_TYPE_VCPU_TIME_INFO
Sets the guest physical address of an additional pvclock structure
@@ -7719,3 +7722,49 @@ only be invoked on a VM prior to the creation of VCPUs.
At this time, KVM_PMU_CAP_DISABLE is the only capability. Setting
this capability will disable PMU virtualization for that VM. Usermode
should adjust CPUID leaf 0xA to reflect that the PMU is disabled.
+
+9. Known KVM API problems
+=========================
+
+In some cases, KVM's API has some inconsistencies or common pitfalls
+that userspace needs to be aware of. This section details some of
+these issues.
+
+Most of them are architecture specific, so the section is split by
+architecture.
+
+9.1. x86
+--------
+
+``KVM_GET_SUPPORTED_CPUID`` issues
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+In general, ``KVM_GET_SUPPORTED_CPUID`` is designed so that it is possible
+to take its result and pass it directly to ``KVM_SET_CPUID2``. This section
+documents some cases in which that requires some care.
+
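+A minimal sketch of that round trip (illustrative only: error handling is
+omitted, ``kvm_fd`` is an open ``/dev/kvm`` and ``vcpu_fd`` a vCPU file
+descriptor, and the entry count is an assumption)::
+
+    #include <linux/kvm.h>
+    #include <sys/ioctl.h>
+    #include <stdlib.h>
+
+    int nent = 128;    /* assumed large enough for this host */
+    struct kvm_cpuid2 *cpuid;
+
+    cpuid = calloc(1, sizeof(*cpuid) + nent * sizeof(struct kvm_cpuid_entry2));
+    cpuid->nent = nent;
+    ioctl(kvm_fd, KVM_GET_SUPPORTED_CPUID, cpuid);   /* system ioctl */
+    /* ...apply any adjustments described below... */
+    ioctl(vcpu_fd, KVM_SET_CPUID2, cpuid);           /* vCPU ioctl */
+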
+Local APIC features
+~~~~~~~~~~~~~~~~~~~
+
+CPU[EAX=1]:ECX[21] (X2APIC) is reported by ``KVM_GET_SUPPORTED_CPUID``,
+but it can only be enabled if ``KVM_CREATE_IRQCHIP`` or
+``KVM_ENABLE_CAP(KVM_CAP_IRQCHIP_SPLIT)`` is used to enable in-kernel
+emulation of the local APIC.
+
+The same is true for the ``KVM_FEATURE_PV_UNHALT`` paravirtualized feature.
+
+CPU[EAX=1]:ECX[24] (TSC_DEADLINE) is not reported by ``KVM_GET_SUPPORTED_CPUID``.
+It can be enabled if ``KVM_CAP_TSC_DEADLINE_TIMER`` is present and the kernel
+has enabled in-kernel emulation of the local APIC.
+
+Obsolete ioctls and capabilities
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+``KVM_CAP_DISABLE_QUIRKS`` does not let userspace know which quirks are
+actually available. Use ``KVM_CHECK_EXTENSION(KVM_CAP_DISABLE_QUIRKS2)``
+instead if available.
+
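+For example, the quirks that can actually be disabled might be queried
+with a sketch like this (``kvm_fd`` is an open ``/dev/kvm``; per the
+capability's definition, the return value is a mask of disable-able
+quirks)::
+
+    int quirks = ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_DISABLE_QUIRKS2);
+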
+Ordering of KVM_GET_*/KVM_SET_* ioctls
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+TBD
diff --git a/Documentation/virt/kvm/index.rst b/Documentation/virt/kvm/index.rst
index b6833c7bb474..e0a2c74e1043 100644
--- a/Documentation/virt/kvm/index.rst
+++ b/Documentation/virt/kvm/index.rst
@@ -8,25 +8,13 @@ KVM
:maxdepth: 2
api
- amd-memory-encryption
- cpuid
- halt-polling
- hypercalls
- locking
- mmu
- msr
- nested-vmx
- ppc-pv
- s390-diag
- s390-pv
- s390-pv-boot
- timekeeping
- vcpu-requests
-
- review-checklist
+ devices/index
arm/index
+ s390/index
+ ppc-pv
+ x86/index
- devices/index
-
- running-nested-guests
+ locking
+ vcpu-requests
+ review-checklist
diff --git a/Documentation/virt/kvm/locking.rst b/Documentation/virt/kvm/locking.rst
index 5d27da356836..845a561629f1 100644
--- a/Documentation/virt/kvm/locking.rst
+++ b/Documentation/virt/kvm/locking.rst
@@ -210,32 +210,47 @@ time it will be set using the Dirty tracking mechanism described above.
3. Reference
------------
-:Name: kvm_lock
+``kvm_lock``
+^^^^^^^^^^^^
+
:Type: mutex
:Arch: any
:Protects: - vm_list
-:Name: kvm_count_lock
+``kvm_count_lock``
+^^^^^^^^^^^^^^^^^^
+
:Type: raw_spinlock_t
:Arch: any
:Protects: - hardware virtualization enable/disable
:Comment: 'raw' because hardware enabling/disabling must be atomic /wrt
migration.
-:Name: kvm_arch::tsc_write_lock
-:Type: raw_spinlock
+``kvm->mn_invalidate_lock``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+:Type: spinlock_t
+:Arch: any
+:Protects: mn_active_invalidate_count, mn_memslots_update_rcuwait
+
+``kvm_arch::tsc_write_lock``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+:Type: raw_spinlock_t
:Arch: x86
:Protects: - kvm_arch::{last_tsc_write,last_tsc_nsec,last_tsc_offset}
- tsc offset in vmcb
:Comment: 'raw' because updating the tsc offsets must not be preempted.
-:Name: kvm->mmu_lock
-:Type: spinlock_t
+``kvm->mmu_lock``
+^^^^^^^^^^^^^^^^^
+
+:Type: spinlock_t or rwlock_t
:Arch: any
:Protects: -shadow page/shadow tlb entry
:Comment: it is a spinlock since it is used in mmu notifier.
-:Name: kvm->srcu
+``kvm->srcu``
+^^^^^^^^^^^^^
+
:Type: srcu lock
:Arch: any
:Protects: - kvm->memslots
@@ -246,10 +261,20 @@ time it will be set using the Dirty tracking mechanism described above.
The srcu index can be stored in kvm_vcpu->srcu_idx per vcpu
if it is needed by multiple functions.
-:Name: blocked_vcpu_on_cpu_lock
+``kvm->slots_arch_lock``
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+:Type: mutex
+:Arch: any (only needed on x86 though)
+:Protects: any arch-specific fields of memslots that have to be modified
+ in a ``kvm->srcu`` read-side critical section.
+:Comment: must be held before reading the pointer to the current memslots,
+ until after all changes to the memslots are complete
+
+``wakeup_vcpus_on_cpu_lock``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
:Type: spinlock_t
:Arch: x86
-:Protects: blocked_vcpu_on_cpu
+:Protects: wakeup_vcpus_on_cpu
:Comment: This is a per-CPU lock and it is used for VT-d posted-interrupts.
When VT-d posted-interrupts is supported and the VM has assigned
devices, we put the blocked vCPU on the list blocked_vcpu_on_cpu
diff --git a/Documentation/virt/kvm/s390/index.rst b/Documentation/virt/kvm/s390/index.rst
new file mode 100644
index 000000000000..605f488f0cc5
--- /dev/null
+++ b/Documentation/virt/kvm/s390/index.rst
@@ -0,0 +1,12 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====================
+KVM for s390 systems
+====================
+
+.. toctree::
+ :maxdepth: 2
+
+ s390-diag
+ s390-pv
+ s390-pv-boot
diff --git a/Documentation/virt/kvm/s390-diag.rst b/Documentation/virt/kvm/s390/s390-diag.rst
index ca85f030eb0b..ca85f030eb0b 100644
--- a/Documentation/virt/kvm/s390-diag.rst
+++ b/Documentation/virt/kvm/s390/s390-diag.rst
diff --git a/Documentation/virt/kvm/s390-pv-boot.rst b/Documentation/virt/kvm/s390/s390-pv-boot.rst
index 73a6083cb5e7..73a6083cb5e7 100644
--- a/Documentation/virt/kvm/s390-pv-boot.rst
+++ b/Documentation/virt/kvm/s390/s390-pv-boot.rst
diff --git a/Documentation/virt/kvm/s390-pv.rst b/Documentation/virt/kvm/s390/s390-pv.rst
index 8e41a3b63fa5..8e41a3b63fa5 100644
--- a/Documentation/virt/kvm/s390-pv.rst
+++ b/Documentation/virt/kvm/s390/s390-pv.rst
diff --git a/Documentation/virt/kvm/vcpu-requests.rst b/Documentation/virt/kvm/vcpu-requests.rst
index b61d48aec36c..db43ee571f5a 100644
--- a/Documentation/virt/kvm/vcpu-requests.rst
+++ b/Documentation/virt/kvm/vcpu-requests.rst
@@ -135,6 +135,16 @@ KVM_REQ_UNHALT
such as a pending signal, which does not indicate the VCPU's halt
emulation should stop, and therefore does not make the request.
+KVM_REQ_OUTSIDE_GUEST_MODE
+
+ This "request" ensures the target vCPU has exited guest mode prior to the
+ sender of the request continuing on. No action needs to be taken by the target,
+ and so no request is actually logged for the target. This request is similar
+ to a "kick", but unlike a kick it guarantees the vCPU has actually exited
+ guest mode. A kick only guarantees the vCPU will exit at some point in the
+ future, e.g. a previous kick may have started the process, but there's no
+ guarantee the to-be-kicked vCPU has fully exited guest mode.
+
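+  A schematic sender-side use, in kernel code (a sketch, not a quote of
+  any particular call site)::
+
+    /* Ensure no vCPU is still executing guest code before proceeding. */
+    kvm_make_all_cpus_request(kvm, KVM_REQ_OUTSIDE_GUEST_MODE);
+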
KVM_REQUEST_MASK
----------------
diff --git a/Documentation/virt/kvm/amd-memory-encryption.rst b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
index 1c6847fff304..1c6847fff304 100644
--- a/Documentation/virt/kvm/amd-memory-encryption.rst
+++ b/Documentation/virt/kvm/x86/amd-memory-encryption.rst
diff --git a/Documentation/virt/kvm/cpuid.rst b/Documentation/virt/kvm/x86/cpuid.rst
index bda3e3e737d7..bda3e3e737d7 100644
--- a/Documentation/virt/kvm/cpuid.rst
+++ b/Documentation/virt/kvm/x86/cpuid.rst
diff --git a/Documentation/virt/kvm/x86/errata.rst b/Documentation/virt/kvm/x86/errata.rst
new file mode 100644
index 000000000000..806f049b6975
--- /dev/null
+++ b/Documentation/virt/kvm/x86/errata.rst
@@ -0,0 +1,39 @@
+
+=======================================
+Known limitations of CPU virtualization
+=======================================
+
+Whenever perfect emulation of a CPU feature is impossible or too hard, KVM
+has to choose between not implementing the feature at all or introducing
+behavioral differences between virtual machines and bare metal systems.
+
+This file documents some of the known limitations that KVM has in
+virtualizing CPU features.
+
+x86
+===
+
+``KVM_GET_SUPPORTED_CPUID`` issues
+----------------------------------
+
+x87 features
+~~~~~~~~~~~~
+
+Unlike most other CPUID feature bits, CPUID[EAX=7,ECX=0]:EBX[6]
+(FDP_EXCPTN_ONLY) and CPUID[EAX=7,ECX=0]:EBX[13] (ZERO_FCS_FDS) are
+clear if the features are present and set if the features are not present.
+
+Clearing these bits in CPUID has no effect on the operation of the guest;
+if these bits are set on hardware, the features will not be present on
+any virtual machine that runs on that hardware.
+
+**Workaround:** It is recommended to always set these bits in guest CPUID.
+Note however that any software (e.g. ``WIN87EM.DLL``) expecting these features
+to be present likely predates these CPUID feature bits, and therefore
+doesn't know to check for them anyway.
+
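+A sketch of that workaround, applied to the entries returned by
+``KVM_GET_SUPPORTED_CPUID`` before they are handed to ``KVM_SET_CPUID2``
+(bit positions as given above)::
+
+    struct kvm_cpuid_entry2 *e;
+    int i;
+
+    for (i = 0; i < cpuid->nent; i++) {
+        e = &cpuid->entries[i];
+        if (e->function == 7 && e->index == 0)
+            e->ebx |= (1 << 6) | (1 << 13);  /* FDP_EXCPTN_ONLY, ZERO_FCS_FDS */
+    }
+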
+Nested virtualization features
+------------------------------
+
+TBD
+
diff --git a/Documentation/virt/kvm/halt-polling.rst b/Documentation/virt/kvm/x86/halt-polling.rst
index 4922e4a15f18..4922e4a15f18 100644
--- a/Documentation/virt/kvm/halt-polling.rst
+++ b/Documentation/virt/kvm/x86/halt-polling.rst
diff --git a/Documentation/virt/kvm/hypercalls.rst b/Documentation/virt/kvm/x86/hypercalls.rst
index e56fa8b9cfca..e56fa8b9cfca 100644
--- a/Documentation/virt/kvm/hypercalls.rst
+++ b/Documentation/virt/kvm/x86/hypercalls.rst
diff --git a/Documentation/virt/kvm/x86/index.rst b/Documentation/virt/kvm/x86/index.rst
new file mode 100644
index 000000000000..7ff588826b9f
--- /dev/null
+++ b/Documentation/virt/kvm/x86/index.rst
@@ -0,0 +1,19 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===================
+KVM for x86 systems
+===================
+
+.. toctree::
+ :maxdepth: 2
+
+ amd-memory-encryption
+ cpuid
+ errata
+ halt-polling
+ hypercalls
+ mmu
+ msr
+ nested-vmx
+ running-nested-guests
+ timekeeping
diff --git a/Documentation/virt/kvm/mmu.rst b/Documentation/virt/kvm/x86/mmu.rst
index 5b1ebad24c77..5b1ebad24c77 100644
--- a/Documentation/virt/kvm/mmu.rst
+++ b/Documentation/virt/kvm/x86/mmu.rst
diff --git a/Documentation/virt/kvm/msr.rst b/Documentation/virt/kvm/x86/msr.rst
index 9315fc385fb0..9315fc385fb0 100644
--- a/Documentation/virt/kvm/msr.rst
+++ b/Documentation/virt/kvm/x86/msr.rst
diff --git a/Documentation/virt/kvm/nested-vmx.rst b/Documentation/virt/kvm/x86/nested-vmx.rst
index ac2095d41f02..ac2095d41f02 100644
--- a/Documentation/virt/kvm/nested-vmx.rst
+++ b/Documentation/virt/kvm/x86/nested-vmx.rst
diff --git a/Documentation/virt/kvm/running-nested-guests.rst b/Documentation/virt/kvm/x86/running-nested-guests.rst
index bd70c69468ae..bd70c69468ae 100644
--- a/Documentation/virt/kvm/running-nested-guests.rst
+++ b/Documentation/virt/kvm/x86/running-nested-guests.rst
diff --git a/Documentation/virt/kvm/timekeeping.rst b/Documentation/virt/kvm/x86/timekeeping.rst
index 21ae7efa29ba..21ae7efa29ba 100644
--- a/Documentation/virt/kvm/timekeeping.rst
+++ b/Documentation/virt/kvm/x86/timekeeping.rst
diff --git a/Documentation/virt/uml/user_mode_linux_howto_v2.rst b/Documentation/virt/uml/user_mode_linux_howto_v2.rst
index d5ad96c795f6..863f67b72c05 100644
--- a/Documentation/virt/uml/user_mode_linux_howto_v2.rst
+++ b/Documentation/virt/uml/user_mode_linux_howto_v2.rst
@@ -1193,6 +1193,26 @@ E.g. ``os_close_file()`` is just a wrapper around ``close()``
which ensures that the userspace function close does not clash
with similarly named function(s) in the kernel part.
+Using UML as a Test Platform
+============================
+
+UML is an excellent test platform for device driver development. As
+with most things UML, "some user assembly may be required". It is
+up to the user to build their emulation environment. UML at present
+provides only the kernel infrastructure.
+
+Part of this infrastructure is the ability to load and parse fdt
+device tree blobs as used in Arm or Open Firmware platforms. These
+are supplied as an optional extra argument to the kernel command
+line::
+
+ dtb=filename
+
+The device tree is loaded and parsed at boot time and is accessible by
+drivers which query it. At present this facility is intended solely for
+development purposes. UML's own devices do not query the device tree.
+
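+For example (file names illustrative), a blob can be compiled from device
+tree source with ``dtc`` and handed to the UML binary::
+
+  $ dtc -I dts -O dtb -o board.dtb board.dts
+  $ ./linux mem=256M dtb=board.dtb
+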
Security Considerations
-----------------------
diff --git a/Documentation/vm/page_owner.rst b/Documentation/vm/page_owner.rst
index c4de6f8dabe9..65204d7f004f 100644
--- a/Documentation/vm/page_owner.rst
+++ b/Documentation/vm/page_owner.rst
@@ -125,7 +125,6 @@ Usage
additional function:
Cull:
- -c Cull by comparing stacktrace instead of total block.
--cull <rules>
Specify culling rules.Culling syntax is key[,key[,...]].Choose a
multi-letter key from the **STANDARD FORMAT SPECIFIERS** section.
diff --git a/Documentation/vm/unevictable-lru.rst b/Documentation/vm/unevictable-lru.rst
index eae3af17f2d9..b280367d6a44 100644
--- a/Documentation/vm/unevictable-lru.rst
+++ b/Documentation/vm/unevictable-lru.rst
@@ -52,8 +52,13 @@ The infrastructure may also be able to handle other conditions that make pages
unevictable, either by definition or by circumstance, in the future.
-The Unevictable Page List
--------------------------
+The Unevictable LRU Page List
+-----------------------------
+
+The Unevictable LRU page list is a lie. It was never an LRU-ordered list, but a
+companion to the LRU-ordered anonymous and file, active and inactive page lists;
+and now it is not even a page list. But following familiar convention, here in
+this document and in the source, we often imagine it as a fifth LRU page list.
The Unevictable LRU infrastructure consists of an additional, per-node, LRU list
called the "unevictable" list and an associated page flag, PG_unevictable, to
@@ -63,8 +68,8 @@ The PG_unevictable flag is analogous to, and mutually exclusive with, the
PG_active flag in that it indicates on which LRU list a page resides when
PG_lru is set.
-The Unevictable LRU infrastructure maintains unevictable pages on an additional
-LRU list for a few reasons:
+The Unevictable LRU infrastructure maintains unevictable pages as if they were
+on an additional LRU list for a few reasons:
(1) We get to "treat unevictable pages just like we treat other pages in the
system - which means we get to use the same code to manipulate them, the
@@ -72,13 +77,11 @@ LRU list for a few reasons:
of the statistics, etc..." [Rik van Riel]
(2) We want to be able to migrate unevictable pages between nodes for memory
- defragmentation, workload management and memory hotplug. The linux kernel
+ defragmentation, workload management and memory hotplug. The Linux kernel
can only migrate pages that it can successfully isolate from the LRU
- lists. If we were to maintain pages elsewhere than on an LRU-like list,
- where they can be found by isolate_lru_page(), we would prevent their
- migration, unless we reworked migration code to find the unevictable pages
- itself.
-
+ lists (or "Movable" pages: outside of consideration here). If we were to
+ maintain pages elsewhere than on an LRU-like list, where they can be
+ detected by isolate_lru_page(), we would prevent their migration.
The unevictable list does not differentiate between file-backed and anonymous,
swap-backed pages. This differentiation is only important while the pages are,
@@ -92,8 +95,8 @@ Memory Control Group Interaction
--------------------------------
The unevictable LRU facility interacts with the memory control group [aka
-memory controller; see Documentation/admin-guide/cgroup-v1/memory.rst] by extending the
-lru_list enum.
+memory controller; see Documentation/admin-guide/cgroup-v1/memory.rst] by
+extending the lru_list enum.
The memory controller data structure automatically gets a per-node unevictable
list as a result of the "arrayification" of the per-node LRU lists (one per
@@ -143,7 +146,6 @@ These are currently used in three places in the kernel:
and this mark remains for the life of the inode.
(2) By SYSV SHM to mark SHM_LOCK'd address spaces until SHM_UNLOCK is called.
-
Note that SHM_LOCK is not required to page in the locked pages if they're
swapped out; the application must touch the pages manually if it wants to
ensure they're in memory.
@@ -156,19 +158,19 @@ These are currently used in three places in the kernel:
Detecting Unevictable Pages
---------------------------
-The function page_evictable() in vmscan.c determines whether a page is
+The function page_evictable() in mm/internal.h determines whether a page is
evictable or not using the query function outlined above [see section
:ref:`Marking address spaces unevictable <mark_addr_space_unevict>`]
to check the AS_UNEVICTABLE flag.
For address spaces that are so marked after being populated (as SHM regions
-might be), the lock action (eg: SHM_LOCK) can be lazy, and need not populate
+might be), the lock action (e.g. SHM_LOCK) can be lazy, and need not populate
the page tables for the region as does, for example, mlock(), nor need it make
any special effort to push any pages in the SHM_LOCK'd area to the unevictable
list. Instead, vmscan will do this if and when it encounters the pages during
a reclamation scan.
-On an unlock action (such as SHM_UNLOCK), the unlocker (eg: shmctl()) must scan
+On an unlock action (such as SHM_UNLOCK), the unlocker (e.g. shmctl()) must scan
the pages in the region and "rescue" them from the unevictable list if no other
condition is keeping them unevictable. If an unevictable region is destroyed,
the pages are also "rescued" from the unevictable list in the process of
@@ -176,7 +178,7 @@ freeing them.
page_evictable() also checks for mlocked pages by testing an additional page
flag, PG_mlocked (as wrapped by PageMlocked()), which is set when a page is
-faulted into a VM_LOCKED vma, or found in a vma being VM_LOCKED.
+faulted into a VM_LOCKED VMA, or found in a VMA being VM_LOCKED.
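+
+Schematically, the two checks combine like this (a sketch, not the exact
+kernel source)::
+
+    static inline bool page_evictable(struct page *page)
+    {
+        /* evictable unless the mapping or the page itself forbids it */
+        return !mapping_unevictable(page_mapping(page)) &&
+               !PageMlocked(page);
+    }
+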
Vmscan's Handling of Unevictable Pages
@@ -186,28 +188,23 @@ If unevictable pages are culled in the fault path, or moved to the unevictable
list at mlock() or mmap() time, vmscan will not encounter the pages until they
have become evictable again (via munlock() for example) and have been "rescued"
from the unevictable list. However, there may be situations where we decide,
-for the sake of expediency, to leave a unevictable page on one of the regular
+for the sake of expediency, to leave an unevictable page on one of the regular
active/inactive LRU lists for vmscan to deal with. vmscan checks for such
pages in all of the shrink_{active|inactive|page}_list() functions and will
"cull" such pages that it encounters: that is, it diverts those pages to the
-unevictable list for the node being scanned.
+unevictable list for the memory cgroup and node being scanned.
There may be situations where a page is mapped into a VM_LOCKED VMA, but the
page is not marked as PG_mlocked. Such pages will make it all the way to
-shrink_page_list() where they will be detected when vmscan walks the reverse
-map in try_to_unmap(). If try_to_unmap() returns SWAP_MLOCK,
-shrink_page_list() will cull the page at that point.
+shrink_active_list() or shrink_page_list() where they will be detected when
+vmscan walks the reverse map in page_referenced() or try_to_unmap(). The page
+is culled to the unevictable list when it is released by the shrinker.
To "cull" an unevictable page, vmscan simply puts the page back on the LRU list
using putback_lru_page() - the inverse operation to isolate_lru_page() - after
dropping the page lock. Because the condition which makes the page unevictable
-may change once the page is unlocked, putback_lru_page() will recheck the
-unevictable state of a page that it places on the unevictable list. If the
-page has become unevictable, putback_lru_page() removes it from the list and
-retries, including the page_unevictable() test. Because such a race is a rare
-event and movement of pages onto the unevictable list should be rare, these
-extra evictabilty checks should not occur in the majority of calls to
-putback_lru_page().
+may change once the page is unlocked, __pagevec_lru_add_fn() will recheck the
+unevictable state of a page before placing it on the unevictable list.
MLOCKED Pages
@@ -227,16 +224,25 @@ Nick posted his patch as an alternative to a patch posted by Christoph Lameter
to achieve the same objective: hiding mlocked pages from vmscan.
In Nick's patch, he used one of the struct page LRU list link fields as a count
-of VM_LOCKED VMAs that map the page. This use of the link field for a count
-prevented the management of the pages on an LRU list, and thus mlocked pages
-were not migratable as isolate_lru_page() could not find them, and the LRU list
-link field was not available to the migration subsystem.
+of VM_LOCKED VMAs that map the page (Rik van Riel had the same idea three years
+earlier). But this use of the link field for a count prevented the management
+of the pages on an LRU list, and thus mlocked pages were not migratable as
+isolate_lru_page() could not detect them, and the LRU list link field was not
+available to the migration subsystem.
-Nick resolved this by putting mlocked pages back on the lru list before
+Nick resolved this by putting mlocked pages back on the LRU list before
attempting to isolate them, thus abandoning the count of VM_LOCKED VMAs. When
Nick's patch was integrated with the Unevictable LRU work, the count was
-replaced by walking the reverse map to determine whether any VM_LOCKED VMAs
-mapped the page. More on this below.
+replaced by walking the reverse map when munlocking, to determine whether any
+other VM_LOCKED VMAs still mapped the page.
+
+However, walking the reverse map for each page when munlocking was ugly and
+inefficient, and could lead to catastrophic contention on a file's rmap lock,
+when many processes which had it mlocked were trying to exit. In 5.18, the
+idea of keeping mlock_count in the Unevictable LRU list link field was
+revived and put to work, without preventing the migration of mlocked
+pages. This is why
+the "Unevictable LRU list" cannot be a linked list of pages now; but there was
+no use for that linked list anyway - though its size is maintained for meminfo.
Basic Management
@@ -250,22 +256,18 @@ PageMlocked() functions.
A PG_mlocked page will be placed on the unevictable list when it is added to
the LRU. Such pages can be "noticed" by memory management in several places:
- (1) in the mlock()/mlockall() system call handlers;
+ (1) in the mlock()/mlock2()/mlockall() system call handlers;
(2) in the mmap() system call handler when mmapping a region with the
MAP_LOCKED flag;
(3) mmapping a region in a task that has called mlockall() with the MCL_FUTURE
- flag
+ flag;
- (4) in the fault path, if mlocked pages are "culled" in the fault path,
- and when a VM_LOCKED stack segment is expanded; or
+ (4) in the fault path and when a VM_LOCKED stack segment is expanded; or
(5) as mentioned above, in vmscan:shrink_page_list() when attempting to
- reclaim a page in a VM_LOCKED VMA via try_to_unmap()
-
-all of which result in the VM_LOCKED flag being set for the VMA if it doesn't
-already have it set.
+ reclaim a page in a VM_LOCKED VMA by page_referenced() or try_to_unmap().
mlocked pages become unlocked and rescued from the unevictable list when:
@@ -280,51 +282,53 @@ mlocked pages become unlocked and rescued from the unevictable list when:
(4) before a page is COW'd in a VM_LOCKED VMA.
-mlock()/mlockall() System Call Handling
----------------------------------------
+mlock()/mlock2()/mlockall() System Call Handling
+------------------------------------------------
-Both [do\_]mlock() and [do\_]mlockall() system call handlers call mlock_fixup()
+mlock(), mlock2() and mlockall() system call handlers proceed to mlock_fixup()
for each VMA in the range specified by the call. In the case of mlockall(),
this is the entire active address space of the task. Note that mlock_fixup()
is used for both mlocking and munlocking a range of memory. A call to mlock()
-an already VM_LOCKED VMA, or to munlock() a VMA that is not VM_LOCKED is
-treated as a no-op, and mlock_fixup() simply returns.
+an already VM_LOCKED VMA, or to munlock() a VMA that is not VM_LOCKED, is
+treated as a no-op and mlock_fixup() simply returns.
-If the VMA passes some filtering as described in "Filtering Special Vmas"
+If the VMA passes some filtering as described in "Filtering Special VMAs"
below, mlock_fixup() will attempt to merge the VMA with its neighbors or split
-off a subset of the VMA if the range does not cover the entire VMA. Once the
-VMA has been merged or split or neither, mlock_fixup() will call
-populate_vma_page_range() to fault in the pages via get_user_pages() and to
-mark the pages as mlocked via mlock_vma_page().
+off a subset of the VMA if the range does not cover the entire VMA. Any pages
+already present in the VMA are then marked as mlocked by mlock_page() via
+mlock_pte_range() via walk_page_range() via mlock_vma_pages_range().
+
+Before returning from the system call, do_mlock() or mlockall() will call
+__mm_populate() to fault in the remaining pages via get_user_pages() and to
+mark those pages as mlocked as they are faulted.
Note that the VMA being mlocked might be mapped with PROT_NONE. In this case,
get_user_pages() will be unable to fault in the pages. That's okay. If pages
-do end up getting faulted into this VM_LOCKED VMA, we'll handle them in the
-fault path or in vmscan.
-
-Also note that a page returned by get_user_pages() could be truncated or
-migrated out from under us, while we're trying to mlock it. To detect this,
-populate_vma_page_range() checks page_mapping() after acquiring the page lock.
-If the page is still associated with its mapping, we'll go ahead and call
-mlock_vma_page(). If the mapping is gone, we just unlock the page and move on.
-In the worst case, this will result in a page mapped in a VM_LOCKED VMA
-remaining on a normal LRU list without being PageMlocked(). Again, vmscan will
-detect and cull such pages.
-
-mlock_vma_page() will call TestSetPageMlocked() for each page returned by
-get_user_pages(). We use TestSetPageMlocked() because the page might already
-be mlocked by another task/VMA and we don't want to do extra work. We
-especially do not want to count an mlocked page more than once in the
-statistics. If the page was already mlocked, mlock_vma_page() need do nothing
-more.
-
-If the page was NOT already mlocked, mlock_vma_page() attempts to isolate the
-page from the LRU, as it is likely on the appropriate active or inactive list
-at that time. If the isolate_lru_page() succeeds, mlock_vma_page() will put
-back the page - by calling putback_lru_page() - which will notice that the page
-is now mlocked and divert the page to the node's unevictable list. If
-mlock_vma_page() is unable to isolate the page from the LRU, vmscan will handle
-it later if and when it attempts to reclaim the page.
+do end up getting faulted into this VM_LOCKED VMA, they will be handled in the
+fault path - which is also how mlock2()'s MLOCK_ONFAULT areas are handled.
+
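+(For reference, a userspace MLOCK_ONFAULT request is simply a sketch like
+this, with ``addr`` and ``len`` assumed)::
+
+    #include <sys/mman.h>
+
+    /* Lock pages of [addr, addr+len) only as they are faulted in. */
+    if (mlock2(addr, len, MLOCK_ONFAULT))
+        perror("mlock2");
+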
+For each PTE (or PMD) being faulted into a VMA, the page add rmap function
+calls mlock_vma_page(), which calls mlock_page() when the VMA is VM_LOCKED
+(unless it is a PTE mapping of a part of a transparent huge page). Or when
+it is a newly allocated anonymous page, lru_cache_add_inactive_or_unevictable()
+calls mlock_new_page() instead: similar to mlock_page(), but can make better
+judgments, since this page is held exclusively and known not to be on LRU yet.
+
+mlock_page() sets PageMlocked immediately, then places the page on the CPU's
+mlock pagevec, to batch up the rest of the work to be done under lru_lock by
+__mlock_page(). __mlock_page() sets PageUnevictable, initializes mlock_count
+and moves the page to unevictable state ("the unevictable LRU", but with
+mlock_count in place of LRU threading). Or if the page was already PageLRU
+and PageUnevictable and PageMlocked, it simply increments the mlock_count.
+
+But in practice that may not work ideally: the page may not yet be on an LRU, or
+it may have been temporarily isolated from LRU. In such cases the mlock_count
+field cannot be touched, but will be set to 0 later when __pagevec_lru_add_fn()
+returns the page to "LRU". Races prohibit mlock_count from being set to 1 then:
+rather than risk stranding a page indefinitely as unevictable, always err with
+mlock_count on the low side, so that when munlocked the page will be rescued to
+an evictable LRU, then perhaps be mlocked again later if vmscan finds it in a
+VM_LOCKED VMA.
Filtering Special VMAs
@@ -339,68 +343,48 @@ mlock_fixup() filters several classes of "special" VMAs:
so there is no sense in attempting to visit them.
2) VMAs mapping hugetlbfs page are already effectively pinned into memory. We
- neither need nor want to mlock() these pages. However, to preserve the
- prior behavior of mlock() - before the unevictable/mlock changes -
- mlock_fixup() will call make_pages_present() in the hugetlbfs VMA range to
- allocate the huge pages and populate the ptes.
+ neither need nor want to mlock() these pages. But __mm_populate() includes
+ hugetlbfs ranges, allocating the huge pages and populating the PTEs.
3) VMAs with VM_DONTEXPAND are generally userspace mappings of kernel pages,
- such as the VDSO page, relay channel pages, etc. These pages
- are inherently unevictable and are not managed on the LRU lists.
- mlock_fixup() treats these VMAs the same as hugetlbfs VMAs. It calls
- make_pages_present() to populate the ptes.
+ such as the VDSO page, relay channel pages, etc. These pages are inherently
+ unevictable and are not managed on the LRU lists. __mm_populate() includes
+ these ranges, populating the PTEs if not already populated.
+
+4) VMAs with VM_MIXEDMAP set are not marked VM_LOCKED, but __mm_populate()
+ includes these ranges, populating the PTEs if not already populated.
Note that for all of these special VMAs, mlock_fixup() does not set the
VM_LOCKED flag. Therefore, we won't have to deal with them later during
munlock(), munmap() or task exit. Neither does mlock_fixup() account these
VMAs against the task's "locked_vm".
-.. _munlock_munlockall_handling:
munlock()/munlockall() System Call Handling
-------------------------------------------
-The munlock() and munlockall() system calls are handled by the same functions -
-do_mlock[all]() - as the mlock() and mlockall() system calls with the unlock vs
-lock operation indicated by an argument. So, these system calls are also
-handled by mlock_fixup(). Again, if called for an already munlocked VMA,
-mlock_fixup() simply returns. Because of the VMA filtering discussed above,
-VM_LOCKED will not be set in any "special" VMAs. So, these VMAs will be
-ignored for munlock.
+The munlock() and munlockall() system calls are handled by the same
+mlock_fixup() function as mlock(), mlock2() and mlockall() system calls are.
+If called to munlock an already munlocked VMA, mlock_fixup() simply returns.
+Because of the VMA filtering discussed above, VM_LOCKED will not be set in
+any "special" VMAs. So, those VMAs will be ignored for munlock.
If the VMA is VM_LOCKED, mlock_fixup() again attempts to merge or split off the
-specified range. The range is then munlocked via the function
-populate_vma_page_range() - the same function used to mlock a VMA range -
-passing a flag to indicate that munlock() is being performed.
-
-Because the VMA access protections could have been changed to PROT_NONE after
-faulting in and mlocking pages, get_user_pages() was unreliable for visiting
-these pages for munlocking. Because we don't want to leave pages mlocked,
-get_user_pages() was enhanced to accept a flag to ignore the permissions when
-fetching the pages - all of which should be resident as a result of previous
-mlocking.
-
-For munlock(), populate_vma_page_range() unlocks individual pages by calling
-munlock_vma_page(). munlock_vma_page() unconditionally clears the PG_mlocked
-flag using TestClearPageMlocked(). As with mlock_vma_page(),
-munlock_vma_page() use the Test*PageMlocked() function to handle the case where
-the page might have already been unlocked by another task. If the page was
-mlocked, munlock_vma_page() updates that zone statistics for the number of
-mlocked pages. Note, however, that at this point we haven't checked whether
-the page is mapped by other VM_LOCKED VMAs.
-
-We can't call page_mlock(), the function that walks the reverse map to
-check for other VM_LOCKED VMAs, without first isolating the page from the LRU.
-page_mlock() is a variant of try_to_unmap() and thus requires that the page
-not be on an LRU list [more on these below]. However, the call to
-isolate_lru_page() could fail, in which case we can't call page_mlock(). So,
-we go ahead and clear PG_mlocked up front, as this might be the only chance we
-have. If we can successfully isolate the page, we go ahead and call
-page_mlock(), which will restore the PG_mlocked flag and update the zone
-page statistics if it finds another VMA holding the page mlocked. If we fail
-to isolate the page, we'll have left a potentially mlocked page on the LRU.
-This is fine, because we'll catch it later if and if vmscan tries to reclaim
-the page. This should be relatively rare.
+specified range. All pages in the VMA are then munlocked by munlock_page() via
+mlock_pte_range() via walk_page_range() via mlock_vma_pages_range() - the same
+function used when mlocking a VMA range, with new flags for the VMA indicating
+that it is munlock() being performed.
+
+munlock_page() uses the mlock pagevec to batch up work to be done under
+lru_lock by __munlock_page(). __munlock_page() decrements the page's
+mlock_count, and when that reaches 0 it clears PageMlocked and clears
+PageUnevictable, moving the page from unevictable state to inactive LRU.
+
+But in practice that may not work ideally: the page may not yet have reached
+"the unevictable LRU", or it may have been temporarily isolated from it. In
+those cases its mlock_count field is unusable and must be assumed to be 0: so
+that the page will be rescued to an evictable LRU, then perhaps be mlocked
+again later if vmscan finds it in a VM_LOCKED VMA.
Migrating MLOCKED Pages
@@ -410,33 +394,38 @@ A page that is being migrated has been isolated from the LRU lists and is held
locked across unmapping of the page, updating the page's address space entry
and copying the contents and state, until the page table entry has been
replaced with an entry that refers to the new page. Linux supports migration
-of mlocked pages and other unevictable pages. This involves simply moving the
-PG_mlocked and PG_unevictable states from the old page to the new page.
+of mlocked pages and other unevictable pages. PG_mlocked is cleared from
+the old page when it is unmapped from the last VM_LOCKED VMA, and set when
+the new page is mapped in place of the migration entry in a VM_LOCKED VMA.
+If the page was unevictable because mlocked, PG_unevictable follows
+PG_mlocked; but if the page was unevictable for other reasons,
+PG_unevictable is copied explicitly.
Note that page migration can race with mlocking or munlocking of the same page.
-This has been discussed from the mlock/munlock perspective in the respective
-sections above. Both processes (migration and m[un]locking) hold the page
-locked. This provides the first level of synchronization. Page migration
-zeros out the page_mapping of the old page before unlocking it, so m[un]lock
-can skip these pages by testing the page mapping under page lock.
+There is mostly no problem since page migration requires unmapping all PTEs of
+the old page (including munlock where VM_LOCKED), then mapping in the new page
+(including mlock where VM_LOCKED). The page table locks provide sufficient
+synchronization.
-To complete page migration, we place the new and old pages back onto the LRU
-after dropping the page lock. The "unneeded" page - old page on success, new
-page on failure - will be freed when the reference count held by the migration
-process is released. To ensure that we don't strand pages on the unevictable
-list because of a race between munlock and migration, page migration uses the
-putback_lru_page() function to add migrated pages back to the LRU.
+However, since mlock_vma_pages_range() starts by setting VM_LOCKED on a VMA,
+before mlocking any pages already present, if one of those pages were migrated
+before mlock_pte_range() reached it, it would get counted twice in mlock_count.
+To prevent that, mlock_vma_pages_range() temporarily marks the VMA as VM_IO,
+so that mlock_vma_page() will skip it.
+
+To complete page migration, we place the old and new pages back onto the LRU
+afterwards. The "unneeded" page - old page on success, new page on failure -
+is freed when the reference count held by the migration process is released.
Compacting MLOCKED Pages
------------------------
-The unevictable LRU can be scanned for compactable regions and the default
-behavior is to do so. /proc/sys/vm/compact_unevictable_allowed controls
-this behavior (see Documentation/admin-guide/sysctl/vm.rst). Once scanning of the
-unevictable LRU is enabled, the work of compaction is mostly handled by
-the page migration code and the same work flow as described in MIGRATING
-MLOCKED PAGES will apply.
+The memory map can be scanned for compactable regions and the default behavior
+is to let unevictable pages be moved. /proc/sys/vm/compact_unevictable_allowed
+controls this behavior (see Documentation/admin-guide/sysctl/vm.rst). The work
+of compaction is mostly handled by the page migration code and the same work
+flow as described in Migrating MLOCKED Pages will apply.
+
MLOCKING Transparent Huge Pages
-------------------------------
@@ -445,51 +434,44 @@ A transparent huge page is represented by a single entry on an LRU list.
Therefore, we can only make unevictable an entire compound page, not
individual subpages.
-If a user tries to mlock() part of a huge page, we want the rest of the
-page to be reclaimable.
+If a user tries to mlock() part of a huge page, and no user mlock()s the
+whole of the huge page, we want the rest of the page to be reclaimable.
We cannot just split the page on partial mlock() as split_huge_page() can
-fail and new intermittent failure mode for the syscall is undesirable.
+fail and a new intermittent failure mode for the syscall is undesirable.
-We handle this by keeping PTE-mapped huge pages on normal LRU lists: the
-PMD on border of VM_LOCKED VMA will be split into PTE table.
+We handle this by keeping PTE-mlocked huge pages on evictable LRU lists:
+the PMD on the border of a VM_LOCKED VMA will be split into a PTE table.
-This way the huge page is accessible for vmscan. Under memory pressure the
+This way the huge page is accessible for vmscan. Under memory pressure the
page will be split, subpages which belong to VM_LOCKED VMAs will be moved
-to unevictable LRU and the rest can be reclaimed.
+to the unevictable LRU and the rest can be reclaimed.
+
+/proc/meminfo's Unevictable and Mlocked amounts do not include those parts
+of a transparent huge page which are mapped only by PTEs in VM_LOCKED VMAs.
-See also comment in follow_trans_huge_pmd().
mmap(MAP_LOCKED) System Call Handling
-------------------------------------
-In addition the mlock()/mlockall() system calls, an application can request
-that a region of memory be mlocked supplying the MAP_LOCKED flag to the mmap()
-call. There is one important and subtle difference here, though. mmap() + mlock()
-will fail if the range cannot be faulted in (e.g. because mm_populate fails)
-and returns with ENOMEM while mmap(MAP_LOCKED) will not fail. The mmaped
-area will still have properties of the locked area - aka. pages will not get
-swapped out - but major page faults to fault memory in might still happen.
+In addition to the mlock(), mlock2() and mlockall() system calls, an application
+can request that a region of memory be mlocked by supplying the MAP_LOCKED flag
+to the mmap() call. There is one important and subtle difference here, though.
+mmap() + mlock() will fail if the range cannot be faulted in (e.g. because
+mm_populate fails) and returns with ENOMEM while mmap(MAP_LOCKED) will not fail.
+The mmaped area will still have properties of the locked area - pages will not
+get swapped out - but major page faults to fault memory in might still happen.
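+
+Sketched from userspace (illustrative only; ``len`` assumed, error
+handling abbreviated)::
+
+    void *p;
+
+    /* mmap() + mlock(): mlock() fails with ENOMEM if the pages cannot
+     * all be faulted in and locked. */
+    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
+             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+    if (mlock(p, len))
+        perror("mlock");
+
+    /* mmap(MAP_LOCKED): the mapping succeeds even if population fails;
+     * major faults may still occur later. */
+    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
+             MAP_PRIVATE | MAP_ANONYMOUS | MAP_LOCKED, -1, 0);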
-Furthermore, any mmap() call or brk() call that expands the heap by a
-task that has previously called mlockall() with the MCL_FUTURE flag will result
+Furthermore, any mmap() call or brk() call that expands the heap by a task
+that has previously called mlockall() with the MCL_FUTURE flag will result
in the newly mapped memory being mlocked. Before the unevictable/mlock
-changes, the kernel simply called make_pages_present() to allocate pages and
-populate the page table.
+changes, the kernel simply called make_pages_present() to allocate pages
+and populate the page table.
-To mlock a range of memory under the unevictable/mlock infrastructure, the
-mmap() handler and task address space expansion functions call
+To mlock a range of memory under the unevictable/mlock infrastructure,
+the mmap() handler and task address space expansion functions call
populate_vma_page_range() specifying the vma and the address range to mlock.
-The callers of populate_vma_page_range() will have already added the memory range
-to be mlocked to the task's "locked_vm". To account for filtered VMAs,
-populate_vma_page_range() returns the number of pages NOT mlocked. All of the
-callers then subtract a non-negative return value from the task's locked_vm. A
-negative return value represent an error - for example, from get_user_pages()
-attempting to fault in a VMA with PROT_NONE access. In this case, we leave the
-memory range accounted as locked_vm, as the protections could be changed later
-and pages allocated into that region.
-
munmap()/exit()/exec() System Call Handling
-------------------------------------------
@@ -500,81 +482,53 @@ munlock the pages if we're removing the last VM_LOCKED VMA that maps the pages.
Before the unevictable/mlock changes, mlocking did not mark the pages in any
way, so unmapping them required no processing.
-To munlock a range of memory under the unevictable/mlock infrastructure, the
-munmap() handler and task address space call tear down function
-munlock_vma_pages_all(). The name reflects the observation that one always
-specifies the entire VMA range when munlock()ing during unmap of a region.
-Because of the VMA filtering when mlocking() regions, only "normal" VMAs that
-actually contain mlocked pages will be passed to munlock_vma_pages_all().
-
-munlock_vma_pages_all() clears the VM_LOCKED VMA flag and, like mlock_fixup()
-for the munlock case, calls __munlock_vma_pages_range() to walk the page table
-for the VMA's memory range and munlock_vma_page() each resident page mapped by
-the VMA. This effectively munlocks the page, only if this is the last
-VM_LOCKED VMA that maps the page.
-
-
-try_to_unmap()
---------------
-
-Pages can, of course, be mapped into multiple VMAs. Some of these VMAs may
-have VM_LOCKED flag set. It is possible for a page mapped into one or more
-VM_LOCKED VMAs not to have the PG_mlocked flag set and therefore reside on one
-of the active or inactive LRU lists. This could happen if, for example, a task
-in the process of munlocking the page could not isolate the page from the LRU.
-As a result, vmscan/shrink_page_list() might encounter such a page as described
-in section "vmscan's handling of unevictable pages". To handle this situation,
-try_to_unmap() checks for VM_LOCKED VMAs while it is walking a page's reverse
-map.
-
-try_to_unmap() is always called, by either vmscan for reclaim or for page
-migration, with the argument page locked and isolated from the LRU. Separate
-functions handle anonymous and mapped file and KSM pages, as these types of
-pages have different reverse map lookup mechanisms, with different locking.
-In each case, whether rmap_walk_anon() or rmap_walk_file() or rmap_walk_ksm(),
-it will call try_to_unmap_one() for every VMA which might contain the page.
-
-When trying to reclaim, if try_to_unmap_one() finds the page in a VM_LOCKED
-VMA, it will then mlock the page via mlock_vma_page() instead of unmapping it,
-and return SWAP_MLOCK to indicate that the page is unevictable: and the scan
-stops there.
-
-mlock_vma_page() is called while holding the page table's lock (in addition
-to the page lock, and the rmap lock): to serialize against concurrent mlock or
-munlock or munmap system calls, mm teardown (munlock_vma_pages_all), reclaim,
-holepunching, and truncation of file pages and their anonymous COWed pages.
-
-
-page_mlock() Reverse Map Scan
----------------------------------
-
-When munlock_vma_page() [see section :ref:`munlock()/munlockall() System Call
-Handling <munlock_munlockall_handling>` above] tries to munlock a
-page, it needs to determine whether or not the page is mapped by any
-VM_LOCKED VMA without actually attempting to unmap all PTEs from the
-page. For this purpose, the unevictable/mlock infrastructure
-introduced a variant of try_to_unmap() called page_mlock().
-
-page_mlock() walks the respective reverse maps looking for VM_LOCKED VMAs. When
-such a VMA is found the page is mlocked via mlock_vma_page(). This undoes the
-pre-clearing of the page's PG_mlocked done by munlock_vma_page.
-
-Note that page_mlock()'s reverse map walk must visit every VMA in a page's
-reverse map to determine that a page is NOT mapped into any VM_LOCKED VMA.
-However, the scan can terminate when it encounters a VM_LOCKED VMA.
-Although page_mlock() might be called a great many times when munlocking a
-large region or tearing down a large address space that has been mlocked via
-mlockall(), overall this is a fairly rare event.
+For each PTE (or PMD) being unmapped from a VMA, page_remove_rmap() calls
+munlock_vma_page(), which calls munlock_page() when the VMA is VM_LOCKED
+(unless it was a PTE mapping of a part of a transparent huge page).
+
+munlock_page() uses the mlock pagevec to batch up work to be done under
+lru_lock by __munlock_page(). __munlock_page() decrements the page's
+mlock_count, and when that reaches 0 it clears PageMlocked and clears
+PageUnevictable, moving the page from unevictable state to inactive LRU.
+
+But in practice that may not work ideally: the page may not yet have reached
+"the unevictable LRU", or it may have been temporarily isolated from it. In
+those cases its mlock_count field is unusable and must be assumed to be 0: so
+that the page will be rescued to an evictable LRU, then perhaps be mlocked
+again later if vmscan finds it in a VM_LOCKED VMA.
+
+
+Truncating MLOCKED Pages
+------------------------
+
+File truncation or hole punching forcibly unmaps the deleted pages from
+userspace; truncation even unmaps and deletes any private anonymous pages
+which had been Copied-On-Write from the file pages now being truncated.
+
+Mlocked pages can be munlocked and deleted in this way: like with munmap(),
+for each PTE (or PMD) being unmapped from a VMA, page_remove_rmap() calls
+munlock_vma_page(), which calls munlock_page() when the VMA is VM_LOCKED
+(unless it was a PTE mapping of a part of a transparent huge page).
+
+However, if there is a racing munlock(), since mlock_vma_pages_range() starts
+munlocking by clearing VM_LOCKED from a VMA, before munlocking all the pages
+present, if one of those pages were unmapped by truncation or hole punch before
+mlock_pte_range() reached it, it would not be recognized as mlocked by this VMA,
+and would not be counted out of mlock_count. In this rare case, a page may
+still appear as PageMlocked after it has been fully unmapped: and it is left to
+release_pages() (or __page_cache_release()) to clear it and update statistics
+before freeing (this event is counted in /proc/vmstat unevictable_pgs_cleared,
+which is usually 0).
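+
+(The hole punch referred to above is the ordinary ``fallocate()`` one;
+an illustrative userspace sketch, with ``fd``, ``offset`` and ``len``
+assumed)::
+
+    #define _GNU_SOURCE
+    #include <fcntl.h>
+
+    /* Deallocate [offset, offset+len) without changing the file size;
+     * pages mapping that range, mlocked or not, are forcibly unmapped. */
+    fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, offset, len);
+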
Page Reclaim in shrink_*_list()
-------------------------------
-shrink_active_list() culls any obviously unevictable pages - i.e.
-!page_evictable(page) - diverting these to the unevictable list.
+vmscan's shrink_active_list() culls any obviously unevictable pages -
+i.e. !page_evictable(page) pages - diverting those to the unevictable list.
However, shrink_active_list() only sees unevictable pages that made it onto the
-active/inactive lru lists. Note that these pages do not have PageUnevictable
-set - otherwise they would be on the unevictable list and shrink_active_list
+active/inactive LRU lists. Note that these pages do not have PageUnevictable
+set - otherwise they would be on the unevictable list and shrink_active_list()
would never see them.
Some examples of these unevictable pages on the LRU lists are:
@@ -586,20 +540,15 @@ Some examples of these unevictable pages on the LRU lists are:
when an application accesses the page the first time after SHM_LOCK'ing
the segment.
- (3) mlocked pages that could not be isolated from the LRU and moved to the
- unevictable list in mlock_vma_page().
-
-shrink_inactive_list() also diverts any unevictable pages that it finds on the
-inactive lists to the appropriate node's unevictable list.
+ (3) pages still mapped into VM_LOCKED VMAs, which should be marked mlocked,
+ but events left mlock_count too low, so they were munlocked too early.
-shrink_inactive_list() should only see SHM_LOCK'd pages that became SHM_LOCK'd
-after shrink_active_list() had moved them to the inactive list, or pages mapped
-into VM_LOCKED VMAs that munlock_vma_page() couldn't isolate from the LRU to
-recheck via page_mlock(). shrink_inactive_list() won't notice the latter,
-but will pass on to shrink_page_list().
+vmscan's shrink_inactive_list() and shrink_page_list() also divert obviously
+unevictable pages found on the inactive lists to the appropriate memory cgroup
+and node unevictable list.
-shrink_page_list() again culls obviously unevictable pages that it could
-encounter for similar reason to shrink_inactive_list(). Pages mapped into
-VM_LOCKED VMAs but without PG_mlocked set will make it all the way to
-try_to_unmap(). shrink_page_list() will divert them to the unevictable list
-when try_to_unmap() returns SWAP_MLOCK, as discussed above.
+rmap's page_referenced_one(), called via vmscan's shrink_active_list() or
+shrink_page_list(), and rmap's try_to_unmap_one() called via shrink_page_list(),
+check for (3) pages still mapped into VM_LOCKED VMAs, and call mlock_vma_page()
+to correct them. Such pages are culled to the unevictable list when released
+by the shrinker.