diff options
Diffstat (limited to 'Documentation')
113 files changed, 3082 insertions, 558 deletions
diff --git a/Documentation/ABI/testing/debugfs-driver-qat_telemetry b/Documentation/ABI/testing/debugfs-driver-qat_telemetry new file mode 100644 index 000000000000..eacee2072088 --- /dev/null +++ b/Documentation/ABI/testing/debugfs-driver-qat_telemetry @@ -0,0 +1,228 @@ +What: /sys/kernel/debug/qat_<device>_<BDF>/telemetry/control +Date: March 2024 +KernelVersion: 6.8 +Contact: qat-linux@intel.com +Description: (RW) Enables/disables the reporting of telemetry metrics. + + Allowed values to write: + ======================== + * 0: disable telemetry + * 1: enable telemetry + * 2, 3, 4: enable telemetry and calculate minimum, maximum + and average for each counter over 2, 3 or 4 samples + + Returned values: + ================ + * 1-4: telemetry is enabled and running + * 0: telemetry is disabled + + Example. + + Writing '3' to this file starts the collection of + telemetry metrics. Samples are collected every second and + stored in a circular buffer of size 3. These values are then + used to calculate the minimum, maximum and average for each + counter. After enabling, counters can be retrieved through + the ``device_data`` file:: + + echo 3 > /sys/kernel/debug/qat_4xxx_0000:6b:00.0/telemetry/control + + Writing '0' to this file stops the collection of telemetry + metrics:: + + echo 0 > /sys/kernel/debug/qat_4xxx_0000:6b:00.0/telemetry/control + + This attribute is only available for qat_4xxx devices. + +What: /sys/kernel/debug/qat_<device>_<BDF>/telemetry/device_data +Date: March 2024 +KernelVersion: 6.8 +Contact: qat-linux@intel.com +Description: (RO) Reports device telemetry counters. + Reads report metrics about performance and utilization of + a QAT device: + + ======================= ======================================== + Field Description + ======================= ======================================== + sample_cnt number of acquisitions of telemetry data + from the device. Reads are performed + every 1000 ms. + pci_trans_cnt number of PCIe partial transactions + max_rd_lat maximum logged read latency [ns] (could + be any read operation) + rd_lat_acc_avg average read latency [ns] + max_gp_lat max get to put latency [ns] (only takes + samples for AE0) + gp_lat_acc_avg average get to put latency [ns] + bw_in PCIe, write bandwidth [Mbps] + bw_out PCIe, read bandwidth [Mbps] + at_page_req_lat_avg Address Translator(AT), average page + request latency [ns] + at_trans_lat_avg AT, average page translation latency [ns] + at_max_tlb_used AT, maximum uTLB used + util_cpr<N> utilization of Compression slice N [%] + exec_cpr<N> execution count of Compression slice N + util_xlt<N> utilization of Translator slice N [%] + exec_xlt<N> execution count of Translator slice N + util_dcpr<N> utilization of Decompression slice N [%] + exec_dcpr<N> execution count of Decompression slice N + util_pke<N> utilization of PKE N [%] + exec_pke<N> execution count of PKE N + util_ucs<N> utilization of UCS slice N [%] + exec_ucs<N> execution count of UCS slice N + util_wat<N> utilization of Wireless Authentication + slice N [%] + exec_wat<N> execution count of Wireless Authentication + slice N + util_wcp<N> utilization of Wireless Cipher slice N [%] + exec_wcp<N> execution count of Wireless Cipher slice N + util_cph<N> utilization of Cipher slice N [%] + exec_cph<N> execution count of Cipher slice N + util_ath<N> utilization of Authentication slice N [%] + exec_ath<N> execution count of Authentication slice N + ======================= ======================================== + + The telemetry report file can be read with the following command:: + + cat /sys/kernel/debug/qat_4xxx_0000:6b:00.0/telemetry/device_data + + If ``control`` is set to 1, only the current values of the + counters are displayed:: + + <counter_name> <current> + + If ``control`` is 2, 3 or 4, counters are displayed in the + following format:: + + <counter_name> <current> <min> <max> <avg> + + If a device lacks of a specific accelerator, the corresponding + attribute is not reported. + + This attribute is only available for qat_4xxx devices. + +What: /sys/kernel/debug/qat_<device>_<BDF>/telemetry/rp_<A/B/C/D>_data +Date: March 2024 +KernelVersion: 6.8 +Contact: qat-linux@intel.com +Description: (RW) Selects up to 4 Ring Pairs (RP) to monitor, one per file, + and report telemetry counters related to each. + + Allowed values to write: + ======================== + * 0 to ``<num_rps - 1>``: + Ring pair to be monitored. The value of ``num_rps`` can be + retrieved through ``/sys/bus/pci/devices/<BDF>/qat/num_rps``. + See Documentation/ABI/testing/sysfs-driver-qat. + + Reads report metrics about performance and utilization of + the selected RP: + + ======================= ======================================== + Field Description + ======================= ======================================== + sample_cnt number of acquisitions of telemetry data + from the device. Reads are performed + every 1000 ms + rp_num RP number associated with slot <A/B/C/D> + service_type service associated to the RP + pci_trans_cnt number of PCIe partial transactions + gp_lat_acc_avg average get to put latency [ns] + bw_in PCIe, write bandwidth [Mbps] + bw_out PCIe, read bandwidth [Mbps] + at_glob_devtlb_hit Message descriptor DevTLB hit rate + at_glob_devtlb_miss Message descriptor DevTLB miss rate + tl_at_payld_devtlb_hit Payload DevTLB hit rate + tl_at_payld_devtlb_miss Payload DevTLB miss rate + ======================= ======================================== + + Example. + + Writing the value '32' to the file ``rp_C_data`` starts the + collection of telemetry metrics for ring pair 32:: + + echo 32 > /sys/kernel/debug/qat_4xxx_0000:6b:00.0/telemetry/rp_C_data + + Once a ring pair is selected, statistics can be read accessing + the file:: + + cat /sys/kernel/debug/qat_4xxx_0000:6b:00.0/telemetry/rp_C_data + + If ``control`` is set to 1, only the current values of the + counters are displayed:: + + <counter_name> <current> + + If ``control`` is 2, 3 or 4, counters are displayed in the + following format:: + + <counter_name> <current> <min> <max> <avg> + + + On QAT GEN4 devices there are 64 RPs on a PF, so the allowed + values are 0..63. This number is absolute to the device. + If Virtual Functions (VF) are used, the ring pair number can + be derived from the Bus, Device, Function of the VF: + + ============ ====== ====== ====== ====== + PCI BDF/VF RP0 RP1 RP2 RP3 + ============ ====== ====== ====== ====== + 0000:6b:0.1 RP 0 RP 1 RP 2 RP 3 + 0000:6b:0.2 RP 4 RP 5 RP 6 RP 7 + 0000:6b:0.3 RP 8 RP 9 RP 10 RP 11 + 0000:6b:0.4 RP 12 RP 13 RP 14 RP 15 + 0000:6b:0.5 RP 16 RP 17 RP 18 RP 19 + 0000:6b:0.6 RP 20 RP 21 RP 22 RP 23 + 0000:6b:0.7 RP 24 RP 25 RP 26 RP 27 + 0000:6b:1.0 RP 28 RP 29 RP 30 RP 31 + 0000:6b:1.1 RP 32 RP 33 RP 34 RP 35 + 0000:6b:1.2 RP 36 RP 37 RP 38 RP 39 + 0000:6b:1.3 RP 40 RP 41 RP 42 RP 43 + 0000:6b:1.4 RP 44 RP 45 RP 46 RP 47 + 0000:6b:1.5 RP 48 RP 49 RP 50 RP 51 + 0000:6b:1.6 RP 52 RP 53 RP 54 RP 55 + 0000:6b:1.7 RP 56 RP 57 RP 58 RP 59 + 0000:6b:2.0 RP 60 RP 61 RP 62 RP 63 + ============ ====== ====== ====== ====== + + The mapping is only valid for the BDFs of VFs on the host. + + + The service provided on a ring-pair varies depending on the + configuration. The configuration for a given device can be + queried and set using ``cfg_services``. + See Documentation/ABI/testing/sysfs-driver-qat for details. + + The following table reports how ring pairs are mapped to VFs + on the PF 0000:6b:0.0 configured for `sym;asym` or `asym;sym`: + + =========== ============ =========== ============ =========== + PCI BDF/VF RP0/service RP1/service RP2/service RP3/service + =========== ============ =========== ============ =========== + 0000:6b:0.1 RP 0 asym RP 1 sym RP 2 asym RP 3 sym + 0000:6b:0.2 RP 4 asym RP 5 sym RP 6 asym RP 7 sym + 0000:6b:0.3 RP 8 asym RP 9 sym RP10 asym RP11 sym + ... ... ... ... ... + =========== ============ =========== ============ =========== + + All VFs follow the same pattern. + + + The following table reports how ring pairs are mapped to VFs on + the PF 0000:6b:0.0 configured for `dc`: + + =========== ============ =========== ============ =========== + PCI BDF/VF RP0/service RP1/service RP2/service RP3/service + =========== ============ =========== ============ =========== + 0000:6b:0.1 RP 0 dc RP 1 dc RP 2 dc RP 3 dc + 0000:6b:0.2 RP 4 dc RP 5 dc RP 6 dc RP 7 dc + 0000:6b:0.3 RP 8 dc RP 9 dc RP10 dc RP11 dc + ... ... ... ... ... + =========== ============ =========== ============ =========== + + The mapping of a RP to a service can be retrieved using + ``rp2srv`` from sysfs. + See Documentation/ABI/testing/sysfs-driver-qat for details. + + This attribute is only available for qat_4xxx devices. diff --git a/Documentation/ABI/testing/debugfs-hisi-hpre b/Documentation/ABI/testing/debugfs-hisi-hpre index 82abf92df429..8e8de49c5cc6 100644 --- a/Documentation/ABI/testing/debugfs-hisi-hpre +++ b/Documentation/ABI/testing/debugfs-hisi-hpre @@ -101,7 +101,7 @@ What: /sys/kernel/debug/hisi_hpre/<bdf>/qm/status Date: Apr 2020 Contact: linux-crypto@vger.kernel.org Description: Dump the status of the QM. - Four states: initiated, started, stopped and closed. + Two states: work, stop. Available for both PF and VF, and take no other effect on HPRE. What: /sys/kernel/debug/hisi_hpre/<bdf>/qm/diff_regs diff --git a/Documentation/ABI/testing/debugfs-hisi-sec b/Documentation/ABI/testing/debugfs-hisi-sec index 93c530d1bf0f..deeefe2c735e 100644 --- a/Documentation/ABI/testing/debugfs-hisi-sec +++ b/Documentation/ABI/testing/debugfs-hisi-sec @@ -81,7 +81,7 @@ What: /sys/kernel/debug/hisi_sec2/<bdf>/qm/status Date: Apr 2020 Contact: linux-crypto@vger.kernel.org Description: Dump the status of the QM. - Four states: initiated, started, stopped and closed. + Two states: work, stop. Available for both PF and VF, and take no other effect on SEC. What: /sys/kernel/debug/hisi_sec2/<bdf>/qm/diff_regs diff --git a/Documentation/ABI/testing/debugfs-hisi-zip b/Documentation/ABI/testing/debugfs-hisi-zip index fd3f314cf8d1..593714afaed2 100644 --- a/Documentation/ABI/testing/debugfs-hisi-zip +++ b/Documentation/ABI/testing/debugfs-hisi-zip @@ -94,7 +94,7 @@ What: /sys/kernel/debug/hisi_zip/<bdf>/qm/status Date: Apr 2020 Contact: linux-crypto@vger.kernel.org Description: Dump the status of the QM. - Four states: initiated, started, stopped and closed. + Two states: work, stop. Available for both PF and VF, and take no other effect on ZIP. What: /sys/kernel/debug/hisi_zip/<bdf>/qm/diff_regs diff --git a/Documentation/ABI/testing/sysfs-bus-event_source-devices-caps b/Documentation/ABI/testing/sysfs-bus-event_source-devices-caps index 8757dcf41c08..a5f506f7d481 100644 --- a/Documentation/ABI/testing/sysfs-bus-event_source-devices-caps +++ b/Documentation/ABI/testing/sysfs-bus-event_source-devices-caps @@ -16,3 +16,9 @@ Description: Example output in powerpc: grep . /sys/bus/event_source/devices/cpu/caps/* /sys/bus/event_source/devices/cpu/caps/pmu_name:POWER9 + + The "branch_counter_nr" in the supported platform exposes the + maximum number of counters which can be shown in the u64 counters + of PERF_SAMPLE_BRANCH_COUNTERS, while the "branch_counter_width" + exposes the width of each counter. Both of them can be used by + the perf tool to parse the logged counters in each branch. diff --git a/Documentation/ABI/testing/sysfs-bus-spi-devices-spi-nor b/Documentation/ABI/testing/sysfs-bus-spi-devices-spi-nor index c800621eff95..9ed5582ddea2 100644 --- a/Documentation/ABI/testing/sysfs-bus-spi-devices-spi-nor +++ b/Documentation/ABI/testing/sysfs-bus-spi-devices-spi-nor @@ -25,6 +25,9 @@ KernelVersion: 5.14 Contact: linux-mtd@lists.infradead.org Description: (RO) Part name of the SPI NOR flash. + The attribute is optional. User space should not rely on + it to be present or even correct. Instead, user space + should read the jedec_id attribute. What: /sys/bus/spi/devices/.../spi-nor/sfdp Date: April 2021 diff --git a/Documentation/ABI/testing/sysfs-class-devfreq b/Documentation/ABI/testing/sysfs-class-devfreq index 5e6b74f30406..1e7e0bb4c14e 100644 --- a/Documentation/ABI/testing/sysfs-class-devfreq +++ b/Documentation/ABI/testing/sysfs-class-devfreq @@ -52,6 +52,9 @@ Description: echo 0 > /sys/class/devfreq/.../trans_stat + If the transition table is bigger than PAGE_SIZE, reading + this will return an -EFBIG error. + What: /sys/class/devfreq/.../available_frequencies Date: October 2012 Contact: Nishanth Menon <nm@ti.com> diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-damon b/Documentation/ABI/testing/sysfs-kernel-mm-damon index b35649a46a2f..bfa5b8288d8d 100644 --- a/Documentation/ABI/testing/sysfs-kernel-mm-damon +++ b/Documentation/ABI/testing/sysfs-kernel-mm-damon @@ -25,12 +25,14 @@ Description: Writing 'on' or 'off' to this file makes the kdamond starts or stops, respectively. Reading the file returns the keywords based on the current status. Writing 'commit' to this file makes the kdamond reads the user inputs in the sysfs files - except 'state' again. Writing 'update_schemes_stats' to the - file updates contents of schemes stats files of the kdamond. - Writing 'update_schemes_tried_regions' to the file updates - contents of 'tried_regions' directory of every scheme directory - of this kdamond. Writing 'update_schemes_tried_bytes' to the - file updates only '.../tried_regions/total_bytes' files of this + except 'state' again. Writing 'commit_schemes_quota_goals' to + this file makes the kdamond reads the quota goal files again. + Writing 'update_schemes_stats' to the file updates contents of + schemes stats files of the kdamond. Writing + 'update_schemes_tried_regions' to the file updates contents of + 'tried_regions' directory of every scheme directory of this + kdamond. Writing 'update_schemes_tried_bytes' to the file + updates only '.../tried_regions/total_bytes' files of this kdamond. Writing 'clear_schemes_tried_regions' to the file removes contents of the 'tried_regions' directory. @@ -212,6 +214,25 @@ Contact: SeongJae Park <sj@kernel.org> Description: Writing to and reading from this file sets and gets the quotas charge reset interval of the scheme in milliseconds. +What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/quotas/goals/nr_goals +Date: Nov 2023 +Contact: SeongJae Park <sj@kernel.org> +Description: Writing a number 'N' to this file creates the number of + directories for setting automatic tuning of the scheme's + aggressiveness named '0' to 'N-1' under the goals/ directory. + +What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/quotas/goals/<G>/target_value +Date: Nov 2023 +Contact: SeongJae Park <sj@kernel.org> +Description: Writing to and reading from this file sets and gets the target + value of the goal metric. + +What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/quotas/goals/<G>/current_value +Date: Nov 2023 +Contact: SeongJae Park <sj@kernel.org> +Description: Writing to and reading from this file sets and gets the current + value of the goal metric. + What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/quotas/weights/sz_permil Date: Mar 2022 Contact: SeongJae Park <sj@kernel.org> diff --git a/Documentation/ABI/testing/sysfs-platform-silicom b/Documentation/ABI/testing/sysfs-platform-silicom new file mode 100644 index 000000000000..2288b3665d16 --- /dev/null +++ b/Documentation/ABI/testing/sysfs-platform-silicom @@ -0,0 +1,29 @@ +What: /sys/devices/platform/silicom-platform/uc_version +Date: November 2023 +KernelVersion: 6.7 +Contact: Henry Shi <henrys@silicom-usa.com> +Description: + This file allows to read microcontroller firmware + version of current platform. + +What: /sys/devices/platform/silicom-platform/power_cycle +Date: November 2023 +KernelVersion: 6.7 +Contact: Henry Shi <henrys@silicom-usa.com> + This file allow user to power cycle the platform. + Default value is 0; when set to 1, it powers down + the platform, waits 5 seconds, then powers on the + device. It returns to default value after power cycle. + + 0 - default value. + +What: /sys/devices/platform/silicom-platform/efuse_status +Date: November 2023 +KernelVersion: 6.7 +Contact: Henry Shi <henrys@silicom-usa.com> +Description: + This file is read only. It returns the current + OTP status: + + 0 - not programmed. + 1 - programmed. diff --git a/Documentation/RAS/ras.rst b/Documentation/RAS/ras.rst new file mode 100644 index 000000000000..2556b397cd27 --- /dev/null +++ b/Documentation/RAS/ras.rst @@ -0,0 +1,26 @@ +.. SPDX-License-Identifier: GPL-2.0 + +Reliability, Availability and Serviceability features +===================================================== + +This documents different aspects of the RAS functionality present in the +kernel. + +Error decoding +--------------- + +* x86 + +Error decoding on AMD systems should be done using the rasdaemon tool: +https://github.com/mchehab/rasdaemon/ + +While the daemon is running, it would automatically log and decode +errors. If not, one can still decode such errors by supplying the +hardware information from the error:: + + $ rasdaemon -p --status <STATUS> --ipid <IPID> --smca + +Also, the user can pass particular family and model to decode the error +string:: + + $ rasdaemon -p --status <STATUS> --ipid <IPID> --smca --family <CPU Family> --model <CPU Model> --bank <BANK_NUM> diff --git a/Documentation/admin-guide/blockdev/zram.rst b/Documentation/admin-guide/blockdev/zram.rst index e4551579cb12..ee2b0030d416 100644 --- a/Documentation/admin-guide/blockdev/zram.rst +++ b/Documentation/admin-guide/blockdev/zram.rst @@ -328,7 +328,7 @@ as idle:: From now on, any pages on zram are idle pages. The idle mark will be removed until someone requests access of the block. IOW, unless there is access request, those pages are still idle pages. -Additionally, when CONFIG_ZRAM_MEMORY_TRACKING is enabled pages can be +Additionally, when CONFIG_ZRAM_TRACK_ENTRY_ACTIME is enabled pages can be marked as idle based on how long (in seconds) it's been since they were last accessed:: diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 3f85254f3cef..17e6e9565156 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1093,7 +1093,11 @@ All time durations are in microseconds. A read-write single value file which exists on non-root cgroups. The default is "100". - The weight in the range [1, 10000]. + For non idle groups (cpu.idle = 0), the weight is in the + range [1, 10000]. + + If the cgroup has been configured to be SCHED_IDLE (cpu.idle = 1), + then the weight will show as a 0. cpu.weight.nice A read-write single value file which exists on non-root @@ -1157,6 +1161,16 @@ All time durations are in microseconds. values similar to the sched_setattr(2). This maximum utilization value is used to clamp the task specific maximum utilization clamp. + cpu.idle + A read-write single value file which exists on non-root cgroups. + The default is 0. + + This is the cgroup analog of the per-task SCHED_IDLE sched policy. + Setting this value to a 1 will make the scheduling policy of the + cgroup SCHED_IDLE. The threads inside the cgroup will retain their + own relative priorities, but the cgroup itself will be treated as + very low priority relative to its peers. + Memory @@ -1679,6 +1693,21 @@ PAGE_SIZE multiple when read back. limit, it will refuse to take any more stores before existing entries fault back in or are written out to disk. + memory.zswap.writeback + A read-write single value file. The default value is "1". The + initial value of the root cgroup is 1, and when a new cgroup is + created, it inherits the current value of its parent. + + When this is set to 0, all swapping attempts to swapping devices + are disabled. This included both zswap writebacks, and swapping due + to zswap store failures. If the zswap store failures are recurring + (for e.g if the pages are incompressible), users can observe + reclaim inefficiency after disabling writeback (because the same + pages might be rejected again and again). + + Note that this is subtly different from setting memory.swap.max to + 0, as it still allows for pages to be written to the zswap pool. + memory.pressure A read-only nested-keyed file. @@ -2316,6 +2345,13 @@ Cpuset Interface Files treated to have an implicit value of "cpuset.cpus" in the formation of local partition. + cpuset.cpus.isolated + A read-only and root cgroup only multiple values file. + + This file shows the set of all isolated CPUs used in existing + isolated partitions. It will be empty if no isolated partition + is created. + cpuset.cpus.partition A read-write single value file which exists on non-root cpuset-enabled cgroups. This flag is owned by the parent cgroup @@ -2358,11 +2394,11 @@ Cpuset Interface Files partition or scheduling domain. The set of exclusive CPUs is determined by the value of its "cpuset.cpus.exclusive.effective". - When set to "isolated", the CPUs in that partition will - be in an isolated state without any load balancing from the - scheduler. Tasks placed in such a partition with multiple - CPUs should be carefully distributed and bound to each of the - individual CPUs for optimal performance. + When set to "isolated", the CPUs in that partition will be in + an isolated state without any load balancing from the scheduler + and excluded from the unbound workqueues. Tasks placed in such + a partition with multiple CPUs should be carefully distributed + and bound to each of the individual CPUs for optimal performance. A partition root ("root" or "isolated") can be in one of the two possible states - valid or invalid. An invalid partition diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst index 43ea35613dfc..fb40a1f6f79e 100644 --- a/Documentation/admin-guide/index.rst +++ b/Documentation/admin-guide/index.rst @@ -119,6 +119,7 @@ configure specific aspects of kernel behavior to your liking. parport perf-security pm/index + pmf pnp rapidio ras diff --git a/Documentation/admin-guide/kdump/vmcoreinfo.rst b/Documentation/admin-guide/kdump/vmcoreinfo.rst index 78e4d2e7ba14..bced9e4b6e08 100644 --- a/Documentation/admin-guide/kdump/vmcoreinfo.rst +++ b/Documentation/admin-guide/kdump/vmcoreinfo.rst @@ -172,7 +172,7 @@ variables. Offset of the free_list's member. This value is used to compute the number of free pages. -Each zone has a free_area structure array called free_area[MAX_ORDER + 1]. +Each zone has a free_area structure array called free_area[NR_PAGE_ORDERS]. The free_list represents a linked list of free page blocks. (list_head, next|prev) @@ -189,11 +189,11 @@ Offsets of the vmap_area's members. They carry vmalloc-specific information. Makedumpfile gets the start address of the vmalloc region from this. -(zone.free_area, MAX_ORDER + 1) -------------------------------- +(zone.free_area, NR_PAGE_ORDERS) +-------------------------------- Free areas descriptor. User-space tools use this value to iterate the -free_area ranges. MAX_ORDER is used by the zone buddy allocator. +free_area ranges. NR_PAGE_ORDERS is used by the zone buddy allocator. prb --- diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 65731b060e3f..e0891ac76ab3 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -970,17 +970,17 @@ buddy allocator. Bigger value increase the probability of catching random memory corruption, but reduce the amount of memory for normal system use. The maximum - possible value is MAX_ORDER/2. Setting this parameter - to 1 or 2 should be enough to identify most random - memory corruption problems caused by bugs in kernel or - driver code when a CPU writes to (or reads from) a - random memory location. Note that there exists a class - of memory corruptions problems caused by buggy H/W or - F/W or by drivers badly programming DMA (basically when - memory is written at bus level and the CPU MMU is - bypassed) which are not detectable by - CONFIG_DEBUG_PAGEALLOC, hence this option will not help - tracking down these problems. + possible value is MAX_PAGE_ORDER/2. Setting this + parameter to 1 or 2 should be enough to identify most + random memory corruption problems caused by bugs in + kernel or driver code when a CPU writes to (or reads + from) a random memory location. Note that there exists + a class of memory corruptions problems caused by buggy + H/W or F/W or by drivers badly programming DMA + (basically when memory is written at bus level and the + CPU MMU is bypassed) which are not detectable by + CONFIG_DEBUG_PAGEALLOC, hence this option will not + help tracking down these problems. debug_pagealloc= [KNL] When CONFIG_DEBUG_PAGEALLOC is set, this parameter @@ -4136,7 +4136,7 @@ [KNL] Minimal page reporting order Format: <integer> Adjust the minimal page reporting order. The page - reporting is disabled when it exceeds MAX_ORDER. + reporting is disabled when it exceeds MAX_PAGE_ORDER. panic= [KNL] Kernel behaviour on panic: delay <timeout> timeout > 0: seconds before rebooting @@ -5544,6 +5544,13 @@ print every Nth verbose statement, where N is the value specified. + regulator_ignore_unused + [REGULATOR] + Prevents regulator framework from disabling regulators + that are unused, due no driver claiming them. This may + be useful for debug and development, but should not be + needed on a platform with proper driver support. + relax_domain_level= [KNL, SMP] Set scheduler's default relax_domain_level. See Documentation/admin-guide/cgroup-v1/cpusets.rst. diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst index da94feb97ed1..9d23144bf985 100644 --- a/Documentation/admin-guide/mm/damon/usage.rst +++ b/Documentation/admin-guide/mm/damon/usage.rst @@ -59,41 +59,47 @@ Files Hierarchy The files hierarchy of DAMON sysfs interface is shown below. In the below figure, parents-children relations are represented with indentations, each directory is having ``/`` suffix, and files in each directory are separated by -comma (","). :: - - /sys/kernel/mm/damon/admin - │ kdamonds/nr_kdamonds - │ │ 0/state,pid - │ │ │ contexts/nr_contexts - │ │ │ │ 0/avail_operations,operations - │ │ │ │ │ monitoring_attrs/ +comma (","). + +.. parsed-literal:: + + :ref:`/sys/kernel/mm/damon <sysfs_root>`/admin + │ :ref:`kdamonds <sysfs_kdamonds>`/nr_kdamonds + │ │ :ref:`0 <sysfs_kdamond>`/state,pid + │ │ │ :ref:`contexts <sysfs_contexts>`/nr_contexts + │ │ │ │ :ref:`0 <sysfs_context>`/avail_operations,operations + │ │ │ │ │ :ref:`monitoring_attrs <sysfs_monitoring_attrs>`/ │ │ │ │ │ │ intervals/sample_us,aggr_us,update_us │ │ │ │ │ │ nr_regions/min,max - │ │ │ │ │ targets/nr_targets - │ │ │ │ │ │ 0/pid_target - │ │ │ │ │ │ │ regions/nr_regions - │ │ │ │ │ │ │ │ 0/start,end + │ │ │ │ │ :ref:`targets <sysfs_targets>`/nr_targets + │ │ │ │ │ │ :ref:`0 <sysfs_target>`/pid_target + │ │ │ │ │ │ │ :ref:`regions <sysfs_regions>`/nr_regions + │ │ │ │ │ │ │ │ :ref:`0 <sysfs_region>`/start,end │ │ │ │ │ │ │ │ ... │ │ │ │ │ │ ... - │ │ │ │ │ schemes/nr_schemes - │ │ │ │ │ │ 0/action,apply_interval_us - │ │ │ │ │ │ │ access_pattern/ + │ │ │ │ │ :ref:`schemes <sysfs_schemes>`/nr_schemes + │ │ │ │ │ │ :ref:`0 <sysfs_scheme>`/action,apply_interval_us + │ │ │ │ │ │ │ :ref:`access_pattern <sysfs_access_pattern>`/ │ │ │ │ │ │ │ │ sz/min,max │ │ │ │ │ │ │ │ nr_accesses/min,max │ │ │ │ │ │ │ │ age/min,max - │ │ │ │ │ │ │ quotas/ms,bytes,reset_interval_ms + │ │ │ │ │ │ │ :ref:`quotas <sysfs_quotas>`/ms,bytes,reset_interval_ms │ │ │ │ │ │ │ │ weights/sz_permil,nr_accesses_permil,age_permil - │ │ │ │ │ │ │ watermarks/metric,interval_us,high,mid,low - │ │ │ │ │ │ │ filters/nr_filters + │ │ │ │ │ │ │ │ :ref:`goals <sysfs_schemes_quota_goals>`/nr_goals + │ │ │ │ │ │ │ │ │ 0/target_value,current_value + │ │ │ │ │ │ │ :ref:`watermarks <sysfs_watermarks>`/metric,interval_us,high,mid,low + │ │ │ │ │ │ │ :ref:`filters <sysfs_filters>`/nr_filters │ │ │ │ │ │ │ │ 0/type,matching,memcg_id - │ │ │ │ │ │ │ stats/nr_tried,sz_tried,nr_applied,sz_applied,qt_exceeds - │ │ │ │ │ │ │ tried_regions/total_bytes + │ │ │ │ │ │ │ :ref:`stats <sysfs_schemes_stats>`/nr_tried,sz_tried,nr_applied,sz_applied,qt_exceeds + │ │ │ │ │ │ │ :ref:`tried_regions <sysfs_schemes_tried_regions>`/total_bytes │ │ │ │ │ │ │ │ 0/start,end,nr_accesses,age │ │ │ │ │ │ │ │ ... │ │ │ │ │ │ ... │ │ │ │ ... │ │ ... +.. _sysfs_root: + Root ---- @@ -102,6 +108,8 @@ has one directory named ``admin``. The directory contains the files for privileged user space programs' control of DAMON. User space tools or daemons having the root permission could use this directory. +.. _sysfs_kdamonds: + kdamonds/ --------- @@ -113,6 +121,8 @@ details) exists. In the beginning, this directory has only one file, child directories named ``0`` to ``N-1``. Each directory represents each kdamond. +.. _sysfs_kdamond: + kdamonds/<N>/ ------------- @@ -120,29 +130,37 @@ In each kdamond directory, two files (``state`` and ``pid``) and one directory (``contexts``) exist. Reading ``state`` returns ``on`` if the kdamond is currently running, or -``off`` if it is not running. Writing ``on`` or ``off`` makes the kdamond be -in the state. Writing ``commit`` to the ``state`` file makes kdamond reads the -user inputs in the sysfs files except ``state`` file again. Writing -``update_schemes_stats`` to ``state`` file updates the contents of stats files -for each DAMON-based operation scheme of the kdamond. For details of the -stats, please refer to :ref:`stats section <sysfs_schemes_stats>`. - -Writing ``update_schemes_tried_regions`` to ``state`` file updates the -DAMON-based operation scheme action tried regions directory for each -DAMON-based operation scheme of the kdamond. Writing -``update_schemes_tried_bytes`` to ``state`` file updates only -``.../tried_regions/total_bytes`` files. Writing -``clear_schemes_tried_regions`` to ``state`` file clears the DAMON-based -operating scheme action tried regions directory for each DAMON-based operation -scheme of the kdamond. For details of the DAMON-based operation scheme action -tried regions directory, please refer to :ref:`tried_regions section -<sysfs_schemes_tried_regions>`. +``off`` if it is not running. + +Users can write below commands for the kdamond to the ``state`` file. + +- ``on``: Start running. +- ``off``: Stop running. +- ``commit``: Read the user inputs in the sysfs files except ``state`` file + again. +- ``commit_schemes_quota_goals``: Read the DAMON-based operation schemes' + :ref:`quota goals <sysfs_schemes_quota_goals>`. +- ``update_schemes_stats``: Update the contents of stats files for each + DAMON-based operation scheme of the kdamond. For details of the stats, + please refer to :ref:`stats section <sysfs_schemes_stats>`. +- ``update_schemes_tried_regions``: Update the DAMON-based operation scheme + action tried regions directory for each DAMON-based operation scheme of the + kdamond. For details of the DAMON-based operation scheme action tried + regions directory, please refer to + :ref:`tried_regions section <sysfs_schemes_tried_regions>`. +- ``update_schemes_tried_bytes``: Update only ``.../tried_regions/total_bytes`` + files. +- ``clear_schemes_tried_regions``: Clear the DAMON-based operating scheme + action tried regions directory for each DAMON-based operation scheme of the + kdamond. If the state is ``on``, reading ``pid`` shows the pid of the kdamond thread. ``contexts`` directory contains files for controlling the monitoring contexts that this kdamond will execute. +.. _sysfs_contexts: + kdamonds/<N>/contexts/ ---------------------- @@ -153,7 +171,7 @@ number (``N``) to the file creates the number of child directories named as details). At the moment, only one context per kdamond is supported, so only ``0`` or ``1`` can be written to the file. -.. _sysfs_contexts: +.. _sysfs_context: contexts/<N>/ ------------- @@ -203,6 +221,8 @@ writing to and rading from the files. For more details about the intervals and monitoring regions range, please refer to the Design document (:doc:`/mm/damon/design`). +.. _sysfs_targets: + contexts/<N>/targets/ --------------------- @@ -210,6 +230,8 @@ In the beginning, this directory has only one file, ``nr_targets``. Writing a number (``N``) to the file creates the number of child directories named ``0`` to ``N-1``. Each directory represents each monitoring target. +.. _sysfs_target: + targets/<N>/ ------------ @@ -244,6 +266,8 @@ In the beginning, this directory has only one file, ``nr_regions``. Writing a number (``N``) to the file creates the number of child directories named ``0`` to ``N-1``. Each directory represents each initial monitoring target region. +.. _sysfs_region: + regions/<N>/ ------------ @@ -254,6 +278,8 @@ region by writing to and reading from the files, respectively. Each region should not overlap with others. ``end`` of directory ``N`` should be equal or smaller than ``start`` of directory ``N+1``. +.. _sysfs_schemes: + contexts/<N>/schemes/ --------------------- @@ -265,6 +291,8 @@ In the beginning, this directory has only one file, ``nr_schemes``. Writing a number (``N``) to the file creates the number of child directories named ``0`` to ``N-1``. Each directory represents each DAMON-based operation scheme. +.. _sysfs_scheme: + schemes/<N>/ ------------ @@ -277,7 +305,7 @@ The ``action`` file is for setting and getting the scheme's :ref:`action from the file and their meaning are as below. Note that support of each action depends on the running DAMON operations set -:ref:`implementation <sysfs_contexts>`. +:ref:`implementation <sysfs_context>`. - ``willneed``: Call ``madvise()`` for the region with ``MADV_WILLNEED``. Supported by ``vaddr`` and ``fvaddr`` operations set. @@ -299,6 +327,8 @@ Note that support of each action depends on the running DAMON operations set The ``apply_interval_us`` file is for setting and getting the scheme's :ref:`apply_interval <damon_design_damos>` in microseconds. +.. _sysfs_access_pattern: + schemes/<N>/access_pattern/ --------------------------- @@ -312,6 +342,8 @@ to and reading from the ``min`` and ``max`` files under ``sz``, ``nr_accesses``, and ``age`` directories, respectively. Note that the ``min`` and the ``max`` form a closed interval. +.. _sysfs_quotas: + schemes/<N>/quotas/ ------------------- @@ -319,8 +351,7 @@ The directory for the :ref:`quotas <damon_design_damos_quotas>` of the given DAMON-based operation scheme. Under ``quotas`` directory, three files (``ms``, ``bytes``, -``reset_interval_ms``) and one directory (``weights``) having three files -(``sz_permil``, ``nr_accesses_permil``, and ``age_permil``) in it exist. +``reset_interval_ms``) and two directores (``weights`` and ``goals``) exist. You can set the ``time quota`` in milliseconds, ``size quota`` in bytes, and ``reset interval`` in milliseconds by writing the values to the three files, @@ -330,11 +361,37 @@ apply the action to only up to ``bytes`` bytes of memory regions within the ``reset_interval_ms``. Setting both ``ms`` and ``bytes`` zero disables the quota limits. -You can also set the :ref:`prioritization weights +Under ``weights`` directory, three files (``sz_permil``, +``nr_accesses_permil``, and ``age_permil``) exist. +You can set the :ref:`prioritization weights <damon_design_damos_quotas_prioritization>` for size, access frequency, and age in per-thousand unit by writing the values to the three files under the ``weights`` directory. +.. _sysfs_schemes_quota_goals: + +schemes/<N>/quotas/goals/ +------------------------- + +The directory for the :ref:`automatic quota tuning goals +<damon_design_damos_quotas_auto_tuning>` of the given DAMON-based operation +scheme. + +In the beginning, this directory has only one file, ``nr_goals``. Writing a +number (``N``) to the file creates the number of child directories named ``0`` +to ``N-1``. Each directory represents each goal and current achievement. +Among the multiple feedback, the best one is used. + +Each goal directory contains two files, namely ``target_value`` and +``current_value``. Users can set and get any number to those files to set the +feedback. User space main workload's latency or throughput, system metrics +like free memory ratio or memory pressure stall time (PSI) could be example +metrics for the values. Note that users should write +``commit_schemes_quota_goals`` to the ``state`` file of the :ref:`kdamond +directory <sysfs_kdamond>` to pass the feedback to DAMON. + +.. _sysfs_watermarks: + schemes/<N>/watermarks/ ----------------------- @@ -354,6 +411,8 @@ as below. The ``interval`` should written in microseconds unit. +.. _sysfs_filters: + schemes/<N>/filters/ -------------------- @@ -394,7 +453,7 @@ pages of all memory cgroups except ``/having_care_already``.:: echo N > 1/matching Note that ``anon`` and ``memcg`` filters are currently supported only when -``paddr`` :ref:`implementation <sysfs_contexts>` is being used. +``paddr`` :ref:`implementation <sysfs_context>` is being used. Also, memory regions that are filtered out by ``addr`` or ``target`` filters are not counted as the scheme has tried to those, while regions that filtered @@ -449,6 +508,8 @@ and query-like efficient data access monitoring results retrievals. For the latter use case, in particular, users can set the ``action`` as ``stat`` and set the ``access pattern`` as their interested pattern that they want to query. +.. _sysfs_schemes_tried_region: + tried_regions/<N>/ ------------------ diff --git a/Documentation/admin-guide/mm/ksm.rst b/Documentation/admin-guide/mm/ksm.rst index e59231ac6bb7..a639cac12477 100644 --- a/Documentation/admin-guide/mm/ksm.rst +++ b/Documentation/admin-guide/mm/ksm.rst @@ -80,6 +80,9 @@ pages_to_scan how many pages to scan before ksmd goes to sleep e.g. ``echo 100 > /sys/kernel/mm/ksm/pages_to_scan``. + The pages_to_scan value cannot be changed if ``advisor_mode`` has + been set to scan-time. + Default: 100 (chosen for demonstration purposes) sleep_millisecs @@ -164,6 +167,29 @@ smart_scan optimization is enabled. The ``pages_skipped`` metric shows how effective the setting is. +advisor_mode + The ``advisor_mode`` selects the current advisor. Two modes are + supported: none and scan-time. The default is none. By setting + ``advisor_mode`` to scan-time, the scan time advisor is enabled. + The section about ``advisor`` explains in detail how the scan time + advisor works. + +adivsor_max_cpu + specifies the upper limit of the cpu percent usage of the ksmd + background thread. The default is 70. + +advisor_target_scan_time + specifies the target scan time in seconds to scan all the candidate + pages. The default value is 200 seconds. + +advisor_min_pages_to_scan + specifies the lower limit of the ``pages_to_scan`` parameter of the + scan time advisor. The default is 500. + +adivsor_max_pages_to_scan + specifies the upper limit of the ``pages_to_scan`` parameter of the + scan time advisor. The default is 30000. + The effectiveness of KSM and MADV_MERGEABLE is shown in ``/sys/kernel/mm/ksm/``: general_profit @@ -263,6 +289,35 @@ ksm_swpin_copy note that KSM page might be copied when swapping in because do_swap_page() cannot do all the locking needed to reconstitute a cross-anon_vma KSM page. +Advisor +======= + +The number of candidate pages for KSM is dynamic. It can be often observed +that during the startup of an application more candidate pages need to be +processed. Without an advisor the ``pages_to_scan`` parameter needs to be +sized for the maximum number of candidate pages. The scan time advisor can +changes the ``pages_to_scan`` parameter based on demand. + +The advisor can be enabled, so KSM can automatically adapt to changes in the +number of candidate pages to scan. Two advisors are implemented: none and +scan-time. With none, no advisor is enabled. The default is none. + +The scan time advisor changes the ``pages_to_scan`` parameter based on the +observed scan times. The possible values for the ``pages_to_scan`` parameter is +limited by the ``advisor_max_cpu`` parameter. In addition there is also the +``advisor_target_scan_time`` parameter. This parameter sets the target time to +scan all the KSM candidate pages. The parameter ``advisor_target_scan_time`` +decides how aggressive the scan time advisor scans candidate pages. Lower +values make the scan time advisor to scan more aggresively. This is the most +important parameter for the configuration of the scan time advisor. + +The initial value and the maximum value can be changed with +``advisor_min_pages_to_scan`` and ``advisor_max_pages_to_scan``. The default +values are sufficient for most workloads and use cases. + +The ``pages_to_scan`` parameter is re-calculated after a scan has been completed. + + -- Izik Eidus, Hugh Dickins, 17 Nov 2009 diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst index fe17cf210426..f5f065c67615 100644 --- a/Documentation/admin-guide/mm/pagemap.rst +++ b/Documentation/admin-guide/mm/pagemap.rst @@ -253,6 +253,7 @@ Following flags about pages are currently supported: - ``PAGE_IS_SWAPPED`` - Page is in swapped - ``PAGE_IS_PFNZERO`` - Page has zero PFN - ``PAGE_IS_HUGE`` - Page is THP or Hugetlb backed +- ``PAGE_IS_SOFT_DIRTY`` - Page is soft-dirty The ``struct pm_scan_arg`` is used as the argument of the IOCTL. diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst index b0cc8243e093..04eb45a2f940 100644 --- a/Documentation/admin-guide/mm/transhuge.rst +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -45,10 +45,25 @@ components: the two is using hugepages just because of the fact the TLB miss is going to run faster. +Modern kernels support "multi-size THP" (mTHP), which introduces the +ability to allocate memory in blocks that are bigger than a base page +but smaller than traditional PMD-size (as described above), in +increments of a power-of-2 number of pages. mTHP can back anonymous +memory (for example 16K, 32K, 64K, etc). These THPs continue to be +PTE-mapped, but in many cases can still provide similar benefits to +those outlined above: Page faults are significantly reduced (by a +factor of e.g. 4, 8, 16, etc), but latency spikes are much less +prominent because the size of each page isn't as huge as the PMD-sized +variant and there is less memory to clear in each page fault. Some +architectures also employ TLB compression mechanisms to squeeze more +entries in when a set of PTEs are virtually and physically contiguous +and approporiately aligned. In this case, TLB misses will occur less +often. + THP can be enabled system wide or restricted to certain tasks or even memory ranges inside task's address space. Unless THP is completely disabled, there is ``khugepaged`` daemon that scans memory and -collapses sequences of basic pages into huge pages. +collapses sequences of basic pages into PMD-sized huge pages. The THP behaviour is controlled via :ref:`sysfs <thp_sysfs>` interface and using madvise(2) and prctl(2) system calls. @@ -95,12 +110,40 @@ Global THP controls Transparent Hugepage Support for anonymous memory can be entirely disabled (mostly for debugging purposes) or only enabled inside MADV_HUGEPAGE regions (to avoid the risk of consuming more memory resources) or enabled -system wide. This can be achieved with one of:: +system wide. This can be achieved per-supported-THP-size with one of:: + + echo always >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled + echo madvise >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled + echo never >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled + +where <size> is the hugepage size being addressed, the available sizes +for which vary by system. + +For example:: + + echo always >/sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled + +Alternatively it is possible to specify that a given hugepage size +will inherit the top-level "enabled" value:: + + echo inherit >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled + +For example:: + + echo inherit >/sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled + +The top-level setting (for use with "inherit") can be set by issuing +one of the following commands:: echo always >/sys/kernel/mm/transparent_hugepage/enabled echo madvise >/sys/kernel/mm/transparent_hugepage/enabled echo never >/sys/kernel/mm/transparent_hugepage/enabled +By default, PMD-sized hugepages have enabled="inherit" and all other +hugepage sizes have enabled="never". If enabling multiple hugepage +sizes, the kernel will select the most appropriate enabled size for a +given allocation. + It's also possible to limit defrag efforts in the VM to generate anonymous hugepages in case they're not immediately free to madvise regions or to never try to defrag memory and simply fallback to regular @@ -146,25 +189,34 @@ madvise never should be self-explanatory. -By default kernel tries to use huge zero page on read page fault to -anonymous mapping. It's possible to disable huge zero page by writing 0 -or enable it back by writing 1:: +By default kernel tries to use huge, PMD-mappable zero page on read +page fault to anonymous mapping. It's possible to disable huge zero +page by writing 0 or enable it back by writing 1:: echo 0 >/sys/kernel/mm/transparent_hugepage/use_zero_page echo 1 >/sys/kernel/mm/transparent_hugepage/use_zero_page -Some userspace (such as a test program, or an optimized memory allocation -library) may want to know the size (in bytes) of a transparent hugepage:: +Some userspace (such as a test program, or an optimized memory +allocation library) may want to know the size (in bytes) of a +PMD-mappable transparent hugepage:: cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size -khugepaged will be automatically started when -transparent_hugepage/enabled is set to "always" or "madvise, and it'll -be automatically shutdown if it's set to "never". +khugepaged will be automatically started when one or more hugepage +sizes are enabled (either by directly setting "always" or "madvise", +or by setting "inherit" while the top-level enabled is set to "always" +or "madvise"), and it'll be automatically shutdown when the last +hugepage size is disabled (either by directly setting "never", or by +setting "inherit" while the top-level enabled is set to "never"). Khugepaged controls ------------------- +.. note:: + khugepaged currently only searches for opportunities to collapse to + PMD-sized THP and no attempt is made to collapse to other THP + sizes. + khugepaged runs usually at low frequency so while one may not want to invoke defrag algorithms synchronously during the page faults, it should be worth invoking defrag at least in khugepaged. However it's @@ -282,19 +334,26 @@ force Need of application restart =========================== -The transparent_hugepage/enabled values and tmpfs mount option only affect -future behavior. So to make them effective you need to restart any -application that could have been using hugepages. This also applies to the -regions registered in khugepaged. +The transparent_hugepage/enabled and +transparent_hugepage/hugepages-<size>kB/enabled values and tmpfs mount +option only affect future behavior. So to make them effective you need +to restart any application that could have been using hugepages. This +also applies to the regions registered in khugepaged. Monitoring usage ================ -The number of anonymous transparent huge pages currently used by the +.. note:: + Currently the below counters only record events relating to + PMD-sized THP. Events relating to other THP sizes are not included. + +The number of PMD-sized anonymous transparent huge pages currently used by the system is available by reading the AnonHugePages field in ``/proc/meminfo``. -To identify what applications are using anonymous transparent huge pages, -it is necessary to read ``/proc/PID/smaps`` and count the AnonHugePages fields -for each mapping. +To identify what applications are using PMD-sized anonymous transparent huge +pages, it is necessary to read ``/proc/PID/smaps`` and count the AnonHugePages +fields for each mapping. (Note that AnonHugePages only applies to traditional +PMD-sized THP for historical reasons and should have been called +AnonHugePmdMapped). The number of file transparent huge pages mapped to userspace is available by reading ShmemPmdMapped and ShmemHugePages fields in ``/proc/meminfo``. @@ -413,7 +472,7 @@ for huge pages. Optimizing the applications =========================== -To be guaranteed that the kernel will map a 2M page immediately in any +To be guaranteed that the kernel will map a THP immediately in any memory region, the mmap region has to be hugepage naturally aligned. posix_memalign() can provide that guarantee. diff --git a/Documentation/admin-guide/mm/userfaultfd.rst b/Documentation/admin-guide/mm/userfaultfd.rst index 203e26da5f92..e5cc8848dcb3 100644 --- a/Documentation/admin-guide/mm/userfaultfd.rst +++ b/Documentation/admin-guide/mm/userfaultfd.rst @@ -113,6 +113,9 @@ events, except page fault notifications, may be generated: areas. ``UFFD_FEATURE_MINOR_SHMEM`` is the analogous feature indicating support for shmem virtual memory areas. +- ``UFFD_FEATURE_MOVE`` indicates that the kernel supports moving an + existing page contents from userspace. + The userland application should set the feature flags it intends to use when invoking the ``UFFDIO_API`` ioctl, to request that those features be enabled if supported. diff --git a/Documentation/admin-guide/mm/zswap.rst b/Documentation/admin-guide/mm/zswap.rst index 45b98390e938..b42132969e31 100644 --- a/Documentation/admin-guide/mm/zswap.rst +++ b/Documentation/admin-guide/mm/zswap.rst @@ -153,6 +153,26 @@ attribute, e. g.:: Setting this parameter to 100 will disable the hysteresis. +Some users cannot tolerate the swapping that comes with zswap store failures +and zswap writebacks. Swapping can be disabled entirely (without disabling +zswap itself) on a cgroup-basis as follows: + + echo 0 > /sys/fs/cgroup/<cgroup-name>/memory.zswap.writeback + +Note that if the store failures are recurring (for e.g if the pages are +incompressible), users can observe reclaim inefficiency after disabling +writeback (because the same pages might be rejected again and again). + +When there is a sizable amount of cold memory residing in the zswap pool, it +can be advantageous to proactively write these cold pages to swap and reclaim +the memory for other use cases. By default, the zswap shrinker is disabled. +User can enable it as follows: + + echo Y > /sys/module/zswap/parameters/shrinker_enabled + +This can be enabled at the boot time if ``CONFIG_ZSWAP_SHRINKER_DEFAULT_ON`` is +selected. + A debugfs interface is provided for various statistic about pool size, number of pages stored, same-value filled pages and various counters for the reasons pages are rejected. diff --git a/Documentation/admin-guide/perf/dwc_pcie_pmu.rst b/Documentation/admin-guide/perf/dwc_pcie_pmu.rst new file mode 100644 index 000000000000..d47cd229d710 --- /dev/null +++ b/Documentation/admin-guide/perf/dwc_pcie_pmu.rst @@ -0,0 +1,94 @@ +====================================================================== +Synopsys DesignWare Cores (DWC) PCIe Performance Monitoring Unit (PMU) +====================================================================== + +DesignWare Cores (DWC) PCIe PMU +=============================== + +The PMU is a PCIe configuration space register block provided by each PCIe Root +Port in a Vendor-Specific Extended Capability named RAS D.E.S (Debug, Error +injection, and Statistics). + +As the name indicates, the RAS DES capability supports system level +debugging, AER error injection, and collection of statistics. To facilitate +collection of statistics, Synopsys DesignWare Cores PCIe controller +provides the following two features: + +- one 64-bit counter for Time Based Analysis (RX/TX data throughput and + time spent in each low-power LTSSM state) and +- one 32-bit counter for Event Counting (error and non-error events for + a specified lane) + +Note: There is no interrupt for counter overflow. + +Time Based Analysis +------------------- + +Using this feature you can obtain information regarding RX/TX data +throughput and time spent in each low-power LTSSM state by the controller. +The PMU measures data in two categories: + +- Group#0: Percentage of time the controller stays in LTSSM states. +- Group#1: Amount of data processed (Units of 16 bytes). + +Lane Event counters +------------------- + +Using this feature you can obtain Error and Non-Error information in +specific lane by the controller. The PMU event is selected by all of: + +- Group i +- Event j within the Group i +- Lane k + +Some of the events only exist for specific configurations. + +DesignWare Cores (DWC) PCIe PMU Driver +======================================= + +This driver adds PMU devices for each PCIe Root Port named based on the BDF of +the Root Port. For example, + + 30:03.0 PCI bridge: Device 1ded:8000 (rev 01) + +the PMU device name for this Root Port is dwc_rootport_3018. + +The DWC PCIe PMU driver registers a perf PMU driver, which provides +description of available events and configuration options in sysfs, see +/sys/bus/event_source/devices/dwc_rootport_{bdf}. + +The "format" directory describes format of the config fields of the +perf_event_attr structure. The "events" directory provides configuration +templates for all documented events. For example, +"Rx_PCIe_TLP_Data_Payload" is an equivalent of "eventid=0x22,type=0x1". + +The "perf list" command shall list the available events from sysfs, e.g.:: + + $# perf list | grep dwc_rootport + <...> + dwc_rootport_3018/Rx_PCIe_TLP_Data_Payload/ [Kernel PMU event] + <...> + dwc_rootport_3018/rx_memory_read,lane=?/ [Kernel PMU event] + +Time Based Analysis Event Usage +------------------------------- + +Example usage of counting PCIe RX TLP data payload (Units of bytes):: + + $# perf stat -a -e dwc_rootport_3018/Rx_PCIe_TLP_Data_Payload/ + +The average RX/TX bandwidth can be calculated using the following formula: + + PCIe RX Bandwidth = Rx_PCIe_TLP_Data_Payload / Measure_Time_Window + PCIe TX Bandwidth = Tx_PCIe_TLP_Data_Payload / Measure_Time_Window + +Lane Event Usage +------------------------------- + +Each lane has the same event set and to avoid generating a list of hundreds +of events, the user need to specify the lane ID explicitly, e.g.:: + + $# perf stat -a -e dwc_rootport_3018/rx_memory_read,lane=4/ + +The driver does not support sampling, therefore "perf record" will not +work. Per-task (without "-a") perf sessions are not supported. diff --git a/Documentation/admin-guide/perf/imx-ddr.rst b/Documentation/admin-guide/perf/imx-ddr.rst index 90926d0fb8ec..77418ae5a290 100644 --- a/Documentation/admin-guide/perf/imx-ddr.rst +++ b/Documentation/admin-guide/perf/imx-ddr.rst @@ -13,8 +13,8 @@ is one register for each counter. Counter 0 is special in that it always counts interrupt is raised. If any other counter overflows, it continues counting, and no interrupt is raised. -The "format" directory describes format of the config (event ID) and config1 -(AXI filtering) fields of the perf_event_attr structure, see /sys/bus/event_source/ +The "format" directory describes format of the config (event ID) and config1/2 +(AXI filter setting) fields of the perf_event_attr structure, see /sys/bus/event_source/ devices/imx8_ddr0/format/. The "events" directory describes the events types hardware supported that can be used with perf tool, see /sys/bus/event_source/ devices/imx8_ddr0/events/. The "caps" directory describes filter features implemented @@ -28,12 +28,11 @@ in DDR PMU, see /sys/bus/events_source/devices/imx8_ddr0/caps/. AXI filtering is only used by CSV modes 0x41 (axid-read) and 0x42 (axid-write) to count reading or writing matches filter setting. Filter setting is various from different DRAM controller implementations, which is distinguished by quirks -in the driver. You also can dump info from userspace, filter in "caps" directory -indicates whether PMU supports AXI ID filter or not; enhanced_filter indicates -whether PMU supports enhanced AXI ID filter or not. Value 0 for un-supported, and -value 1 for supported. +in the driver. You also can dump info from userspace, "caps" directory show the +type of AXI filter (filter, enhanced_filter and super_filter). Value 0 for +un-supported, and value 1 for supported. -* With DDR_CAP_AXI_ID_FILTER quirk(filter: 1, enhanced_filter: 0). +* With DDR_CAP_AXI_ID_FILTER quirk(filter: 1, enhanced_filter: 0, super_filter: 0). Filter is defined with two configuration parts: --AXI_ID defines AxID matching value. --AXI_MASKING defines which bits of AxID are meaningful for the matching. @@ -65,7 +64,37 @@ value 1 for supported. perf stat -a -e imx8_ddr0/axid-read,axi_id=0x12/ cmd, which will monitor ARID=0x12 -* With DDR_CAP_AXI_ID_FILTER_ENHANCED quirk(filter: 1, enhanced_filter: 1). +* With DDR_CAP_AXI_ID_FILTER_ENHANCED quirk(filter: 1, enhanced_filter: 1, super_filter: 0). This is an extension to the DDR_CAP_AXI_ID_FILTER quirk which permits counting the number of bytes (as opposed to the number of bursts) from DDR read and write transactions concurrently with another set of data counters. + +* With DDR_CAP_AXI_ID_PORT_CHANNEL_FILTER quirk(filter: 0, enhanced_filter: 0, super_filter: 1). + There is a limitation in previous AXI filter, it cannot filter different IDs + at the same time as the filter is shared between counters. This quirk is the + extension of AXI ID filter. One improvement is that counter 1-3 has their own + filter, means that it supports concurrently filter various IDs. Another + improvement is that counter 1-3 supports AXI PORT and CHANNEL selection. Support + selecting address channel or data channel. + + Filter is defined with 2 configuration registers per counter 1-3. + --Counter N MASK COMP register - including AXI_ID and AXI_MASKING. + --Counter N MUX CNTL register - including AXI CHANNEL and AXI PORT. + + - 0: address channel + - 1: data channel + + PMU in DDR subsystem, only one single port0 exists, so axi_port is reserved + which should be 0. + + .. code-block:: bash + + perf stat -a -e imx8_ddr0/axid-read,axi_mask=0xMMMM,axi_id=0xDDDD,axi_channel=0xH/ cmd + perf stat -a -e imx8_ddr0/axid-write,axi_mask=0xMMMM,axi_id=0xDDDD,axi_channel=0xH/ cmd + + .. note:: + + axi_channel is inverted in userspace, and it will be reverted in driver + automatically. So that users do not need specify axi_channel if want to + monitor data channel from DDR transactions, since data channel is more + meaningful. diff --git a/Documentation/admin-guide/perf/index.rst b/Documentation/admin-guide/perf/index.rst index a2e6f2c81146..f4a4513c526f 100644 --- a/Documentation/admin-guide/perf/index.rst +++ b/Documentation/admin-guide/perf/index.rst @@ -19,6 +19,7 @@ Performance monitor support arm_dsu_pmu thunderx2-pmu alibaba_pmu + dwc_pcie_pmu nvidia-pmu meson-ddr-pmu cxl diff --git a/Documentation/admin-guide/pmf.rst b/Documentation/admin-guide/pmf.rst new file mode 100644 index 000000000000..9ee729ffc19b --- /dev/null +++ b/Documentation/admin-guide/pmf.rst @@ -0,0 +1,24 @@ +.. SPDX-License-Identifier: GPL-2.0 + +Set udev rules for PMF Smart PC Builder +--------------------------------------- + +AMD PMF(Platform Management Framework) Smart PC Solution builder has to set the system states +like S0i3, Screen lock, hibernate etc, based on the output actions provided by the PMF +TA (Trusted Application). + +In order for this to work the PMF driver generates a uevent for userspace to react to. Below are +sample udev rules that can facilitate this experience when a machine has PMF Smart PC solution builder +enabled. + +Please add the following line(s) to +``/etc/udev/rules.d/99-local.rules``:: + + DRIVERS=="amd-pmf", ACTION=="change", ENV{EVENT_ID}=="0", RUN+="/usr/bin/systemctl suspend" + DRIVERS=="amd-pmf", ACTION=="change", ENV{EVENT_ID}=="1", RUN+="/usr/bin/systemctl hibernate" + DRIVERS=="amd-pmf", ACTION=="change", ENV{EVENT_ID}=="2", RUN+="/bin/loginctl lock-sessions" + +EVENT_ID values: +0= Put the system to S0i3/S2Idle +1= Put the system to hibernate +2= Lock the screen diff --git a/Documentation/arch/arm64/arm-acpi.rst b/Documentation/arch/arm64/arm-acpi.rst index a46c34fa9604..e59e4505d0d9 100644 --- a/Documentation/arch/arm64/arm-acpi.rst +++ b/Documentation/arch/arm64/arm-acpi.rst @@ -130,7 +130,7 @@ When an Arm system boots, it can either have DT information, ACPI tables, or in some very unusual cases, both. If no command line parameters are used, the kernel will try to use DT for device enumeration; if there is no DT present, the kernel will try to use ACPI tables, but only if they are present. -In neither is available, the kernel will not boot. If acpi=force is used +If neither is available, the kernel will not boot. If acpi=force is used on the command line, the kernel will attempt to use ACPI tables first, but fall back to DT if there are no ACPI tables present. The basic idea is that the kernel will not fail to boot unless it absolutely has no other choice. diff --git a/Documentation/arch/arm64/perf.rst b/Documentation/arch/arm64/perf.rst index 1f87b57c2332..997fd716b82f 100644 --- a/Documentation/arch/arm64/perf.rst +++ b/Documentation/arch/arm64/perf.rst @@ -164,3 +164,75 @@ and should be used to mask the upper bits as needed. https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf/arch/arm64/tests/user-events.c .. _tools/lib/perf/tests/test-evsel.c: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/lib/perf/tests/test-evsel.c + +Event Counting Threshold +========================================== + +Overview +-------- + +FEAT_PMUv3_TH (Armv8.8) permits a PMU counter to increment only on +events whose count meets a specified threshold condition. For example if +threshold_compare is set to 2 ('Greater than or equal'), and the +threshold is set to 2, then the PMU counter will now only increment by +when an event would have previously incremented the PMU counter by 2 or +more on a single processor cycle. + +To increment by 1 after passing the threshold condition instead of the +number of events on that cycle, add the 'threshold_count' option to the +commandline. + +How-to +------ + +These are the parameters for controlling the feature: + +.. list-table:: + :header-rows: 1 + + * - Parameter + - Description + * - threshold + - Value to threshold the event by. A value of 0 means that + thresholding is disabled and the other parameters have no effect. + * - threshold_compare + - | Comparison function to use, with the following values supported: + | + | 0: Not-equal + | 1: Equals + | 2: Greater-than-or-equal + | 3: Less-than + * - threshold_count + - If this is set, count by 1 after passing the threshold condition + instead of the value of the event on this cycle. + +The threshold, threshold_compare and threshold_count values can be +provided per event, for example: + +.. code-block:: sh + + perf stat -e stall_slot/threshold=2,threshold_compare=2/ \ + -e dtlb_walk/threshold=10,threshold_compare=3,threshold_count/ + +In this example the stall_slot event will count by 2 or more on every +cycle where 2 or more stalls happen. And dtlb_walk will count by 1 on +every cycle where the number of dtlb walks were less than 10. + +The maximum supported threshold value can be read from the caps of each +PMU, for example: + +.. code-block:: sh + + cat /sys/bus/event_source/devices/armv8_pmuv3/caps/threshold_max + + 0x000000ff + +If a value higher than this is given, then opening the event will result +in an error. The highest possible maximum is 4095, as the config field +for threshold is limited to 12 bits, and the Perf tool will refuse to +parse higher values. + +If the PMU doesn't support FEAT_PMUv3_TH, then threshold_max will read +0, and attempting to set a threshold value will also result in an error. +threshold_max will also read as 0 on aarch32 guests, even if the host +is running on hardware with the feature. diff --git a/Documentation/arch/x86/cpuinfo.rst b/Documentation/arch/x86/cpuinfo.rst index 08246e8ac835..8895784d4784 100644 --- a/Documentation/arch/x86/cpuinfo.rst +++ b/Documentation/arch/x86/cpuinfo.rst @@ -7,27 +7,74 @@ x86 Feature Flags Introduction ============ -On x86, flags appearing in /proc/cpuinfo have an X86_FEATURE definition -in arch/x86/include/asm/cpufeatures.h. If the kernel cares about a feature -or KVM want to expose the feature to a KVM guest, it can and should have -an X86_FEATURE_* defined. These flags represent hardware features as -well as software features. - -If users want to know if a feature is available on a given system, they -try to find the flag in /proc/cpuinfo. If a given flag is present, it -means that the kernel supports it and is currently making it available. -If such flag represents a hardware feature, it also means that the -hardware supports it. - -If the expected flag does not appear in /proc/cpuinfo, things are murkier. -Users need to find out the reason why the flag is missing and find the way -how to enable it, which is not always easy. There are several factors that -can explain missing flags: the expected feature failed to enable, the feature -is missing in hardware, platform firmware did not enable it, the feature is -disabled at build or run time, an old kernel is in use, or the kernel does -not support the feature and thus has not enabled it. In general, /proc/cpuinfo -shows features which the kernel supports. For a full list of CPUID flags -which the CPU supports, use tools/arch/x86/kcpuid. +The list of feature flags in /proc/cpuinfo is not complete and +represents an ill-fated attempt from long time ago to put feature flags +in an easy to find place for userspace. + +However, the amount of feature flags is growing by the CPU generation, +leading to unparseable and unwieldy /proc/cpuinfo. + +What is more, those feature flags do not even need to be in that file +because userspace doesn't care about them - glibc et al already use +CPUID to find out what the target machine supports and what not. + +And even if it doesn't show a particular feature flag - although the CPU +still does have support for the respective hardware functionality and +said CPU supports CPUID faulting - userspace can simply probe for the +feature and figure out if it is supported or not, regardless of whether +it is being advertised somewhere. + +Furthermore, those flag strings become an ABI the moment they appear +there and maintaining them forever when nothing even uses them is a lot +of wasted effort. + +So, the current use of /proc/cpuinfo is to show features which the +kernel has *enabled* and *supports*. As in: the CPUID feature flag is +there, there's an additional setup which the kernel has done while +booting and the functionality is ready to use. A perfect example for +that is "user_shstk" where additional code enablement is present in the +kernel to support shadow stack for user programs. + +So, if users want to know if a feature is available on a given system, +they try to find the flag in /proc/cpuinfo. If a given flag is present, +it means that + +* the kernel knows about the feature enough to have an X86_FEATURE bit + +* the kernel supports it and is currently making it available either to + userspace or some other part of the kernel + +* if the flag represents a hardware feature the hardware supports it. + +The absence of a flag in /proc/cpuinfo by itself means almost nothing to +an end user. + +On the one hand, a feature like "vaes" might be fully available to user +applications on a kernel that has not defined X86_FEATURE_VAES and thus +there is no "vaes" in /proc/cpuinfo. + +On the other hand, a new kernel running on non-VAES hardware would also +have no "vaes" in /proc/cpuinfo. There's no way for an application or +user to tell the difference. + +The end result is that the flags field in /proc/cpuinfo is marginally +useful for kernel debugging, but not really for anything else. +Applications should instead use things like the glibc facilities for +querying CPU support. Users should rely on tools like +tools/arch/x86/kcpuid and cpuid(1). + +Regarding implementation, flags appearing in /proc/cpuinfo have an +X86_FEATURE definition in arch/x86/include/asm/cpufeatures.h. These flags +represent hardware features as well as software features. + +If the kernel cares about a feature or KVM want to expose the feature to +a KVM guest, it should only then expose it to the guest when the guest +needs to parse /proc/cpuinfo. Which, as mentioned above, is highly +unlikely. KVM can synthesize the CPUID bit and the KVM guest can simply +query CPUID and figure out what the hypervisor supports and what not. As +already stated, /proc/cpuinfo is not a dumping ground for useless +feature flags. + How are feature flags created? ============================== diff --git a/Documentation/arch/x86/pti.rst b/Documentation/arch/x86/pti.rst index 4b858a9bad8d..e08d35177bc0 100644 --- a/Documentation/arch/x86/pti.rst +++ b/Documentation/arch/x86/pti.rst @@ -81,11 +81,9 @@ this protection comes at a cost: and exit (it can be skipped when the kernel is interrupted, though.) Moves to CR3 are on the order of a hundred cycles, and are required at every entry and exit. - b. A "trampoline" must be used for SYSCALL entry. This - trampoline depends on a smaller set of resources than the - non-PTI SYSCALL entry code, so requires mapping fewer - things into the userspace page tables. The downside is - that stacks must be switched at entry time. + b. Percpu TSS is mapped into the user page tables to allow SYSCALL64 path + to work under PTI. This doesn't have a direct runtime cost but it can + be argued it opens certain timing attack scenarios. c. Global pages are disabled for all kernel structures not mapped into both kernel and userspace page tables. This feature of the MMU allows different processes to share TLB @@ -167,7 +165,7 @@ that are worth noting here. * Failures of the selftests/x86 code. Usually a bug in one of the more obscure corners of entry_64.S * Crashes in early boot, especially around CPU bringup. Bugs - in the trampoline code or mappings cause these. + in the mappings cause these. * Crashes at the first interrupt. Caused by bugs in entry_64.S, like screwing up a page table switch. Also caused by incorrectly mapping the IRQ handler entry code. diff --git a/Documentation/core-api/maple_tree.rst b/Documentation/core-api/maple_tree.rst index 96f3d5f076b5..ccdd1615cf97 100644 --- a/Documentation/core-api/maple_tree.rst +++ b/Documentation/core-api/maple_tree.rst @@ -81,6 +81,9 @@ section. Sometimes it is necessary to ensure the next call to store to a maple tree does not allocate memory, please see :ref:`maple-tree-advanced-api` for this use case. +You can use mtree_dup() to duplicate an entire maple tree. It is a more +efficient way than inserting all elements one by one into a new tree. + Finally, you can remove all entries from a maple tree by calling mtree_destroy(). If the maple tree entries are pointers, you may wish to free the entries first. @@ -112,6 +115,7 @@ Takes ma_lock internally: * mtree_insert() * mtree_insert_range() * mtree_erase() + * mtree_dup() * mtree_destroy() * mt_set_in_rcu() * mt_clear_in_rcu() diff --git a/Documentation/core-api/mm-api.rst b/Documentation/core-api/mm-api.rst index 2d091c873d1e..af8151db88b2 100644 --- a/Documentation/core-api/mm-api.rst +++ b/Documentation/core-api/mm-api.rst @@ -37,7 +37,7 @@ The Slab Cache .. kernel-doc:: include/linux/slab.h :internal: -.. kernel-doc:: mm/slab.c +.. kernel-doc:: mm/slub.c :export: .. kernel-doc:: mm/slab_common.c diff --git a/Documentation/crypto/device_drivers/index.rst b/Documentation/crypto/device_drivers/index.rst new file mode 100644 index 000000000000..c81d311ac61b --- /dev/null +++ b/Documentation/crypto/device_drivers/index.rst @@ -0,0 +1,9 @@ +.. SPDX-License-Identifier: GPL-2.0 + +Hardware Device Driver Specific Documentation +--------------------------------------------- + +.. toctree:: + :maxdepth: 1 + + octeontx2 diff --git a/Documentation/crypto/device_drivers/octeontx2.rst b/Documentation/crypto/device_drivers/octeontx2.rst new file mode 100644 index 000000000000..7e469b173ac8 --- /dev/null +++ b/Documentation/crypto/device_drivers/octeontx2.rst @@ -0,0 +1,25 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========================= +octeontx2 devlink support +========================= + +This document describes the devlink features implemented by the ``octeontx2 CPT`` +device drivers. + +Parameters +========== + +The ``octeontx2`` driver implements the following driver-specific parameters. + +.. list-table:: Driver-specific parameters implemented + :widths: 5 5 5 85 + + * - Name + - Type + - Mode + - Description + * - ``t106_mode`` + - u8 + - runtime + - Used to configure CN10KA B0/CN10KB CPT to work as CN10KA A0/A1. diff --git a/Documentation/crypto/index.rst b/Documentation/crypto/index.rst index da5d5ad2bdf3..945ca1505ad9 100644 --- a/Documentation/crypto/index.rst +++ b/Documentation/crypto/index.rst @@ -28,3 +28,4 @@ for cryptographic use cases, as well as programming examples. api api-samples descore-readme + device_drivers/index diff --git a/Documentation/dev-tools/kunit/api/resource.rst b/Documentation/dev-tools/kunit/api/resource.rst index 0a94f831259e..ec6002a6b0db 100644 --- a/Documentation/dev-tools/kunit/api/resource.rst +++ b/Documentation/dev-tools/kunit/api/resource.rst @@ -11,3 +11,12 @@ state on a per-test basis, register custom cleanup actions, and more. .. kernel-doc:: include/kunit/resource.h :internal: + +Managed Devices +--------------- + +Functions for using KUnit-managed struct device and struct device_driver. +Include ``kunit/device.h`` to use these. + +.. kernel-doc:: include/kunit/device.h + :internal: diff --git a/Documentation/dev-tools/kunit/run_manual.rst b/Documentation/dev-tools/kunit/run_manual.rst index e7b46421f247..699d92885075 100644 --- a/Documentation/dev-tools/kunit/run_manual.rst +++ b/Documentation/dev-tools/kunit/run_manual.rst @@ -49,9 +49,52 @@ loaded. The results will appear in TAP format in ``dmesg``. +debugfs +======= + +KUnit can be accessed from userspace via the debugfs filesystem (See more +information about debugfs at Documentation/filesystems/debugfs.rst). + +If ``CONFIG_KUNIT_DEBUGFS`` is enabled, the KUnit debugfs filesystem is +mounted at /sys/kernel/debug/kunit. You can use this filesystem to perform +the following actions. + +Retrieve Test Results +===================== + +You can use debugfs to retrieve KUnit test results. The test results are +accessible from the debugfs filesystem in the following read-only file: + +.. code-block :: bash + + /sys/kernel/debug/kunit/<test_suite>/results + +The test results are printed in a KTAP document. Note this document is separate +to the kernel log and thus, may have different test suite numbering. + +Run Tests After Kernel Has Booted +================================= + +You can use the debugfs filesystem to trigger built-in tests to run after +boot. To run the test suite, you can use the following command to write to +the ``/sys/kernel/debug/kunit/<test_suite>/run`` file: + +.. code-block :: bash + + echo "any string" > /sys/kernel/debugfs/kunit/<test_suite>/run + +As a result, the test suite runs and the results are printed to the kernel +log. + +However, this feature is not available with KUnit suites that use init data, +because init data may have been discarded after the kernel boots. KUnit +suites that use init data should be defined using the +kunit_test_init_section_suites() macro. + +Also, you cannot use this feature to run tests concurrently. Instead a test +will wait to run until other tests have completed or failed. + .. note :: - If ``CONFIG_KUNIT_DEBUGFS`` is enabled, KUnit test results will - be accessible from the ``debugfs`` filesystem (if mounted). - They will be in ``/sys/kernel/debug/kunit/<test_suite>/results``, in - TAP format. + For test authors, to use this feature, tests will need to correctly initialise + and/or clean up any data, so the test runs correctly a second time. diff --git a/Documentation/dev-tools/kunit/running_tips.rst b/Documentation/dev-tools/kunit/running_tips.rst index 766f9cdea0fa..024e9ad1d1e9 100644 --- a/Documentation/dev-tools/kunit/running_tips.rst +++ b/Documentation/dev-tools/kunit/running_tips.rst @@ -428,3 +428,10 @@ This attribute indicates the name of the module associated with the test. This attribute is automatically saved as a string and is printed for each suite. Tests can also be filtered using this attribute. + +``is_init`` + +This attribute indicates whether the test uses init data or functions. + +This attribute is automatically saved as a boolean and tests can also be +filtered using this attribute. diff --git a/Documentation/dev-tools/kunit/usage.rst b/Documentation/dev-tools/kunit/usage.rst index c27e1646ecd9..53c6f7dc8a42 100644 --- a/Documentation/dev-tools/kunit/usage.rst +++ b/Documentation/dev-tools/kunit/usage.rst @@ -651,12 +651,16 @@ For example: } Note that, for functions like device_unregister which only accept a single -pointer-sized argument, it's possible to directly cast that function to -a ``kunit_action_t`` rather than writing a wrapper function, for example: +pointer-sized argument, it's possible to automatically generate a wrapper +with the ``KUNIT_DEFINE_ACTION_WRAPPER()`` macro, for example: .. code-block:: C - kunit_add_action(test, (kunit_action_t *)&device_unregister, &dev); + KUNIT_DEFINE_ACTION_WRAPPER(device_unregister, device_unregister_wrapper, struct device *); + kunit_add_action(test, &device_unregister_wrapper, &dev); + +You should do this in preference to manually casting to the ``kunit_action_t`` type, +as casting function pointers will break Control Flow Integrity (CFI). ``kunit_add_action`` can fail if, for example, the system is out of memory. You can use ``kunit_add_action_or_reset`` instead which runs the action @@ -793,3 +797,53 @@ structures as shown below: KUnit is not enabled, or if no test is running in the current task, it will do nothing. This compiles down to either a no-op or a static key check, so will have a negligible performance impact when no test is running. + +Managing Fake Devices and Drivers +--------------------------------- + +When testing drivers or code which interacts with drivers, many functions will +require a ``struct device`` or ``struct device_driver``. In many cases, setting +up a real device is not required to test any given function, so a fake device +can be used instead. + +KUnit provides helper functions to create and manage these fake devices, which +are internally of type ``struct kunit_device``, and are attached to a special +``kunit_bus``. These devices support managed device resources (devres), as +described in Documentation/driver-api/driver-model/devres.rst + +To create a KUnit-managed ``struct device_driver``, use ``kunit_driver_create()``, +which will create a driver with the given name, on the ``kunit_bus``. This driver +will automatically be destroyed when the corresponding test finishes, but can also +be manually destroyed with ``driver_unregister()``. + +To create a fake device, use the ``kunit_device_register()``, which will create +and register a device, using a new KUnit-managed driver created with ``kunit_driver_create()``. +To provide a specific, non-KUnit-managed driver, use ``kunit_device_register_with_driver()`` +instead. Like with managed drivers, KUnit-managed fake devices are automatically +cleaned up when the test finishes, but can be manually cleaned up early with +``kunit_device_unregister()``. + +The KUnit devices should be used in preference to ``root_device_register()``, and +instead of ``platform_device_register()`` in cases where the device is not otherwise +a platform device. + +For example: + +.. code-block:: c + + #include <kunit/device.h> + + static void test_my_device(struct kunit *test) + { + struct device *fake_device; + const char *dev_managed_string; + + // Create a fake device. + fake_device = kunit_device_register(test, "my_device"); + KUNIT_ASSERT_NOT_ERR_OR_NULL(test, fake_device) + + // Pass it to functions which need a device. + dev_managed_string = devm_kstrdup(fake_device, "Hello, World!"); + + // Everything is cleaned up automatically when the test ends. + }
\ No newline at end of file diff --git a/Documentation/devicetree/bindings/crypto/inside-secure,safexcel.yaml b/Documentation/devicetree/bindings/crypto/inside-secure,safexcel.yaml new file mode 100644 index 000000000000..ef07258d16c1 --- /dev/null +++ b/Documentation/devicetree/bindings/crypto/inside-secure,safexcel.yaml @@ -0,0 +1,86 @@ +# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/crypto/inside-secure,safexcel.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: Inside Secure SafeXcel cryptographic engine + +maintainers: + - Antoine Tenart <atenart@kernel.org> + +properties: + compatible: + oneOf: + - const: inside-secure,safexcel-eip197b + - const: inside-secure,safexcel-eip197d + - const: inside-secure,safexcel-eip97ies + - const: inside-secure,safexcel-eip197 + description: Equivalent of inside-secure,safexcel-eip197b + deprecated: true + - const: inside-secure,safexcel-eip97 + description: Equivalent of inside-secure,safexcel-eip97ies + deprecated: true + + reg: + maxItems: 1 + + interrupts: + maxItems: 6 + + interrupt-names: + items: + - const: ring0 + - const: ring1 + - const: ring2 + - const: ring3 + - const: eip + - const: mem + + clocks: + minItems: 1 + maxItems: 2 + + clock-names: + minItems: 1 + items: + - const: core + - const: reg + +required: + - reg + - interrupts + - interrupt-names + +allOf: + - if: + properties: + clocks: + minItems: 2 + then: + properties: + clock-names: + minItems: 2 + required: + - clock-names + +additionalProperties: false + +examples: + - | + #include <dt-bindings/interrupt-controller/arm-gic.h> + #include <dt-bindings/interrupt-controller/irq.h> + + crypto@800000 { + compatible = "inside-secure,safexcel-eip197b"; + reg = <0x800000 0x200000>; + interrupts = <GIC_SPI 54 IRQ_TYPE_LEVEL_HIGH>, + <GIC_SPI 55 IRQ_TYPE_LEVEL_HIGH>, + <GIC_SPI 56 IRQ_TYPE_LEVEL_HIGH>, + <GIC_SPI 57 IRQ_TYPE_LEVEL_HIGH>, + <GIC_SPI 58 IRQ_TYPE_LEVEL_HIGH>, + <GIC_SPI 34 IRQ_TYPE_LEVEL_HIGH>; + interrupt-names = "ring0", "ring1", "ring2", "ring3", "eip", "mem"; + clocks = <&cpm_syscon0 1 26>; + clock-names = "core"; + }; diff --git a/Documentation/devicetree/bindings/crypto/inside-secure-safexcel.txt b/Documentation/devicetree/bindings/crypto/inside-secure-safexcel.txt deleted file mode 100644 index 3bbf144c9988..000000000000 --- a/Documentation/devicetree/bindings/crypto/inside-secure-safexcel.txt +++ /dev/null @@ -1,40 +0,0 @@ -Inside Secure SafeXcel cryptographic engine - -Required properties: -- compatible: Should be "inside-secure,safexcel-eip197b", - "inside-secure,safexcel-eip197d" or - "inside-secure,safexcel-eip97ies". -- reg: Base physical address of the engine and length of memory mapped region. -- interrupts: Interrupt numbers for the rings and engine. -- interrupt-names: Should be "ring0", "ring1", "ring2", "ring3", "eip", "mem". - -Optional properties: -- clocks: Reference to the crypto engine clocks, the second clock is - needed for the Armada 7K/8K SoCs. -- clock-names: mandatory if there is a second clock, in this case the - name must be "core" for the first clock and "reg" for - the second one. - -Backward compatibility: -Two compatibles are kept for backward compatibility, but shouldn't be used for -new submissions: -- "inside-secure,safexcel-eip197" is equivalent to - "inside-secure,safexcel-eip197b". -- "inside-secure,safexcel-eip97" is equivalent to - "inside-secure,safexcel-eip97ies". - -Example: - - crypto: crypto@800000 { - compatible = "inside-secure,safexcel-eip197b"; - reg = <0x800000 0x200000>; - interrupts = <GIC_SPI 34 IRQ_TYPE_LEVEL_HIGH>, - <GIC_SPI 54 IRQ_TYPE_LEVEL_HIGH>, - <GIC_SPI 55 IRQ_TYPE_LEVEL_HIGH>, - <GIC_SPI 56 IRQ_TYPE_LEVEL_HIGH>, - <GIC_SPI 57 IRQ_TYPE_LEVEL_HIGH>, - <GIC_SPI 58 IRQ_TYPE_LEVEL_HIGH>; - interrupt-names = "mem", "ring0", "ring1", "ring2", "ring3", - "eip"; - clocks = <&cpm_syscon0 1 26>; - }; diff --git a/Documentation/devicetree/bindings/crypto/qcom,inline-crypto-engine.yaml b/Documentation/devicetree/bindings/crypto/qcom,inline-crypto-engine.yaml index ca4f7d1cefaa..09e43157cc71 100644 --- a/Documentation/devicetree/bindings/crypto/qcom,inline-crypto-engine.yaml +++ b/Documentation/devicetree/bindings/crypto/qcom,inline-crypto-engine.yaml @@ -16,6 +16,7 @@ properties: - qcom,sa8775p-inline-crypto-engine - qcom,sm8450-inline-crypto-engine - qcom,sm8550-inline-crypto-engine + - qcom,sm8650-inline-crypto-engine - const: qcom,inline-crypto-engine reg: diff --git a/Documentation/devicetree/bindings/crypto/qcom,prng.yaml b/Documentation/devicetree/bindings/crypto/qcom,prng.yaml index 13070db0f70c..89c88004b41b 100644 --- a/Documentation/devicetree/bindings/crypto/qcom,prng.yaml +++ b/Documentation/devicetree/bindings/crypto/qcom,prng.yaml @@ -21,6 +21,7 @@ properties: - qcom,sc7280-trng - qcom,sm8450-trng - qcom,sm8550-trng + - qcom,sm8650-trng - const: qcom,trng reg: diff --git a/Documentation/devicetree/bindings/crypto/qcom-qce.yaml b/Documentation/devicetree/bindings/crypto/qcom-qce.yaml index 8e665d910e6e..a48bd381063a 100644 --- a/Documentation/devicetree/bindings/crypto/qcom-qce.yaml +++ b/Documentation/devicetree/bindings/crypto/qcom-qce.yaml @@ -44,10 +44,12 @@ properties: - items: - enum: + - qcom,sc7280-qce - qcom,sm8250-qce - qcom,sm8350-qce - qcom,sm8450-qce - qcom,sm8550-qce + - qcom,sm8650-qce - const: qcom,sm8150-qce - const: qcom,qce @@ -96,6 +98,7 @@ allOf: - qcom,crypto-v5.4 - qcom,ipq6018-qce - qcom,ipq8074-qce + - qcom,ipq9574-qce - qcom,msm8996-qce - qcom,sdm845-qce then: @@ -129,6 +132,17 @@ allOf: - clocks - clock-names + - if: + properties: + compatible: + contains: + enum: + - qcom,sm8150-qce + then: + properties: + clocks: false + clock-names: false + required: - compatible - reg diff --git a/Documentation/devicetree/bindings/display/panel/panel-simple-dsi.yaml b/Documentation/devicetree/bindings/display/panel/panel-simple-dsi.yaml index 73674baea75d..f9160d7bac3c 100644 --- a/Documentation/devicetree/bindings/display/panel/panel-simple-dsi.yaml +++ b/Documentation/devicetree/bindings/display/panel/panel-simple-dsi.yaml @@ -42,6 +42,8 @@ properties: - lg,acx467akm-7 # LG Corporation 7" WXGA TFT LCD panel - lg,ld070wx3-sl01 + # LG Corporation 5" HD TFT LCD panel + - lg,lh500wx1-sd03 # One Stop Displays OSD101T2587-53TS 10.1" 1920x1200 panel - osddisplays,osd101t2587-53ts # Panasonic 10" WUXGA TFT LCD panel diff --git a/Documentation/devicetree/bindings/display/panel/panel-simple.yaml b/Documentation/devicetree/bindings/display/panel/panel-simple.yaml index 3ec9ee95045f..11422af3477e 100644 --- a/Documentation/devicetree/bindings/display/panel/panel-simple.yaml +++ b/Documentation/devicetree/bindings/display/panel/panel-simple.yaml @@ -208,8 +208,6 @@ properties: - lemaker,bl035-rgb-002 # LG 7" (800x480 pixels) TFT LCD panel - lg,lb070wv8 - # LG Corporation 5" HD TFT LCD panel - - lg,lh500wx1-sd03 # LG LP079QX1-SP0V 7.9" (1536x2048 pixels) TFT LCD panel - lg,lp079qx1-sp0v # LG 9.7" (2048x1536 pixels) TFT LCD panel diff --git a/Documentation/devicetree/bindings/interrupt-controller/qcom,mpm.yaml b/Documentation/devicetree/bindings/interrupt-controller/qcom,mpm.yaml index 6a206111d4e0..ebb40c48950a 100644 --- a/Documentation/devicetree/bindings/interrupt-controller/qcom,mpm.yaml +++ b/Documentation/devicetree/bindings/interrupt-controller/qcom,mpm.yaml @@ -29,6 +29,12 @@ properties: maxItems: 1 description: Specifies the base address and size of vMPM registers in RPM MSG RAM. + deprecated: true + + qcom,rpm-msg-ram: + $ref: /schemas/types.yaml#/definitions/phandle + description: + Phandle to the APSS MPM slice of the RPM Message RAM interrupts: maxItems: 1 @@ -67,34 +73,46 @@ properties: required: - compatible - - reg - interrupts - mboxes - interrupt-controller - '#interrupt-cells' - qcom,mpm-pin-count - qcom,mpm-pin-map + - qcom,rpm-msg-ram additionalProperties: false examples: - | #include <dt-bindings/interrupt-controller/arm-gic.h> - mpm: interrupt-controller@45f01b8 { - compatible = "qcom,mpm"; - interrupts = <GIC_SPI 197 IRQ_TYPE_EDGE_RISING>; - reg = <0x45f01b8 0x1000>; - mboxes = <&apcs_glb 1>; - interrupt-controller; - #interrupt-cells = <2>; - interrupt-parent = <&intc>; - qcom,mpm-pin-count = <96>; - qcom,mpm-pin-map = <2 275>, - <5 296>, - <12 422>, - <24 79>, - <86 183>, - <90 260>, - <91 260>; - #power-domain-cells = <0>; + + remoteproc-rpm { + compatible = "qcom,msm8998-rpm-proc", "qcom,rpm-proc"; + + glink-edge { + compatible = "qcom,glink-rpm"; + + interrupts = <GIC_SPI 168 IRQ_TYPE_EDGE_RISING>; + qcom,rpm-msg-ram = <&rpm_msg_ram>; + mboxes = <&apcs_glb 0>; + }; + + mpm: interrupt-controller { + compatible = "qcom,mpm"; + qcom,rpm-msg-ram = <&apss_mpm>; + interrupts = <GIC_SPI 197 IRQ_TYPE_EDGE_RISING>; + mboxes = <&apcs_glb 1>; + interrupt-controller; + #interrupt-cells = <2>; + interrupt-parent = <&intc>; + qcom,mpm-pin-count = <96>; + qcom,mpm-pin-map = <2 275>, + <5 296>, + <12 422>, + <24 79>, + <86 183>, + <91 260>; + #power-domain-cells = <0>; + }; }; diff --git a/Documentation/devicetree/bindings/interrupt-controller/renesas,rzg2l-irqc.yaml b/Documentation/devicetree/bindings/interrupt-controller/renesas,rzg2l-irqc.yaml index 2ef3081eaaf3..d3b5aec0a3f7 100644 --- a/Documentation/devicetree/bindings/interrupt-controller/renesas,rzg2l-irqc.yaml +++ b/Documentation/devicetree/bindings/interrupt-controller/renesas,rzg2l-irqc.yaml @@ -26,6 +26,7 @@ properties: - renesas,r9a07g043u-irqc # RZ/G2UL - renesas,r9a07g044-irqc # RZ/G2{L,LC} - renesas,r9a07g054-irqc # RZ/V2L + - renesas,r9a08g045-irqc # RZ/G3S - const: renesas,rzg2l-irqc '#interrupt-cells': @@ -167,7 +168,9 @@ allOf: properties: compatible: contains: - const: renesas,r9a07g043u-irqc + enum: + - renesas,r9a07g043u-irqc + - renesas,r9a08g045-irqc then: properties: interrupts: diff --git a/Documentation/devicetree/bindings/mtd/partitions/u-boot.yaml b/Documentation/devicetree/bindings/mtd/partitions/u-boot.yaml index 3c56efe48efd..327fa872c001 100644 --- a/Documentation/devicetree/bindings/mtd/partitions/u-boot.yaml +++ b/Documentation/devicetree/bindings/mtd/partitions/u-boot.yaml @@ -7,7 +7,7 @@ $schema: http://devicetree.org/meta-schemas/core.yaml# title: U-Boot bootloader partition description: | - U-Boot is a bootlodaer commonly used in embedded devices. It's almost always + U-Boot is a bootloader commonly used in embedded devices. It's almost always located on some kind of flash device. Device configuration is stored as a set of environment variables that are diff --git a/Documentation/devicetree/bindings/nvmem/mxs-ocotp.yaml b/Documentation/devicetree/bindings/nvmem/mxs-ocotp.yaml index f43186f98607..d9287be89877 100644 --- a/Documentation/devicetree/bindings/nvmem/mxs-ocotp.yaml +++ b/Documentation/devicetree/bindings/nvmem/mxs-ocotp.yaml @@ -15,9 +15,11 @@ allOf: properties: compatible: - enum: - - fsl,imx23-ocotp - - fsl,imx28-ocotp + items: + - enum: + - fsl,imx23-ocotp + - fsl,imx28-ocotp + - const: fsl,ocotp reg: maxItems: 1 @@ -35,7 +37,7 @@ unevaluatedProperties: false examples: - | ocotp: efuse@8002c000 { - compatible = "fsl,imx28-ocotp"; + compatible = "fsl,imx28-ocotp", "fsl,ocotp"; #address-cells = <1>; #size-cells = <1>; reg = <0x8002c000 0x2000>; diff --git a/Documentation/devicetree/bindings/perf/fsl-imx-ddr.yaml b/Documentation/devicetree/bindings/perf/fsl-imx-ddr.yaml index e9fad4b3de68..6c96a4204e5d 100644 --- a/Documentation/devicetree/bindings/perf/fsl-imx-ddr.yaml +++ b/Documentation/devicetree/bindings/perf/fsl-imx-ddr.yaml @@ -27,6 +27,9 @@ properties: - fsl,imx8mq-ddr-pmu - fsl,imx8mp-ddr-pmu - const: fsl,imx8m-ddr-pmu + - items: + - const: fsl,imx8dxl-ddr-pmu + - const: fsl,imx8-ddr-pmu reg: maxItems: 1 diff --git a/Documentation/devicetree/bindings/regulator/fixed-regulator.yaml b/Documentation/devicetree/bindings/regulator/fixed-regulator.yaml index ce7751b9129c..9ff9abf2691a 100644 --- a/Documentation/devicetree/bindings/regulator/fixed-regulator.yaml +++ b/Documentation/devicetree/bindings/regulator/fixed-regulator.yaml @@ -105,6 +105,8 @@ properties: description: Interrupt signaling a critical under-voltage event. + system-critical-regulator: true + required: - compatible - regulator-name diff --git a/Documentation/devicetree/bindings/regulator/qcom,rpmh-regulator.yaml b/Documentation/devicetree/bindings/regulator/qcom,rpmh-regulator.yaml index acd37f28ef53..27c6d5152413 100644 --- a/Documentation/devicetree/bindings/regulator/qcom,rpmh-regulator.yaml +++ b/Documentation/devicetree/bindings/regulator/qcom,rpmh-regulator.yaml @@ -42,6 +42,7 @@ description: | For PM7325, smps1 - smps8, ldo1 - ldo19 For PM8005, smps1 - smps4 For PM8009, smps1 - smps2, ldo1 - ldo7 + For PM8010, ldo1 - ldo7 For PM8150, smps1 - smps10, ldo1 - ldo18 For PM8150L, smps1 - smps8, ldo1 - ldo11, bob, flash, rgb For PM8350, smps1 - smps12, ldo1 - ldo10 @@ -68,6 +69,7 @@ properties: - qcom,pm8005-rpmh-regulators - qcom,pm8009-rpmh-regulators - qcom,pm8009-1-rpmh-regulators + - qcom,pm8010-rpmh-regulators - qcom,pm8150-rpmh-regulators - qcom,pm8150l-rpmh-regulators - qcom,pm8350-rpmh-regulators @@ -242,6 +244,18 @@ allOf: properties: compatible: enum: + - qcom,pm8010-rpmh-regulators + then: + properties: + vdd-l1-l2-supply: true + vdd-l3-l4-supply: true + patternProperties: + "^vdd-l[5-7]-supply$": true + + - if: + properties: + compatible: + enum: - qcom,pm8150-rpmh-regulators - qcom,pmc8180-rpmh-regulators - qcom,pmm8155au-rpmh-regulators diff --git a/Documentation/devicetree/bindings/regulator/qcom,smd-rpm-regulator.yaml b/Documentation/devicetree/bindings/regulator/qcom,smd-rpm-regulator.yaml index 9ea8ac0786ac..f2fd2df68a9e 100644 --- a/Documentation/devicetree/bindings/regulator/qcom,smd-rpm-regulator.yaml +++ b/Documentation/devicetree/bindings/regulator/qcom,smd-rpm-regulator.yaml @@ -47,6 +47,9 @@ description: For pm8916, s1, s2, s3, s4, l1, l2, l3, l4, l5, l6, l7, l8, l9, l10, l11, l12, l13, l14, l15, l16, l17, l18 + For pm8937, s1, s2, s3, s4, l1, l2, l3, l4, l5, l6, l7, l8, l9, l10, + l11, l12, l13, l14, l15, l16, l17, l18, l19, l20, l21, l22, l23 + For pm8941, s1, s2, s3, s4, l1, l2, l3, l4, l5, l6, l7, l8, l9, l10, l11, l12, l13, l14, l15, l16, l17, l18, l19, l20, l21, l22, l23, l24, lvs1, lvs2, lvs3, 5vs1, 5vs2 @@ -92,6 +95,7 @@ properties: - qcom,rpm-pm8841-regulators - qcom,rpm-pm8909-regulators - qcom,rpm-pm8916-regulators + - qcom,rpm-pm8937-regulators - qcom,rpm-pm8941-regulators - qcom,rpm-pm8950-regulators - qcom,rpm-pm8953-regulators diff --git a/Documentation/devicetree/bindings/regulator/qcom,spmi-regulator.yaml b/Documentation/devicetree/bindings/regulator/qcom,spmi-regulator.yaml index 7a1b7d2abbd4..aea849e8eadf 100644 --- a/Documentation/devicetree/bindings/regulator/qcom,spmi-regulator.yaml +++ b/Documentation/devicetree/bindings/regulator/qcom,spmi-regulator.yaml @@ -22,6 +22,7 @@ properties: - qcom,pm8841-regulators - qcom,pm8909-regulators - qcom,pm8916-regulators + - qcom,pm8937-regulators - qcom,pm8941-regulators - qcom,pm8950-regulators - qcom,pm8994-regulators @@ -296,6 +297,24 @@ allOf: compatible: contains: enum: + - qcom,pm8937-regulators + then: + properties: + vdd_l1_l19-supply: true + vdd_l20_l21-supply: true + vdd_l2_l23-supply: true + vdd_l3-supply: true + vdd_l4_l5_l6_l7_l16-supply: true + vdd_l8_l11_l12_l17_l22-supply: true + vdd_l9_l10_l13_l14_l15_l18-supply: true + patternProperties: + "^vdd_s[1-6]-supply$": true + + - if: + properties: + compatible: + contains: + enum: - qcom,pm8950-regulators then: properties: diff --git a/Documentation/devicetree/bindings/regulator/qcom,usb-vbus-regulator.yaml b/Documentation/devicetree/bindings/regulator/qcom,usb-vbus-regulator.yaml index 89c564dfa5db..534f87e98716 100644 --- a/Documentation/devicetree/bindings/regulator/qcom,usb-vbus-regulator.yaml +++ b/Documentation/devicetree/bindings/regulator/qcom,usb-vbus-regulator.yaml @@ -36,10 +36,11 @@ unevaluatedProperties: false examples: - | - pm8150b { + pmic { #address-cells = <1>; #size-cells = <0>; - pm8150b_vbus: usb-vbus-regulator@1100 { + + usb-vbus-regulator@1100 { compatible = "qcom,pm8150b-vbus-reg"; reg = <0x1100>; regulator-min-microamp = <500000>; diff --git a/Documentation/devicetree/bindings/regulator/regulator.yaml b/Documentation/devicetree/bindings/regulator/regulator.yaml index 9daf0fc2465f..1ef380d1515e 100644 --- a/Documentation/devicetree/bindings/regulator/regulator.yaml +++ b/Documentation/devicetree/bindings/regulator/regulator.yaml @@ -114,6 +114,11 @@ properties: description: Enable pull down resistor when the regulator is disabled. type: boolean + system-critical-regulator: + description: Set if the regulator is critical to system stability or + functionality. + type: boolean + regulator-over-current-protection: description: Enable over current protection. type: boolean @@ -181,6 +186,14 @@ properties: be enabled but limit setting can be omitted. Limit is given as microvolt offset from voltage set to regulator. + regulator-uv-less-critical-window-ms: + description: Specifies the time window (in milliseconds) following a + critical under-voltage event during which the system can continue to + operate safely while performing less critical operations. This property + provides a defined duration before a more severe reaction to the + under-voltage event is needed, allowing for certain non-urgent actions to + be carried out in preparation for potential power loss. + regulator-temp-protection-kelvin: description: Set over temperature protection limit. This is a limit where hardware performs emergency shutdown. Zero can be passed to disable diff --git a/Documentation/devicetree/bindings/rng/starfive,jh7110-trng.yaml b/Documentation/devicetree/bindings/rng/starfive,jh7110-trng.yaml index 2b76ce25acc4..4639247e9e51 100644 --- a/Documentation/devicetree/bindings/rng/starfive,jh7110-trng.yaml +++ b/Documentation/devicetree/bindings/rng/starfive,jh7110-trng.yaml @@ -11,7 +11,11 @@ maintainers: properties: compatible: - const: starfive,jh7110-trng + oneOf: + - items: + - const: starfive,jh8100-trng + - const: starfive,jh7110-trng + - const: starfive,jh7110-trng reg: maxItems: 1 diff --git a/Documentation/devicetree/bindings/spi/adi,axi-spi-engine.txt b/Documentation/devicetree/bindings/spi/adi,axi-spi-engine.txt deleted file mode 100644 index 8a18d71e6879..000000000000 --- a/Documentation/devicetree/bindings/spi/adi,axi-spi-engine.txt +++ /dev/null @@ -1,31 +0,0 @@ -Analog Devices AXI SPI Engine controller Device Tree Bindings - -Required properties: -- compatible : Must be "adi,axi-spi-engine-1.00.a"" -- reg : Physical base address and size of the register map. -- interrupts : Property with a value describing the interrupt - number. -- clock-names : List of input clock names - "s_axi_aclk", "spi_clk" -- clocks : Clock phandles and specifiers (See clock bindings for - details on clock-names and clocks). -- #address-cells : Must be <1> -- #size-cells : Must be <0> - -Optional subnodes: - Subnodes are use to represent the SPI slave devices connected to the SPI - master. They follow the generic SPI bindings as outlined in spi-bus.txt. - -Example: - - spi@@44a00000 { - compatible = "adi,axi-spi-engine-1.00.a"; - reg = <0x44a00000 0x1000>; - interrupts = <0 56 4>; - clocks = <&clkc 15 &clkc 15>; - clock-names = "s_axi_aclk", "spi_clk"; - - #address-cells = <1>; - #size-cells = <0>; - - /* SPI devices */ - }; diff --git a/Documentation/devicetree/bindings/spi/adi,axi-spi-engine.yaml b/Documentation/devicetree/bindings/spi/adi,axi-spi-engine.yaml new file mode 100644 index 000000000000..d48faa42d025 --- /dev/null +++ b/Documentation/devicetree/bindings/spi/adi,axi-spi-engine.yaml @@ -0,0 +1,66 @@ +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/spi/adi,axi-spi-engine.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: Analog Devices AXI SPI Engine Controller + +description: | + The AXI SPI Engine controller is part of the SPI Engine framework[1] and + allows memory mapped access to the SPI Engine control bus. This allows it + to be used as a general purpose software driven SPI controller as well as + some optional advanced acceleration and offloading capabilities. + + [1] https://wiki.analog.com/resources/fpga/peripherals/spi_engine + +maintainers: + - Michael Hennerich <Michael.Hennerich@analog.com> + - Nuno Sá <nuno.sa@analog.com> + +allOf: + - $ref: /schemas/spi/spi-controller.yaml# + +properties: + compatible: + const: adi,axi-spi-engine-1.00.a + + reg: + maxItems: 1 + + interrupts: + maxItems: 1 + + clocks: + items: + - description: The AXI interconnect clock. + - description: The SPI controller clock. + + clock-names: + items: + - const: s_axi_aclk + - const: spi_clk + +required: + - compatible + - reg + - interrupts + - clocks + - clock-names + +unevaluatedProperties: false + +examples: + - | + spi@44a00000 { + compatible = "adi,axi-spi-engine-1.00.a"; + reg = <0x44a00000 0x1000>; + interrupts = <0 56 4>; + clocks = <&clkc 15>, <&clkc 15>; + clock-names = "s_axi_aclk", "spi_clk"; + + #address-cells = <1>; + #size-cells = <0>; + + /* SPI devices */ + }; diff --git a/Documentation/devicetree/bindings/spi/renesas,rspi.yaml b/Documentation/devicetree/bindings/spi/renesas,rspi.yaml index 4d8ec69214c9..0ef3f8421986 100644 --- a/Documentation/devicetree/bindings/spi/renesas,rspi.yaml +++ b/Documentation/devicetree/bindings/spi/renesas,rspi.yaml @@ -21,7 +21,7 @@ properties: - enum: - renesas,rspi-r7s72100 # RZ/A1H - renesas,rspi-r7s9210 # RZ/A2 - - renesas,r9a07g043-rspi # RZ/G2UL + - renesas,r9a07g043-rspi # RZ/G2UL and RZ/Five - renesas,r9a07g044-rspi # RZ/G2{L,LC} - renesas,r9a07g054-rspi # RZ/V2L - const: renesas,rspi-rz diff --git a/Documentation/devicetree/bindings/spi/snps,dw-apb-ssi.yaml b/Documentation/devicetree/bindings/spi/snps,dw-apb-ssi.yaml index 6348a387a21c..fde3776a558b 100644 --- a/Documentation/devicetree/bindings/spi/snps,dw-apb-ssi.yaml +++ b/Documentation/devicetree/bindings/spi/snps,dw-apb-ssi.yaml @@ -72,8 +72,6 @@ properties: - const: snps,dw-apb-ssi - description: Intel Keem Bay SPI Controller const: intel,keembay-ssi - - description: Intel Thunder Bay SPI Controller - const: intel,thunderbay-ssi - description: Intel Mount Evans Integrated Management Complex SPI Controller const: intel,mountevans-imc-ssi - description: AMD Pensando Elba SoC SPI Controller diff --git a/Documentation/devicetree/bindings/spi/st,stm32-spi.yaml b/Documentation/devicetree/bindings/spi/st,stm32-spi.yaml index ae0f082bd377..4bd9aeb81208 100644 --- a/Documentation/devicetree/bindings/spi/st,stm32-spi.yaml +++ b/Documentation/devicetree/bindings/spi/st,stm32-spi.yaml @@ -23,7 +23,9 @@ properties: compatible: enum: - st,stm32f4-spi + - st,stm32f7-spi - st,stm32h7-spi + - st,stm32mp25-spi reg: maxItems: 1 diff --git a/Documentation/devicetree/bindings/thermal/allwinner,sun8i-a83t-ths.yaml b/Documentation/devicetree/bindings/thermal/allwinner,sun8i-a83t-ths.yaml index fbd4212285e2..9b2272a9ec15 100644 --- a/Documentation/devicetree/bindings/thermal/allwinner,sun8i-a83t-ths.yaml +++ b/Documentation/devicetree/bindings/thermal/allwinner,sun8i-a83t-ths.yaml @@ -16,6 +16,7 @@ properties: - allwinner,sun8i-a83t-ths - allwinner,sun8i-h3-ths - allwinner,sun8i-r40-ths + - allwinner,sun20i-d1-ths - allwinner,sun50i-a64-ths - allwinner,sun50i-a100-ths - allwinner,sun50i-h5-ths @@ -61,6 +62,7 @@ allOf: compatible: contains: enum: + - allwinner,sun20i-d1-ths - allwinner,sun50i-a100-ths - allwinner,sun50i-h6-ths @@ -84,7 +86,9 @@ allOf: properties: compatible: contains: - const: allwinner,sun8i-h3-ths + enum: + - allwinner,sun8i-h3-ths + - allwinner,sun20i-d1-ths then: properties: @@ -103,6 +107,7 @@ allOf: enum: - allwinner,sun8i-h3-ths - allwinner,sun8i-r40-ths + - allwinner,sun20i-d1-ths - allwinner,sun50i-a64-ths - allwinner,sun50i-a100-ths - allwinner,sun50i-h5-ths diff --git a/Documentation/devicetree/bindings/thermal/loongson,ls2k-thermal.yaml b/Documentation/devicetree/bindings/thermal/loongson,ls2k-thermal.yaml index 7538469997f9..b634f57cd011 100644 --- a/Documentation/devicetree/bindings/thermal/loongson,ls2k-thermal.yaml +++ b/Documentation/devicetree/bindings/thermal/loongson,ls2k-thermal.yaml @@ -10,6 +10,9 @@ maintainers: - zhanghongchen <zhanghongchen@loongson.cn> - Yinbo Zhu <zhuyinbo@loongson.cn> +allOf: + - $ref: /schemas/thermal/thermal-sensor.yaml# + properties: compatible: oneOf: @@ -26,12 +29,16 @@ properties: interrupts: maxItems: 1 + '#thermal-sensor-cells': + const: 1 + required: - compatible - reg - interrupts + - '#thermal-sensor-cells' -additionalProperties: false +unevaluatedProperties: false examples: - | @@ -41,4 +48,5 @@ examples: reg = <0x1fe01500 0x30>; interrupt-parent = <&liointc0>; interrupts = <7 IRQ_TYPE_LEVEL_LOW>; + #thermal-sensor-cells = <1>; }; diff --git a/Documentation/devicetree/bindings/thermal/mediatek,thermal.yaml b/Documentation/devicetree/bindings/thermal/mediatek,thermal.yaml new file mode 100644 index 000000000000..d96a2e32bd8f --- /dev/null +++ b/Documentation/devicetree/bindings/thermal/mediatek,thermal.yaml @@ -0,0 +1,99 @@ +# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/thermal/mediatek,thermal.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: Mediatek thermal controller for on-SoC temperatures + +maintainers: + - Sascha Hauer <s.hauer@pengutronix.de> + +description: + This device does not have its own ADC, instead it directly controls the AUXADC + via AHB bus accesses. For this reason it needs phandles to the AUXADC. Also it + controls a mux in the apmixedsys register space via AHB bus accesses, so a + phandle to the APMIXEDSYS is also needed. + +allOf: + - $ref: thermal-sensor.yaml# + +properties: + compatible: + enum: + - mediatek,mt2701-thermal + - mediatek,mt2712-thermal + - mediatek,mt7622-thermal + - mediatek,mt7981-thermal + - mediatek,mt7986-thermal + - mediatek,mt8173-thermal + - mediatek,mt8183-thermal + - mediatek,mt8365-thermal + - mediatek,mt8516-thermal + + reg: + maxItems: 1 + + interrupts: + maxItems: 1 + + clocks: + items: + - description: Main clock needed for register access + - description: The AUXADC clock + + clock-names: + items: + - const: therm + - const: auxadc + + mediatek,auxadc: + $ref: /schemas/types.yaml#/definitions/phandle + description: A phandle to the AUXADC which the thermal controller uses + + mediatek,apmixedsys: + $ref: /schemas/types.yaml#/definitions/phandle + description: A phandle to the APMIXEDSYS controller + + resets: + description: Reset controller controlling the thermal controller + + nvmem-cells: + items: + - description: + NVMEM cell with EEPROMA phandle to the calibration data provided by an + NVMEM device. If unspecified default values shall be used. + + nvmem-cell-names: + items: + - const: calibration-data + +required: + - reg + - interrupts + - clocks + - clock-names + - mediatek,auxadc + - mediatek,apmixedsys + +unevaluatedProperties: false + +examples: + - | + #include <dt-bindings/interrupt-controller/irq.h> + #include <dt-bindings/clock/mt8173-clk.h> + #include <dt-bindings/reset/mt8173-resets.h> + + thermal@1100b000 { + compatible = "mediatek,mt8173-thermal"; + reg = <0x1100b000 0x1000>; + interrupts = <0 70 IRQ_TYPE_LEVEL_LOW>; + clocks = <&pericfg CLK_PERI_THERM>, <&pericfg CLK_PERI_AUXADC>; + clock-names = "therm", "auxadc"; + resets = <&pericfg MT8173_PERI_THERM_SW_RST>; + mediatek,auxadc = <&auxadc>; + mediatek,apmixedsys = <&apmixedsys>; + nvmem-cells = <&thermal_calibration_data>; + nvmem-cell-names = "calibration-data"; + #thermal-sensor-cells = <1>; + }; diff --git a/Documentation/devicetree/bindings/thermal/mediatek-thermal.txt b/Documentation/devicetree/bindings/thermal/mediatek-thermal.txt deleted file mode 100644 index ac39c7156fde..000000000000 --- a/Documentation/devicetree/bindings/thermal/mediatek-thermal.txt +++ /dev/null @@ -1,52 +0,0 @@ -* Mediatek Thermal - -This describes the device tree binding for the Mediatek thermal controller -which measures the on-SoC temperatures. This device does not have its own ADC, -instead it directly controls the AUXADC via AHB bus accesses. For this reason -this device needs phandles to the AUXADC. Also it controls a mux in the -apmixedsys register space via AHB bus accesses, so a phandle to the APMIXEDSYS -is also needed. - -Required properties: -- compatible: - - "mediatek,mt8173-thermal" : For MT8173 family of SoCs - - "mediatek,mt2701-thermal" : For MT2701 family of SoCs - - "mediatek,mt2712-thermal" : For MT2712 family of SoCs - - "mediatek,mt7622-thermal" : For MT7622 SoC - - "mediatek,mt7981-thermal", "mediatek,mt7986-thermal" : For MT7981 SoC - - "mediatek,mt7986-thermal" : For MT7986 SoC - - "mediatek,mt8183-thermal" : For MT8183 family of SoCs - - "mediatek,mt8365-thermal" : For MT8365 family of SoCs - - "mediatek,mt8516-thermal", "mediatek,mt2701-thermal : For MT8516 family of SoCs -- reg: Address range of the thermal controller -- interrupts: IRQ for the thermal controller -- clocks, clock-names: Clocks needed for the thermal controller. required - clocks are: - "therm": Main clock needed for register access - "auxadc": The AUXADC clock -- mediatek,auxadc: A phandle to the AUXADC which the thermal controller uses -- mediatek,apmixedsys: A phandle to the APMIXEDSYS controller. -- #thermal-sensor-cells : Should be 0. See Documentation/devicetree/bindings/thermal/thermal-sensor.yaml for a description. - -Optional properties: -- resets: Reference to the reset controller controlling the thermal controller. -- nvmem-cells: A phandle to the calibration data provided by a nvmem device. If - unspecified default values shall be used. -- nvmem-cell-names: Should be "calibration-data" - -Example: - - thermal: thermal@1100b000 { - #thermal-sensor-cells = <1>; - compatible = "mediatek,mt8173-thermal"; - reg = <0 0x1100b000 0 0x1000>; - interrupts = <0 70 IRQ_TYPE_LEVEL_LOW>; - clocks = <&pericfg CLK_PERI_THERM>, <&pericfg CLK_PERI_AUXADC>; - clock-names = "therm", "auxadc"; - resets = <&pericfg MT8173_PERI_THERM_SW_RST>; - reset-names = "therm"; - mediatek,auxadc = <&auxadc>; - mediatek,apmixedsys = <&apmixedsys>; - nvmem-cells = <&thermal_calibration_data>; - nvmem-cell-names = "calibration-data"; - }; diff --git a/Documentation/devicetree/bindings/thermal/qcom-spmi-adc-tm-hc.yaml b/Documentation/devicetree/bindings/thermal/qcom-spmi-adc-tm-hc.yaml index 01253d58bf9f..7541e27704ca 100644 --- a/Documentation/devicetree/bindings/thermal/qcom-spmi-adc-tm-hc.yaml +++ b/Documentation/devicetree/bindings/thermal/qcom-spmi-adc-tm-hc.yaml @@ -114,12 +114,14 @@ examples: - | #include <dt-bindings/iio/qcom,spmi-vadc.h> #include <dt-bindings/interrupt-controller/irq.h> - spmi_bus { + + pmic { #address-cells = <1>; #size-cells = <0>; + pm8998_adc: adc@3100 { - reg = <0x3100>; compatible = "qcom,spmi-adc-rev2"; + reg = <0x3100>; #address-cells = <1>; #size-cells = <0>; #io-channel-cells = <1>; @@ -130,7 +132,7 @@ examples: }; }; - pm8998_adc_tm: adc-tm@3400 { + adc-tm@3400 { compatible = "qcom,spmi-adc-tm-hc"; reg = <0x3400>; interrupts = <0x2 0x34 0x0 IRQ_TYPE_EDGE_RISING>; diff --git a/Documentation/devicetree/bindings/thermal/qcom-spmi-adc-tm5.yaml b/Documentation/devicetree/bindings/thermal/qcom-spmi-adc-tm5.yaml index 3c81def03c84..d9d2657287cb 100644 --- a/Documentation/devicetree/bindings/thermal/qcom-spmi-adc-tm5.yaml +++ b/Documentation/devicetree/bindings/thermal/qcom-spmi-adc-tm5.yaml @@ -167,12 +167,14 @@ examples: - | #include <dt-bindings/iio/qcom,spmi-vadc.h> #include <dt-bindings/interrupt-controller/irq.h> - spmi_bus { + + pmic { #address-cells = <1>; #size-cells = <0>; + pm8150b_adc: adc@3100 { - reg = <0x3100>; compatible = "qcom,spmi-adc5"; + reg = <0x3100>; #address-cells = <1>; #size-cells = <0>; #io-channel-cells = <1>; @@ -186,7 +188,7 @@ examples: }; }; - pm8150b_adc_tm: adc-tm@3500 { + adc-tm@3500 { compatible = "qcom,spmi-adc-tm5"; reg = <0x3500>; interrupts = <0x2 0x35 0x0 IRQ_TYPE_EDGE_RISING>; @@ -207,12 +209,14 @@ examples: #include <dt-bindings/iio/qcom,spmi-adc7-pmk8350.h> #include <dt-bindings/iio/qcom,spmi-adc7-pm8350.h> #include <dt-bindings/interrupt-controller/irq.h> - spmi_bus { + + pmic { #address-cells = <1>; #size-cells = <0>; + pmk8350_vadc: adc@3100 { - reg = <0x3100>; compatible = "qcom,spmi-adc7"; + reg = <0x3100>; #address-cells = <1>; #size-cells = <0>; #io-channel-cells = <1>; @@ -233,7 +237,7 @@ examples: }; }; - pmk8350_adc_tm: adc-tm@3400 { + adc-tm@3400 { compatible = "qcom,spmi-adc-tm5-gen2"; reg = <0x3400>; interrupts = <0x0 0x34 0x0 IRQ_TYPE_EDGE_RISING>; diff --git a/Documentation/devicetree/bindings/thermal/qcom-tsens.yaml b/Documentation/devicetree/bindings/thermal/qcom-tsens.yaml index 437b74732886..99d9c526c0b6 100644 --- a/Documentation/devicetree/bindings/thermal/qcom-tsens.yaml +++ b/Documentation/devicetree/bindings/thermal/qcom-tsens.yaml @@ -66,6 +66,7 @@ properties: - qcom,sm8350-tsens - qcom,sm8450-tsens - qcom,sm8550-tsens + - qcom,sm8650-tsens - const: qcom,tsens-v2 - description: v2 of TSENS with combined interrupt diff --git a/Documentation/devicetree/bindings/thermal/thermal-zones.yaml b/Documentation/devicetree/bindings/thermal/thermal-zones.yaml index 4a8dabc48170..dbd52620d293 100644 --- a/Documentation/devicetree/bindings/thermal/thermal-zones.yaml +++ b/Documentation/devicetree/bindings/thermal/thermal-zones.yaml @@ -75,6 +75,22 @@ patternProperties: framework and assumes that the thermal sensors in this zone support interrupts. + critical-action: + $ref: /schemas/types.yaml#/definitions/string + description: | + The action the OS should perform after the critical temperature is reached. + By default the system will shutdown as a safe action to prevent damage + to the hardware, if the property is not set. + The shutdown action should be always the default and preferred one. + Choose 'reboot' with care, as the hardware may be in thermal stress, + thus leading to infinite reboots that may cause damage to the hardware. + Make sure the firmware/bootloader will act as the last resort and take + over the thermal control. + + enum: + - shutdown + - reboot + thermal-sensors: $ref: /schemas/types.yaml#/definitions/phandle-array maxItems: 1 diff --git a/Documentation/driver-api/crypto/iaa/iaa-crypto.rst b/Documentation/driver-api/crypto/iaa/iaa-crypto.rst new file mode 100644 index 000000000000..de587cf9cbed --- /dev/null +++ b/Documentation/driver-api/crypto/iaa/iaa-crypto.rst @@ -0,0 +1,824 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========================================= +IAA Compression Accelerator Crypto Driver +========================================= + +Tom Zanussi <tom.zanussi@linux.intel.com> + +The IAA crypto driver supports compression/decompression compatible +with the DEFLATE compression standard described in RFC 1951, which is +the compression/decompression algorithm exported by this module. + +The IAA hardware spec can be found here: + + https://cdrdv2.intel.com/v1/dl/getContent/721858 + +The iaa_crypto driver is designed to work as a layer underneath +higher-level compression devices such as zswap. + +Users can select IAA compress/decompress acceleration by specifying +one of the supported IAA compression algorithms in whatever facility +allows compression algorithms to be selected. + +For example, a zswap device can select the IAA 'fixed' mode +represented by selecting the 'deflate-iaa' crypto compression +algorithm:: + + # echo deflate-iaa > /sys/module/zswap/parameters/compressor + +This will tell zswap to use the IAA 'fixed' compression mode for all +compresses and decompresses. + +Currently, there is only one compression modes available, 'fixed' +mode. + +The 'fixed' compression mode implements the compression scheme +specified by RFC 1951 and is given the crypto algorithm name +'deflate-iaa'. (Because the IAA hardware has a 4k history-window +limitation, only buffers <= 4k, or that have been compressed using a +<= 4k history window, are technically compliant with the deflate spec, +which allows for a window of up to 32k. Because of this limitation, +the IAA fixed mode deflate algorithm is given its own algorithm name +rather than simply 'deflate'). + + +Config options and other setup +============================== + +The IAA crypto driver is available via menuconfig using the following +path:: + + Cryptographic API -> Hardware crypto devices -> Support for Intel(R) IAA Compression Accelerator + +In the configuration file the option called CONFIG_CRYPTO_DEV_IAA_CRYPTO. + +The IAA crypto driver also supports statistics, which are available +via menuconfig using the following path:: + + Cryptographic API -> Hardware crypto devices -> Support for Intel(R) IAA Compression -> Enable Intel(R) IAA Compression Accelerator Statistics + +In the configuration file the option called CONFIG_CRYPTO_DEV_IAA_CRYPTO_STATS. + +The following config options should also be enabled:: + + CONFIG_IRQ_REMAP=y + CONFIG_INTEL_IOMMU=y + CONFIG_INTEL_IOMMU_SVM=y + CONFIG_PCI_ATS=y + CONFIG_PCI_PRI=y + CONFIG_PCI_PASID=y + CONFIG_INTEL_IDXD=m + CONFIG_INTEL_IDXD_SVM=y + +IAA is one of the first Intel accelerator IPs that can work in +conjunction with the Intel IOMMU. There are multiple modes that exist +for testing. Based on IOMMU configuration, there are 3 modes:: + + - Scalable + - Legacy + - No IOMMU + + +Scalable mode +------------- + +Scalable mode supports Shared Virtual Memory (SVM or SVA). It is +entered when using the kernel boot commandline:: + + intel_iommu=on,sm_on + +with VT-d turned on in BIOS. + +With scalable mode, both shared and dedicated workqueues are available +for use. + +For scalable mode, the following BIOS settings should be enabled:: + + Socket Configuration > IIO Configuration > Intel VT for Directed I/O (VT-d) > Intel VT for Directed I/O + + Socket Configuration > IIO Configuration > PCIe ENQCMD > ENQCMDS + + +Legacy mode +----------- + +Legacy mode is entered when using the kernel boot commandline:: + + intel_iommu=off + +or VT-d is not turned on in BIOS. + +If you have booted into Linux and not sure if VT-d is on, do a "dmesg +| grep -i dmar". If you don't see a number of DMAR devices enumerated, +most likely VT-d is not on. + +With legacy mode, only dedicated workqueues are available for use. + + +No IOMMU mode +------------- + +No IOMMU mode is entered when using the kernel boot commandline:: + + iommu=off. + +With no IOMMU mode, only dedicated workqueues are available for use. + + +Usage +===== + +accel-config +------------ + +When loaded, the iaa_crypto driver automatically creates a default +configuration and enables it, and assigns default driver attributes. +If a different configuration or set of driver attributes is required, +the user must first disable the IAA devices and workqueues, reset the +configuration, and then re-register the deflate-iaa algorithm with the +crypto subsystem by removing and reinserting the iaa_crypto module. + +The :ref:`iaa_disable_script` in the 'Use Cases' +section below can be used to disable the default configuration. + +See :ref:`iaa_default_config` below for details of the default +configuration. + +More likely than not, however, and because of the complexity and +configurability of the accelerator devices, the user will want to +configure the device and manually enable the desired devices and +workqueues. + +The userspace tool to help doing that is called accel-config. Using +accel-config to configure device or loading a previously saved config +is highly recommended. The device can be controlled via sysfs +directly but comes with the warning that you should do this ONLY if +you know exactly what you are doing. The following sections will not +cover the sysfs interface but assumes you will be using accel-config. + +The :ref:`iaa_sysfs_config` section in the appendix below can be +consulted for the sysfs interface details if interested. + +The accel-config tool along with instructions for building it can be +found here: + + https://github.com/intel/idxd-config/#readme + +Typical usage +------------- + +In order for the iaa_crypto module to actually do any +compression/decompression work on behalf of a facility, one or more +IAA workqueues need to be bound to the iaa_crypto driver. + +For instance, here's an example of configuring an IAA workqueue and +binding it to the iaa_crypto driver (note that device names are +specified as 'iax' rather than 'iaa' - this is because upstream still +has the old 'iax' device naming in place) :: + + # configure wq1.0 + + accel-config config-wq --group-id=0 --mode=dedicated --type=kernel --name="iaa_crypto" --device_name="crypto" iax1/wq1.0 + + # enable IAA device iax1 + + accel-config enable-device iax1 + + # enable wq1.0 on IAX device iax1 + + accel-config enable-wq iax1/wq1.0 + +Whenever a new workqueue is bound to or unbound from the iaa_crypto +driver, the available workqueues are 'rebalanced' such that work +submitted from a particular CPU is given to the most appropriate +workqueue available. Current best practice is to configure and bind +at least one workqueue for each IAA device, but as long as there is at +least one workqueue configured and bound to any IAA device in the +system, the iaa_crypto driver will work, albeit most likely not as +efficiently. + +The IAA crypto algorigthms is operational and compression and +decompression operations are fully enabled following the successful +binding of the first IAA workqueue to the iaa_crypto driver. + +Similarly, the IAA crypto algorithm is not operational and compression +and decompression operations are disabled following the unbinding of +the last IAA worqueue to the iaa_crypto driver. + +As a result, the IAA crypto algorithms and thus the IAA hardware are +only available when one or more workques are bound to the iaa_crypto +driver. + +When there are no IAA workqueues bound to the driver, the IAA crypto +algorithms can be unregistered by removing the module. + + +Driver attributes +----------------- + +There are a couple user-configurable driver attributes that can be +used to configure various modes of operation. They're listed below, +along with their default values. To set any of these attributes, echo +the appropriate values to the attribute file located under +/sys/bus/dsa/drivers/crypto/ + +The attribute settings at the time the IAA algorithms are registered +are captured in each algorithm's crypto_ctx and used for all compresses +and decompresses when using that algorithm. + +The available attributes are: + + - verify_compress + + Toggle compression verification. If set, each compress will be + internally decompressed and the contents verified, returning error + codes if unsuccessful. This can be toggled with 0/1:: + + echo 0 > /sys/bus/dsa/drivers/crypto/verify_compress + + The default setting is '1' - verify all compresses. + + - sync_mode + + Select mode to be used to wait for completion of each compresses + and decompress operation. + + The crypto async interface support implemented by iaa_crypto + provides an implementation that satisfies the interface but does + so in a synchronous manner - it fills and submits the IDXD + descriptor and then loops around waiting for it to complete before + returning. This isn't a problem at the moment, since all existing + callers (e.g. zswap) wrap any asynchronous callees in a + synchronous wrapper anyway. + + The iaa_crypto driver does however provide true asynchronous + support for callers that can make use of it. In this mode, it + fills and submits the IDXD descriptor, then returns immediately + with -EINPROGRESS. The caller can then either poll for completion + itself, which requires specific code in the caller which currently + nothing in the upstream kernel implements, or go to sleep and wait + for an interrupt signaling completion. This latter mode is + supported by current users in the kernel such as zswap via + synchronous wrappers. Although it is supported this mode is + significantly slower than the synchronous mode that does the + polling in the iaa_crypto driver previously mentioned. + + This mode can be enabled by writing 'async_irq' to the sync_mode + iaa_crypto driver attribute:: + + echo async_irq > /sys/bus/dsa/drivers/crypto/sync_mode + + Async mode without interrupts (caller must poll) can be enabled by + writing 'async' to it:: + + echo async > /sys/bus/dsa/drivers/crypto/sync_mode + + The mode that does the polling in the iaa_crypto driver can be + enabled by writing 'sync' to it:: + + echo sync > /sys/bus/dsa/drivers/crypto/sync_mode + + The default mode is 'sync'. + +.. _iaa_default_config: + +IAA Default Configuration +------------------------- + +When the iaa_crypto driver is loaded, each IAA device has a single +work queue configured for it, with the following attributes:: + + mode "dedicated" + threshold 0 + size Total WQ Size from WQCAP + priority 10 + type IDXD_WQT_KERNEL + group 0 + name "iaa_crypto" + driver_name "crypto" + +The devices and workqueues are also enabled and therefore the driver +is ready to be used without any additional configuration. + +The default driver attributes in effect when the driver is loaded are:: + + sync_mode "sync" + verify_compress 1 + +In order to change either the device/work queue or driver attributes, +the enabled devices and workqueues must first be disabled. In order +to have the new configuration applied to the deflate-iaa crypto +algorithm, it needs to be re-registered by removing and reinserting +the iaa_crypto module. The :ref:`iaa_disable_script` in the 'Use +Cases' section below can be used to disable the default configuration. + +Statistics +========== + +If the optional debugfs statistics support is enabled, the IAA crypto +driver will generate statistics which can be accessed in debugfs at:: + + # ls -al /sys/kernel/debug/iaa-crypto/ + total 0 + drwxr-xr-x 2 root root 0 Mar 3 09:35 . + drwx------ 47 root root 0 Mar 3 09:35 .. + -rw-r--r-- 1 root root 0 Mar 3 09:35 max_acomp_delay_ns + -rw-r--r-- 1 root root 0 Mar 3 09:35 max_adecomp_delay_ns + -rw-r--r-- 1 root root 0 Mar 3 09:35 max_comp_delay_ns + -rw-r--r-- 1 root root 0 Mar 3 09:35 max_decomp_delay_ns + -rw-r--r-- 1 root root 0 Mar 3 09:35 stats_reset + -rw-r--r-- 1 root root 0 Mar 3 09:35 total_comp_bytes_out + -rw-r--r-- 1 root root 0 Mar 3 09:35 total_comp_calls + -rw-r--r-- 1 root root 0 Mar 3 09:35 total_decomp_bytes_in + -rw-r--r-- 1 root root 0 Mar 3 09:35 total_decomp_calls + -rw-r--r-- 1 root root 0 Mar 3 09:35 wq_stats + +Most of the above statisticss are self-explanatory. The wq_stats file +shows per-wq stats, a set for each iaa device and wq in addition to +some global stats:: + + # cat wq_stats + global stats: + total_comp_calls: 100 + total_decomp_calls: 100 + total_comp_bytes_out: 22800 + total_decomp_bytes_in: 22800 + total_completion_einval_errors: 0 + total_completion_timeout_errors: 0 + total_completion_comp_buf_overflow_errors: 0 + + iaa device: + id: 1 + n_wqs: 1 + comp_calls: 0 + comp_bytes: 0 + decomp_calls: 0 + decomp_bytes: 0 + wqs: + name: iaa_crypto + comp_calls: 0 + comp_bytes: 0 + decomp_calls: 0 + decomp_bytes: 0 + + iaa device: + id: 3 + n_wqs: 1 + comp_calls: 0 + comp_bytes: 0 + decomp_calls: 0 + decomp_bytes: 0 + wqs: + name: iaa_crypto + comp_calls: 0 + comp_bytes: 0 + decomp_calls: 0 + decomp_bytes: 0 + + iaa device: + id: 5 + n_wqs: 1 + comp_calls: 100 + comp_bytes: 22800 + decomp_calls: 100 + decomp_bytes: 22800 + wqs: + name: iaa_crypto + comp_calls: 100 + comp_bytes: 22800 + decomp_calls: 100 + decomp_bytes: 22800 + +Writing 0 to 'stats_reset' resets all the stats, including the +per-device and per-wq stats:: + + # echo 0 > stats_reset + # cat wq_stats + global stats: + total_comp_calls: 0 + total_decomp_calls: 0 + total_comp_bytes_out: 0 + total_decomp_bytes_in: 0 + total_completion_einval_errors: 0 + total_completion_timeout_errors: 0 + total_completion_comp_buf_overflow_errors: 0 + ... + + +Use cases +========= + +Simple zswap test +----------------- + +For this example, the kernel should be configured according to the +dedicated mode options described above, and zswap should be enabled as +well:: + + CONFIG_ZSWAP=y + +This is a simple test that uses iaa_compress as the compressor for a +swap (zswap) device. It sets up the zswap device and then uses the +memory_memadvise program listed below to forcibly swap out and in a +specified number of pages, demonstrating both compress and decompress. + +The zswap test expects the work queues for each IAA device on the +system to be configured properly as a kernel workqueue with a +workqueue driver_name of "crypto". + +The first step is to make sure the iaa_crypto module is loaded:: + + modprobe iaa_crypto + +If the IAA devices and workqueues haven't previously been disabled and +reconfigured, then the default configuration should be in place and no +further IAA configuration is necessary. See :ref:`iaa_default_config` +below for details of the default configuration. + +If the default configuration is in place, you should see the iaa +devices and wq0s enabled:: + + # cat /sys/bus/dsa/devices/iax1/state + enabled + # cat /sys/bus/dsa/devices/iax1/wq1.0/state + enabled + +To demonstrate that the following steps work as expected, these +commands can be used to enable debug output:: + + # echo -n 'module iaa_crypto +p' > /sys/kernel/debug/dynamic_debug/control + # echo -n 'module idxd +p' > /sys/kernel/debug/dynamic_debug/control + +Use the following commands to enable zswap:: + + # echo 0 > /sys/module/zswap/parameters/enabled + # echo 50 > /sys/module/zswap/parameters/max_pool_percent + # echo deflate-iaa > /sys/module/zswap/parameters/compressor + # echo zsmalloc > /sys/module/zswap/parameters/zpool + # echo 1 > /sys/module/zswap/parameters/enabled + # echo 0 > /sys/module/zswap/parameters/same_filled_pages_enabled + # echo 100 > /proc/sys/vm/swappiness + # echo never > /sys/kernel/mm/transparent_hugepage/enabled + # echo 1 > /proc/sys/vm/overcommit_memory + +Now you can now run the zswap workload you want to measure. For +example, using the memory_memadvise code below, the following command +will swap in and out 100 pages:: + + ./memory_madvise 100 + + Allocating 100 pages to swap in/out + Swapping out 100 pages + Swapping in 100 pages + Swapped out and in 100 pages + +You should see something like the following in the dmesg output:: + + [ 404.202972] idxd 0000:e7:02.0: iaa_comp_acompress: dma_map_sg, src_addr 223925c000, nr_sgs 1, req->src 00000000ee7cb5e6, req->slen 4096, sg_dma_len(sg) 4096 + [ 404.202973] idxd 0000:e7:02.0: iaa_comp_acompress: dma_map_sg, dst_addr 21dadf8000, nr_sgs 1, req->dst 000000008d6acea8, req->dlen 4096, sg_dma_len(sg) 8192 + [ 404.202975] idxd 0000:e7:02.0: iaa_compress: desc->src1_addr 223925c000, desc->src1_size 4096, desc->dst_addr 21dadf8000, desc->max_dst_size 4096, desc->src2_addr 2203543000, desc->src2_size 1568 + [ 404.202981] idxd 0000:e7:02.0: iaa_compress_verify: (verify) desc->src1_addr 21dadf8000, desc->src1_size 228, desc->dst_addr 223925c000, desc->max_dst_size 4096, desc->src2_addr 0, desc->src2_size 0 + ... + +Now that basic functionality has been demonstrated, the defaults can +be erased and replaced with a different configuration. To do that, +first disable zswap:: + + # echo lzo > /sys/module/zswap/parameters/compressor + # swapoff -a + # echo 0 > /sys/module/zswap/parameters/accept_threshold_percent + # echo 0 > /sys/module/zswap/parameters/max_pool_percent + # echo 0 > /sys/module/zswap/parameters/enabled + # echo 0 > /sys/module/zswap/parameters/enabled + +Then run the :ref:`iaa_disable_script` in the 'Use Cases' section +below to disable the default configuration. + +Finally turn swap back on:: + + # swapon -a + +Following all that the IAA device(s) can now be re-configured and +enabled as desired for further testing. Below is one example. + +The zswap test expects the work queues for each IAA device on the +system to be configured properly as a kernel workqueue with a +workqueue driver_name of "crypto". + +The below script automatically does that:: + + #!/bin/bash + + echo "IAA devices:" + lspci -d:0cfe + echo "# IAA devices:" + lspci -d:0cfe | wc -l + + # + # count iaa instances + # + iaa_dev_id="0cfe" + num_iaa=$(lspci -d:${iaa_dev_id} | wc -l) + echo "Found ${num_iaa} IAA instances" + + # + # disable iaa wqs and devices + # + echo "Disable IAA" + + for ((i = 1; i < ${num_iaa} * 2; i += 2)); do + echo disable wq iax${i}/wq${i}.0 + accel-config disable-wq iax${i}/wq${i}.0 + echo disable iaa iax${i} + accel-config disable-device iax${i} + done + + echo "End Disable IAA" + + # + # configure iaa wqs and devices + # + echo "Configure IAA" + for ((i = 1; i < ${num_iaa} * 2; i += 2)); do + accel-config config-wq --group-id=0 --mode=dedicated --size=128 --priority=10 --type=kernel --name="iaa_crypto" --driver_name="crypto" iax${i}/wq${i} + done + + echo "End Configure IAA" + + # + # enable iaa wqs and devices + # + echo "Enable IAA" + + for ((i = 1; i < ${num_iaa} * 2; i += 2)); do + echo enable iaa iaa${i} + accel-config enable-device iaa${i} + echo enable wq iaa${i}/wq${i}.0 + accel-config enable-wq iaa${i}/wq${i}.0 + done + + echo "End Enable IAA" + +When the workqueues are bound to the iaa_crypto driver, you should +see something similar to the following in dmesg output if you've +enabled debug output (echo -n 'module iaa_crypto +p' > +/sys/kernel/debug/dynamic_debug/control):: + + [ 60.752344] idxd 0000:f6:02.0: add_iaa_wq: added wq 000000004068d14d to iaa 00000000c9585ba2, n_wq 1 + [ 60.752346] iaa_crypto: rebalance_wq_table: nr_nodes=2, nr_cpus 160, nr_iaa 8, cpus_per_iaa 20 + [ 60.752347] iaa_crypto: rebalance_wq_table: iaa=0 + [ 60.752349] idxd 0000:6a:02.0: request_iaa_wq: getting wq from iaa_device 0000000042d7bc52 (0) + [ 60.752350] idxd 0000:6a:02.0: request_iaa_wq: returning unused wq 00000000c8bb4452 (0) from iaa device 0000000042d7bc52 (0) + [ 60.752352] iaa_crypto: rebalance_wq_table: assigned wq for cpu=0, node=0 = wq 00000000c8bb4452 + [ 60.752354] iaa_crypto: rebalance_wq_table: iaa=0 + [ 60.752355] idxd 0000:6a:02.0: request_iaa_wq: getting wq from iaa_device 0000000042d7bc52 (0) + [ 60.752356] idxd 0000:6a:02.0: request_iaa_wq: returning unused wq 00000000c8bb4452 (0) from iaa device 0000000042d7bc52 (0) + [ 60.752358] iaa_crypto: rebalance_wq_table: assigned wq for cpu=1, node=0 = wq 00000000c8bb4452 + [ 60.752359] iaa_crypto: rebalance_wq_table: iaa=0 + [ 60.752360] idxd 0000:6a:02.0: request_iaa_wq: getting wq from iaa_device 0000000042d7bc52 (0) + [ 60.752361] idxd 0000:6a:02.0: request_iaa_wq: returning unused wq 00000000c8bb4452 (0) from iaa device 0000000042d7bc52 (0) + [ 60.752362] iaa_crypto: rebalance_wq_table: assigned wq for cpu=2, node=0 = wq 00000000c8bb4452 + [ 60.752364] iaa_crypto: rebalance_wq_table: iaa=0 + . + . + . + +Once the workqueues and devices have been enabled, the IAA crypto +algorithms are enabled and available. When the IAA crypto algorithms +have been successfully enabled, you should see the following dmesg +output:: + + [ 64.893759] iaa_crypto: iaa_crypto_enable: iaa_crypto now ENABLED + +Now run the following zswap-specific setup commands to have zswap use +the 'fixed' compression mode:: + + echo 0 > /sys/module/zswap/parameters/enabled + echo 50 > /sys/module/zswap/parameters/max_pool_percent + echo deflate-iaa > /sys/module/zswap/parameters/compressor + echo zsmalloc > /sys/module/zswap/parameters/zpool + echo 1 > /sys/module/zswap/parameters/enabled + echo 0 > /sys/module/zswap/parameters/same_filled_pages_enabled + + echo 100 > /proc/sys/vm/swappiness + echo never > /sys/kernel/mm/transparent_hugepage/enabled + echo 1 > /proc/sys/vm/overcommit_memory + +Finally, you can now run the zswap workload you want to measure. For +example, using the code below, the following command will swap in and +out 100 pages:: + + ./memory_madvise 100 + + Allocating 100 pages to swap in/out + Swapping out 100 pages + Swapping in 100 pages + Swapped out and in 100 pages + +You should see something like the following in the dmesg output if +you've enabled debug output (echo -n 'module iaa_crypto +p' > +/sys/kernel/debug/dynamic_debug/control):: + + [ 404.202972] idxd 0000:e7:02.0: iaa_comp_acompress: dma_map_sg, src_addr 223925c000, nr_sgs 1, req->src 00000000ee7cb5e6, req->slen 4096, sg_dma_len(sg) 4096 + [ 404.202973] idxd 0000:e7:02.0: iaa_comp_acompress: dma_map_sg, dst_addr 21dadf8000, nr_sgs 1, req->dst 000000008d6acea8, req->dlen 4096, sg_dma_len(sg) 8192 + [ 404.202975] idxd 0000:e7:02.0: iaa_compress: desc->src1_addr 223925c000, desc->src1_size 4096, desc->dst_addr 21dadf8000, desc->max_dst_size 4096, desc->src2_addr 2203543000, desc->src2_size 1568 + [ 404.202981] idxd 0000:e7:02.0: iaa_compress_verify: (verify) desc->src1_addr 21dadf8000, desc->src1_size 228, desc->dst_addr 223925c000, desc->max_dst_size 4096, desc->src2_addr 0, desc->src2_size 0 + [ 409.203227] idxd 0000:e7:02.0: iaa_comp_adecompress: dma_map_sg, src_addr 21ddd8b100, nr_sgs 1, req->src 0000000084adab64, req->slen 228, sg_dma_len(sg) 228 + [ 409.203235] idxd 0000:e7:02.0: iaa_comp_adecompress: dma_map_sg, dst_addr 21ee3dc000, nr_sgs 1, req->dst 000000004e2990d0, req->dlen 4096, sg_dma_len(sg) 4096 + [ 409.203239] idxd 0000:e7:02.0: iaa_decompress: desc->src1_addr 21ddd8b100, desc->src1_size 228, desc->dst_addr 21ee3dc000, desc->max_dst_size 4096, desc->src2_addr 0, desc->src2_size 0 + [ 409.203254] idxd 0000:e7:02.0: iaa_comp_adecompress: dma_map_sg, src_addr 21ddd8b100, nr_sgs 1, req->src 0000000084adab64, req->slen 228, sg_dma_len(sg) 228 + [ 409.203256] idxd 0000:e7:02.0: iaa_comp_adecompress: dma_map_sg, dst_addr 21f1551000, nr_sgs 1, req->dst 000000004e2990d0, req->dlen 4096, sg_dma_len(sg) 4096 + [ 409.203257] idxd 0000:e7:02.0: iaa_decompress: desc->src1_addr 21ddd8b100, desc->src1_size 228, desc->dst_addr 21f1551000, desc->max_dst_size 4096, desc->src2_addr 0, desc->src2_size 0 + +In order to unregister the IAA crypto algorithms, and register new +ones using different parameters, any users of the current algorithm +should be stopped and the IAA workqueues and devices disabled. + +In the case of zswap, remove the IAA crypto algorithm as the +compressor and turn off swap (to remove all references to +iaa_crypto):: + + echo lzo > /sys/module/zswap/parameters/compressor + swapoff -a + + echo 0 > /sys/module/zswap/parameters/accept_threshold_percent + echo 0 > /sys/module/zswap/parameters/max_pool_percent + echo 0 > /sys/module/zswap/parameters/enabled + +Once zswap is disabled and no longer using iaa_crypto, the IAA wqs and +devices can be disabled. + +.. _iaa_disable_script: + +IAA disable script +------------------ + +The below script automatically does that:: + + #!/bin/bash + + echo "IAA devices:" + lspci -d:0cfe + echo "# IAA devices:" + lspci -d:0cfe | wc -l + + # + # count iaa instances + # + iaa_dev_id="0cfe" + num_iaa=$(lspci -d:${iaa_dev_id} | wc -l) + echo "Found ${num_iaa} IAA instances" + + # + # disable iaa wqs and devices + # + echo "Disable IAA" + + for ((i = 1; i < ${num_iaa} * 2; i += 2)); do + echo disable wq iax${i}/wq${i}.0 + accel-config disable-wq iax${i}/wq${i}.0 + echo disable iaa iax${i} + accel-config disable-device iax${i} + done + + echo "End Disable IAA" + +Finally, at this point the iaa_crypto module can be removed, which +will unregister the current IAA crypto algorithms:: + + rmmod iaa_crypto + + +memory_madvise.c (gcc -o memory_memadvise memory_madvise.c):: + + #include <stdio.h> + #include <stdlib.h> + #include <string.h> + #include <unistd.h> + #include <sys/mman.h> + #include <linux/mman.h> + + #ifndef MADV_PAGEOUT + #define MADV_PAGEOUT 21 /* force pages out immediately */ + #endif + + #define PG_SZ 4096 + + int main(int argc, char **argv) + { + int i, nr_pages = 1; + int64_t *dump_ptr; + char *addr, *a; + int loop = 1; + + if (argc > 1) + nr_pages = atoi(argv[1]); + + printf("Allocating %d pages to swap in/out\n", nr_pages); + + /* allocate pages */ + addr = mmap(NULL, nr_pages * PG_SZ, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0); + *addr = 1; + + /* initialize data in page to all '*' chars */ + memset(addr, '*', nr_pages * PG_SZ); + + printf("Swapping out %d pages\n", nr_pages); + + /* Tell kernel to swap it out */ + madvise(addr, nr_pages * PG_SZ, MADV_PAGEOUT); + + while (loop > 0) { + /* Wait for swap out to finish */ + sleep(5); + + a = addr; + + printf("Swapping in %d pages\n", nr_pages); + + /* Access the page ... this will swap it back in again */ + for (i = 0; i < nr_pages; i++) { + if (a[0] != '*') { + printf("Bad data from decompress!!!!!\n"); + + dump_ptr = (int64_t *)a; + for (int j = 0; j < 100; j++) { + printf(" page %d data: %#llx\n", i, *dump_ptr); + dump_ptr++; + } + } + + a += PG_SZ; + } + + loop --; + } + + printf("Swapped out and in %d pages\n", nr_pages); + +Appendix +======== + +.. _iaa_sysfs_config: + +IAA sysfs config interface +-------------------------- + +Below is a description of the IAA sysfs interface, which as mentioned +in the main document, should only be used if you know exactly what you +are doing. Even then, there's no compelling reason to use it directly +since accel-config can do everything the sysfs interface can and in +fact accel-config is based on it under the covers. + +The 'IAA config path' is /sys/bus/dsa/devices and contains +subdirectories representing each IAA device, workqueue, engine, and +group. Note that in the sysfs interface, the IAA devices are actually +named using iax e.g. iax1, iax3, etc. (Note that IAA devices are the +odd-numbered devices; the even-numbered devices are DSA devices and +can be ignored for IAA). + +The 'IAA device bind path' is /sys/bus/dsa/drivers/idxd/bind and is +the file that is written to enable an IAA device. + +The 'IAA workqueue bind path' is /sys/bus/dsa/drivers/crypto/bind and +is the file that is written to enable an IAA workqueue. + +Similarly /sys/bus/dsa/drivers/idxd/unbind and +/sys/bus/dsa/drivers/crypto/unbind are used to disable IAA devices and +workqueues. + +The basic sequence of commands needed to set up the IAA devices and +workqueues is: + +For each device:: + 1) Disable any workqueues enabled on the device. For example to + disable workques 0 and 1 on IAA device 3:: + + # echo wq3.0 > /sys/bus/dsa/drivers/crypto/unbind + # echo wq3.1 > /sys/bus/dsa/drivers/crypto/unbind + + 2) Disable the device. For example to disable IAA device 3:: + + # echo iax3 > /sys/bus/dsa/drivers/idxd/unbind + + 3) configure the desired workqueues. For example, to configure + workqueue 3 on IAA device 3:: + + # echo dedicated > /sys/bus/dsa/devices/iax3/wq3.3/mode + # echo 128 > /sys/bus/dsa/devices/iax3/wq3.3/size + # echo 0 > /sys/bus/dsa/devices/iax3/wq3.3/group_id + # echo 10 > /sys/bus/dsa/devices/iax3/wq3.3/priority + # echo "kernel" > /sys/bus/dsa/devices/iax3/wq3.3/type + # echo "iaa_crypto" > /sys/bus/dsa/devices/iax3/wq3.3/name + # echo "crypto" > /sys/bus/dsa/devices/iax3/wq3.3/driver_name + + 4) Enable the device. For example to enable IAA device 3:: + + # echo iax3 > /sys/bus/dsa/drivers/idxd/bind + + 5) Enable the desired workqueues on the device. For example to + enable workques 0 and 1 on IAA device 3:: + + # echo wq3.0 > /sys/bus/dsa/drivers/crypto/bind + # echo wq3.1 > /sys/bus/dsa/drivers/crypto/bind diff --git a/Documentation/driver-api/crypto/iaa/index.rst b/Documentation/driver-api/crypto/iaa/index.rst new file mode 100644 index 000000000000..aa6837e27264 --- /dev/null +++ b/Documentation/driver-api/crypto/iaa/index.rst @@ -0,0 +1,20 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================================= +IAA (Intel Analytics Accelerator) +================================= + +IAA provides hardware compression and decompression via the crypto +API. + +.. toctree:: + :maxdepth: 1 + + iaa-crypto + +.. only:: subproject and html + + Indices + ======= + + * :ref:`genindex` diff --git a/Documentation/driver-api/crypto/index.rst b/Documentation/driver-api/crypto/index.rst new file mode 100644 index 000000000000..fb9709b98bea --- /dev/null +++ b/Documentation/driver-api/crypto/index.rst @@ -0,0 +1,20 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============== +Crypto Drivers +============== + +Documentation for crypto drivers that may need more involved setup and +configuration. + +.. toctree:: + :maxdepth: 1 + + iaa/index + +.. only:: subproject and html + + Indices + ======= + + * :ref:`genindex` diff --git a/Documentation/driver-api/index.rst b/Documentation/driver-api/index.rst index f549a68951d7..3cc0e1a517c0 100644 --- a/Documentation/driver-api/index.rst +++ b/Documentation/driver-api/index.rst @@ -115,6 +115,8 @@ available subsections can be seen below. hte/index wmi dpll + wbrf + crypto/index .. only:: subproject and html diff --git a/Documentation/driver-api/mtd/spi-nor.rst b/Documentation/driver-api/mtd/spi-nor.rst index c22f8c0f7950..148fa4288760 100644 --- a/Documentation/driver-api/mtd/spi-nor.rst +++ b/Documentation/driver-api/mtd/spi-nor.rst @@ -2,64 +2,204 @@ SPI NOR framework ================= -Part I - Why do we need this framework? ---------------------------------------- - -SPI bus controllers (drivers/spi/) only deal with streams of bytes; the bus -controller operates agnostic of the specific device attached. However, some -controllers (such as Freescale's QuadSPI controller) cannot easily handle -arbitrary streams of bytes, but rather are designed specifically for SPI NOR. - -In particular, Freescale's QuadSPI controller must know the NOR commands to -find the right LUT sequence. Unfortunately, the SPI subsystem has no notion of -opcodes, addresses, or data payloads; a SPI controller simply knows to send or -receive bytes (Tx and Rx). Therefore, we must define a new layering scheme under -which the controller driver is aware of the opcodes, addressing, and other -details of the SPI NOR protocol. - -Part II - How does the framework work? --------------------------------------- - -This framework just adds a new layer between the MTD and the SPI bus driver. -With this new layer, the SPI NOR controller driver does not depend on the -m25p80 code anymore. - -Before this framework, the layer is like:: - - MTD - ------------------------ - m25p80 - ------------------------ - SPI bus driver - ------------------------ - SPI NOR chip - -After this framework, the layer is like:: - - MTD - ------------------------ - SPI NOR framework - ------------------------ - m25p80 - ------------------------ - SPI bus driver - ------------------------ - SPI NOR chip - -With the SPI NOR controller driver (Freescale QuadSPI), it looks like:: - - MTD - ------------------------ - SPI NOR framework - ------------------------ - fsl-quadSPI - ------------------------ - SPI NOR chip - -Part III - How can drivers use the framework? ---------------------------------------------- - -The main API is spi_nor_scan(). Before you call the hook, a driver should -initialize the necessary fields for spi_nor{}. Please see -drivers/mtd/spi-nor/spi-nor.c for detail. Please also refer to spi-fsl-qspi.c -when you want to write a new driver for a SPI NOR controller. +How to propose a new flash addition +----------------------------------- + +Most SPI NOR flashes comply with the JEDEC JESD216 +Serial Flash Discoverable Parameter (SFDP) standard. SFDP describes +the functional and feature capabilities of serial flash devices in a +standard set of internal read-only parameter tables. + +The SPI NOR driver queries the SFDP tables in order to determine the +flash's parameters and settings. If the flash defines the SFDP tables +it's likely that you won't need a flash entry at all, and instead +rely on the generic flash driver which probes the flash solely based +on its SFDP data. All one has to do is to specify the "jedec,spi-nor" +compatible in the device tree. + +There are cases however where you need to define an explicit flash +entry. This typically happens when the flash has settings or support +that is not covered by the SFDP tables (e.g. Block Protection), or +when the flash contains mangled SFDP data. If the later, one needs +to implement the ``spi_nor_fixups`` hooks in order to amend the SFDP +parameters with the correct values. + +Minimum testing requirements +----------------------------- + +Do all the tests from below and paste them in the commit's comments +section, after the ``---`` marker. + +1) Specify the controller that you used to test the flash and specify + the frequency at which the flash was operated, e.g.:: + + This flash is populated on the X board and was tested at Y + frequency using the Z (put compatible) SPI controller. + +2) Dump the sysfs entries and print the md5/sha1/sha256 SFDP checksum:: + + root@1:~# cat /sys/bus/spi/devices/spi0.0/spi-nor/partname + sst26vf064b + root@1:~# cat /sys/bus/spi/devices/spi0.0/spi-nor/jedec_id + bf2643 + root@1:~# cat /sys/bus/spi/devices/spi0.0/spi-nor/manufacturer + sst + root@1:~# xxd -p /sys/bus/spi/devices/spi0.0/spi-nor/sfdp + 53464450060102ff00060110300000ff81000106000100ffbf0001180002 + 0001fffffffffffffffffffffffffffffffffd20f1ffffffff0344eb086b + 083b80bbfeffffffffff00ffffff440b0c200dd80fd810d820914824806f + 1d81ed0f773830b030b0f7ffffff29c25cfff030c080ffffffffffffffff + ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff + ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff + ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff + ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff + ffffffffffffffffffffffffffffffffff0004fff37f0000f57f0000f9ff + 7d00f57f0000f37f0000ffffffffffffffffffffffffffffffffffffffff + ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff + ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff + ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff + ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff + ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff + ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff + ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff + ffffbf2643ffb95ffdff30f260f332ff0a122346ff0f19320f1919ffffff + ffffffff00669938ff05013506040232b03072428de89888a585c09faf5a + ffff06ec060c0003080bffffffffff07ffff0202ff060300fdfd040700fc + 0300fefe0202070e + root@1:~# sha256sum /sys/bus/spi/devices/spi0.0/spi-nor/sfdp + 428f34d0461876f189ac97f93e68a05fa6428c6650b3b7baf736a921e5898ed1 /sys/bus/spi/devices/spi0.0/spi-nor/sfdp + + Please dump the SFDP tables using ``xxd -p``. It enables us to do + the reverse operation and convert the hexdump to binary with + ``xxd -rp``. Dumping the SFDP data with ``hexdump -Cv`` is accepted, + but less desirable. + +3) Dump debugfs data:: + + root@1:~# cat /sys/kernel/debug/spi-nor/spi0.0/capabilities + Supported read modes by the flash + 1S-1S-1S + opcode 0x03 + mode cycles 0 + dummy cycles 0 + 1S-1S-1S (fast read) + opcode 0x0b + mode cycles 0 + dummy cycles 8 + 1S-1S-2S + opcode 0x3b + mode cycles 0 + dummy cycles 8 + 1S-2S-2S + opcode 0xbb + mode cycles 4 + dummy cycles 0 + 1S-1S-4S + opcode 0x6b + mode cycles 0 + dummy cycles 8 + 1S-4S-4S + opcode 0xeb + mode cycles 2 + dummy cycles 4 + 4S-4S-4S + opcode 0x0b + mode cycles 2 + dummy cycles 4 + + Supported page program modes by the flash + 1S-1S-1S + opcode 0x02 + + root@1:~# cat /sys/kernel/debug/spi-nor/spi0.0/params + name sst26vf064b + id bf 26 43 bf 26 43 + size 8.00 MiB + write size 1 + page size 256 + address nbytes 3 + flags HAS_LOCK | HAS_16BIT_SR | SOFT_RESET | SWP_IS_VOLATILE + + opcodes + read 0xeb + dummy cycles 6 + erase 0x20 + program 0x02 + 8D extension none + + protocols + read 1S-4S-4S + write 1S-1S-1S + register 1S-1S-1S + + erase commands + 20 (4.00 KiB) [0] + d8 (8.00 KiB) [1] + d8 (32.0 KiB) [2] + d8 (64.0 KiB) [3] + c7 (8.00 MiB) + + sector map + region (in hex) | erase mask | flags + ------------------+------------+---------- + 00000000-00007fff | [01 ] | + 00008000-0000ffff | [0 2 ] | + 00010000-007effff | [0 3] | + 007f0000-007f7fff | [0 2 ] | + 007f8000-007fffff | [01 ] | + +4) Use `mtd-utils <https://git.infradead.org/mtd-utils.git>`__ + and verify that erase, read and page program operations work fine:: + + root@1:~# dd if=/dev/urandom of=./spi_test bs=1M count=2 + 2+0 records in + 2+0 records out + 2097152 bytes (2.1 MB, 2.0 MiB) copied, 0.848566 s, 2.5 MB/s + + root@1:~# mtd_debug erase /dev/mtd0 0 2097152 + Erased 2097152 bytes from address 0x00000000 in flash + + root@1:~# mtd_debug read /dev/mtd0 0 2097152 spi_read + Copied 2097152 bytes from address 0x00000000 in flash to spi_read + + root@1:~# hexdump spi_read + 0000000 ffff ffff ffff ffff ffff ffff ffff ffff + * + 0200000 + + root@1:~# sha256sum spi_read + 4bda3a28f4ffe603c0ec1258c0034d65a1a0d35ab7bd523a834608adabf03cc5 spi_read + + root@1:~# mtd_debug write /dev/mtd0 0 2097152 spi_test + Copied 2097152 bytes from spi_test to address 0x00000000 in flash + + root@1:~# mtd_debug read /dev/mtd0 0 2097152 spi_read + Copied 2097152 bytes from address 0x00000000 in flash to spi_read + + root@1:~# sha256sum spi* + c444216a6ba2a4a66cccd60a0dd062bce4b865dd52b200ef5e21838c4b899ac8 spi_read + c444216a6ba2a4a66cccd60a0dd062bce4b865dd52b200ef5e21838c4b899ac8 spi_test + + If the flash comes erased by default and the previous erase was ignored, + we won't catch it, thus test the erase again:: + + root@1:~# mtd_debug erase /dev/mtd0 0 2097152 + Erased 2097152 bytes from address 0x00000000 in flash + + root@1:~# mtd_debug read /dev/mtd0 0 2097152 spi_read + Copied 2097152 bytes from address 0x00000000 in flash to spi_read + + root@1:~# sha256sum spi* + 4bda3a28f4ffe603c0ec1258c0034d65a1a0d35ab7bd523a834608adabf03cc5 spi_read + c444216a6ba2a4a66cccd60a0dd062bce4b865dd52b200ef5e21838c4b899ac8 spi_test + + Dump some other relevant data:: + + root@1:~# mtd_debug info /dev/mtd0 + mtd.type = MTD_NORFLASH + mtd.flags = MTD_CAP_NORFLASH + mtd.size = 8388608 (8M) + mtd.erasesize = 4096 (4K) + mtd.writesize = 1 + mtd.oobsize = 0 + regions = 0 diff --git a/Documentation/driver-api/surface_aggregator/ssh.rst b/Documentation/driver-api/surface_aggregator/ssh.rst index b955b673838b..58a757319931 100644 --- a/Documentation/driver-api/surface_aggregator/ssh.rst +++ b/Documentation/driver-api/surface_aggregator/ssh.rst @@ -39,7 +39,7 @@ Note that the standard disclaimer for this subsystem also applies to this document: All of this has been reverse-engineered and may thus be erroneous and/or incomplete. -All CRCs used in the following are two-byte ``crc_ccitt_false(0xffff, ...)``. +All CRCs used in the following are two-byte ``crc_itu_t(0xffff, ...)``. All multi-byte values are little-endian, there is no implicit padding between values. diff --git a/Documentation/driver-api/wbrf.rst b/Documentation/driver-api/wbrf.rst new file mode 100644 index 000000000000..f48bfa029813 --- /dev/null +++ b/Documentation/driver-api/wbrf.rst @@ -0,0 +1,78 @@ +.. SPDX-License-Identifier: GPL-2.0-or-later + +================================= +WBRF - Wifi Band RFI Mitigations +================================= + +Due to electrical and mechanical constraints in certain platform designs +there may be likely interference of relatively high-powered harmonics of +the GPU memory clocks with local radio module frequency bands used by +certain Wifi bands. + +To mitigate possible RFI interference producers can advertise the +frequencies in use and consumers can use this information to avoid using +these frequencies for sensitive features. + +When a platform is known to have this issue with any contained devices, +the platform designer will advertise the availability of this feature via +ACPI devices with a device specific method (_DSM). +* Producers with this _DSM will be able to advertise the frequencies in use. +* Consumers with this _DSM will be able to register for notifications of +frequencies in use. + +Some general terms +================== + +Producer: such component who can produce high-powered radio frequency +Consumer: such component who can adjust its in-use frequency in +response to the radio frequencies of other components to mitigate the +possible RFI. + +To make the mechanism function, those producers should notify active use +of their particular frequencies so that other consumers can make relative +internal adjustments as necessary to avoid this resonance. + +ACPI interface +============== + +Although initially used by for wifi + dGPU use cases, the ACPI interface +can be scaled to any type of device that a platform designer discovers +can cause interference. + +The GUID used for the _DSM is 7B7656CF-DC3D-4C1C-83E9-66E721DE3070. + +3 functions are available in this _DSM: + +* 0: discover # of functions available +* 1: record RF bands in use +* 2: retrieve RF bands in use + +Driver programming interface +============================ + +.. kernel-doc:: drivers/platform/x86/amd/wbrf.c + +Sample Usage +============= + +The expected flow for the producers: +1. During probe, call `acpi_amd_wbrf_supported_producer` to check if WBRF +can be enabled for the device. +2. On using some frequency band, call `acpi_amd_wbrf_add_remove` with 'add' +param to get other consumers properly notified. +3. Or on stopping using some frequency band, call +`acpi_amd_wbrf_add_remove` with 'remove' param to get other consumers notified. + +The expected flow for the consumers: +1. During probe, call `acpi_amd_wbrf_supported_consumer` to check if WBRF +can be enabled for the device. +2. Call `amd_wbrf_register_notifier` to register for notification +of frequency band change(add or remove) from other producers. +3. Call the `amd_wbrf_retrieve_freq_band` initally to retrieve +current active frequency bands considering some producers may broadcast +such information before the consumer is up. +4. On receiving a notification for frequency band change, run +`amd_wbrf_retrieve_freq_band` again to retrieve the latest +active frequency bands. +5. During driver cleanup, call `amd_wbrf_unregister_notifier` to +unregister the notifier. diff --git a/Documentation/filesystems/fscrypt.rst b/Documentation/filesystems/fscrypt.rst index 1b84f818e574..e86b886b64d0 100644 --- a/Documentation/filesystems/fscrypt.rst +++ b/Documentation/filesystems/fscrypt.rst @@ -31,15 +31,15 @@ However, except for filenames, fscrypt does not encrypt filesystem metadata. Unlike eCryptfs, which is a stacked filesystem, fscrypt is integrated -directly into supported filesystems --- currently ext4, F2FS, and -UBIFS. This allows encrypted files to be read and written without -caching both the decrypted and encrypted pages in the pagecache, -thereby nearly halving the memory used and bringing it in line with -unencrypted files. Similarly, half as many dentries and inodes are -needed. eCryptfs also limits encrypted filenames to 143 bytes, -causing application compatibility issues; fscrypt allows the full 255 -bytes (NAME_MAX). Finally, unlike eCryptfs, the fscrypt API can be -used by unprivileged users, with no need to mount anything. +directly into supported filesystems --- currently ext4, F2FS, UBIFS, +and CephFS. This allows encrypted files to be read and written +without caching both the decrypted and encrypted pages in the +pagecache, thereby nearly halving the memory used and bringing it in +line with unencrypted files. Similarly, half as many dentries and +inodes are needed. eCryptfs also limits encrypted filenames to 143 +bytes, causing application compatibility issues; fscrypt allows the +full 255 bytes (NAME_MAX). Finally, unlike eCryptfs, the fscrypt API +can be used by unprivileged users, with no need to mount anything. fscrypt does not support encrypting files in-place. Instead, it supports marking an empty directory as encrypted. Then, after @@ -1382,7 +1382,8 @@ directory.) These structs are defined as follows:: u8 contents_encryption_mode; u8 filenames_encryption_mode; u8 flags; - u8 __reserved[4]; + u8 log2_data_unit_size; + u8 __reserved[3]; u8 master_key_identifier[FSCRYPT_KEY_IDENTIFIER_SIZE]; u8 nonce[FSCRYPT_FILE_NONCE_SIZE]; }; diff --git a/Documentation/filesystems/fuse-io.rst b/Documentation/filesystems/fuse-io.rst index 255a368fe534..6464de4266ad 100644 --- a/Documentation/filesystems/fuse-io.rst +++ b/Documentation/filesystems/fuse-io.rst @@ -15,7 +15,8 @@ The direct-io mode can be selected with the FOPEN_DIRECT_IO flag in the FUSE_OPEN reply. In direct-io mode the page cache is completely bypassed for reads and writes. -No read-ahead takes place. Shared mmap is disabled. +No read-ahead takes place. Shared mmap is disabled by default. To allow shared +mmap, the FUSE_DIRECT_IO_ALLOW_MMAP flag may be enabled in the FUSE_INIT reply. In cached mode reads may be satisfied from the page cache, and data may be read-ahead by the kernel to fill the cache. The cache is always kept consistent diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index 09cade7eaefc..e18bc5ae3b35 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -121,8 +121,5 @@ Documentation for filesystem implementations. udf virtiofs vfat - xfs-delayed-logging-design - xfs-maintainer-entry-profile - xfs-self-describing-metadata - xfs-online-fsck-design + xfs/index zonefs diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst index 7be2900806c8..421daf837940 100644 --- a/Documentation/filesystems/locking.rst +++ b/Documentation/filesystems/locking.rst @@ -261,7 +261,7 @@ prototypes:: struct folio *src, enum migrate_mode); int (*launder_folio)(struct folio *); bool (*is_partially_uptodate)(struct folio *, size_t from, size_t count); - int (*error_remove_page)(struct address_space *, struct page *); + int (*error_remove_folio)(struct address_space *, struct folio *); int (*swap_activate)(struct swap_info_struct *sis, struct file *f, sector_t *span) int (*swap_deactivate)(struct file *); int (*swap_rw)(struct kiocb *iocb, struct iov_iter *iter); @@ -287,7 +287,7 @@ direct_IO: migrate_folio: yes (both) launder_folio: yes is_partially_uptodate: yes -error_remove_page: yes +error_remove_folio: yes swap_activate: no swap_deactivate: no swap_rw: yes, unlocks diff --git a/Documentation/filesystems/overlayfs.rst b/Documentation/filesystems/overlayfs.rst index 0407f361f32a..1c244866041a 100644 --- a/Documentation/filesystems/overlayfs.rst +++ b/Documentation/filesystems/overlayfs.rst @@ -39,7 +39,7 @@ objects in the original filesystem. On 64bit systems, even if all overlay layers are not on the same underlying filesystem, the same compliant behavior could be achieved with the "xino" feature. The "xino" feature composes a unique object -identifier from the real object st_ino and an underlying fsid index. +identifier from the real object st_ino and an underlying fsid number. The "xino" feature uses the high inode number bits for fsid, because the underlying filesystems rarely use the high inode number bits. In case the underlying inode number does overflow into the high xino bits, overlay @@ -118,7 +118,7 @@ Where both upper and lower objects are directories, a merged directory is formed. At mount time, the two directories given as mount options "lowerdir" and -"upperdir" are combined into a merged directory: +"upperdir" are combined into a merged directory:: mount -t overlay overlay -olowerdir=/lower,upperdir=/upper,\ workdir=/work /merged @@ -172,12 +172,12 @@ directory is being read. This is unlikely to be noticed by many programs. seek offsets are assigned sequentially when the directories are read. -Thus if +Thus if: - - read part of a directory - - remember an offset, and close the directory - - re-open the directory some time later - - seek to the remembered offset + - read part of a directory + - remember an offset, and close the directory + - re-open the directory some time later + - seek to the remembered offset there may be little correlation between the old and new locations in the list of filenames, particularly if anything has changed in the @@ -290,9 +290,9 @@ Permission checking in the overlay filesystem follows these principles: 2) task creating the overlay mount MUST NOT gain additional privileges 3) non-mounting task MAY gain additional privileges through the overlay, - compared to direct access on underlying lower or upper filesystems + compared to direct access on underlying lower or upper filesystems -This is achieved by performing two permission checks on each access +This is achieved by performing two permission checks on each access: a) check if current task is allowed access based on local DAC (owner, group, mode and posix acl), as well as MAC checks @@ -311,11 +311,11 @@ to create setups where the consistency rule (1) does not hold; normally, however, the mounting task will have sufficient privileges to perform all operations. -Another way to demonstrate this model is drawing parallels between +Another way to demonstrate this model is drawing parallels between:: mount -t overlay overlay -olowerdir=/lower,upperdir=/upper,... /merged -and +and:: cp -a /lower /upper mount --bind /upper /merged @@ -328,7 +328,7 @@ Multiple lower layers --------------------- Multiple lower layers can now be given using the colon (":") as a -separator character between the directory names. For example: +separator character between the directory names. For example:: mount -t overlay overlay -olowerdir=/lower1:/lower2:/lower3 /merged @@ -340,13 +340,13 @@ rightmost one and going left. In the above example lower1 will be the top, lower2 the middle and lower3 the bottom layer. Note: directory names containing colons can be provided as lower layer by -escaping the colons with a single backslash. For example: +escaping the colons with a single backslash. For example:: mount -t overlay overlay -olowerdir=/a\:lower\:\:dir /merged Since kernel version v6.8, directory names containing colons can also be configured as lower layer using the "lowerdir+" mount options and the -fsconfig syscall from new mount api. For example: +fsconfig syscall from new mount api. For example:: fsconfig(fs_fd, FSCONFIG_SET_STRING, "lowerdir+", "/a:lower::dir", 0); @@ -356,7 +356,7 @@ as an octal characters (\072) when displayed in /proc/self/mountinfo. Metadata only copy up --------------------- -When metadata only copy up feature is enabled, overlayfs will only copy +When the "metacopy" feature is enabled, overlayfs will only copy up metadata (as opposed to whole file), when a metadata specific operation like chown/chmod is performed. Full file will be copied up later when file is opened for WRITE operation. @@ -405,7 +405,7 @@ A normal lower layer is not allowed to be below a data-only layer, so single colon separators are not allowed to the right of double colon ("::") separators. -For example: +For example:: mount -t overlay overlay -olowerdir=/l1:/l2:/l3::/do1::/do2 /merged @@ -419,7 +419,7 @@ to the absolute path of the "lower data" file in the "data-only" lower layer. Since kernel version v6.8, "data-only" lower layers can also be added using the "datadir+" mount options and the fsconfig syscall from new mount api. -For example: +For example:: fsconfig(fs_fd, FSCONFIG_SET_STRING, "lowerdir+", "/l1", 0); fsconfig(fs_fd, FSCONFIG_SET_STRING, "lowerdir+", "/l2", 0); @@ -429,7 +429,7 @@ For example: fs-verity support ----------------------- +----------------- During metadata copy up of a lower file, if the source file has fs-verity enabled and overlay verity support is enabled, then the @@ -492,27 +492,27 @@ though it will not result in a crash or deadlock. Mounting an overlay using an upper layer path, where the upper layer path was previously used by another mounted overlay in combination with a -different lower layer path, is allowed, unless the "inodes index" feature -or "metadata only copy up" feature is enabled. +different lower layer path, is allowed, unless the "index" or "metacopy" +features are enabled. -With the "inodes index" feature, on the first time mount, an NFS file +With the "index" feature, on the first time mount, an NFS file handle of the lower layer root directory, along with the UUID of the lower filesystem, are encoded and stored in the "trusted.overlay.origin" extended attribute on the upper layer root directory. On subsequent mount attempts, the lower root directory file handle and lower filesystem UUID are compared to the stored origin in upper root directory. On failure to verify the lower root origin, mount will fail with ESTALE. An overlayfs mount with -"inodes index" enabled will fail with EOPNOTSUPP if the lower filesystem +"index" enabled will fail with EOPNOTSUPP if the lower filesystem does not support NFS export, lower filesystem does not have a valid UUID or if the upper filesystem does not support extended attributes. -For "metadata only copy up" feature there is no verification mechanism at +For the "metacopy" feature, there is no verification mechanism at mount time. So if same upper is mounted with different set of lower, mount probably will succeed but expect the unexpected later on. So don't do it. It is quite a common practice to copy overlay layers to a different directory tree on the same or different underlying filesystem, and even -to a different machine. With the "inodes index" feature, trying to mount +to a different machine. With the "index" feature, trying to mount the copied layers will fail the verification of the lower root file handle. Nesting overlayfs mounts @@ -547,20 +547,21 @@ filesystem. This is the list of cases that overlayfs doesn't currently handle: -a) POSIX mandates updating st_atime for reads. This is currently not -done in the case when the file resides on a lower layer. + a) POSIX mandates updating st_atime for reads. This is currently not + done in the case when the file resides on a lower layer. -b) If a file residing on a lower layer is opened for read-only and then -memory mapped with MAP_SHARED, then subsequent changes to the file are not -reflected in the memory mapping. + b) If a file residing on a lower layer is opened for read-only and then + memory mapped with MAP_SHARED, then subsequent changes to the file are not + reflected in the memory mapping. -c) If a file residing on a lower layer is being executed, then opening that -file for write or truncating the file will not be denied with ETXTBSY. + c) If a file residing on a lower layer is being executed, then opening that + file for write or truncating the file will not be denied with ETXTBSY. The following options allow overlayfs to act more like a standards compliant filesystem: -1) "redirect_dir" +redirect_dir +```````````` Enabled with the mount option or module option: "redirect_dir=on" or with the kernel config option CONFIG_OVERLAY_FS_REDIRECT_DIR=y. @@ -568,7 +569,8 @@ the kernel config option CONFIG_OVERLAY_FS_REDIRECT_DIR=y. If this feature is disabled, then rename(2) on a lower or merged directory will fail with EXDEV ("Invalid cross-device link"). -2) "inode index" +index +````` Enabled with the mount option or module option "index=on" or with the kernel config option CONFIG_OVERLAY_FS_INDEX=y. @@ -577,7 +579,8 @@ If this feature is disabled and a file with multiple hard links is copied up, then this will "break" the link. Changes will not be propagated to other names referring to the same inode. -3) "xino" +xino +```` Enabled with the mount option "xino=auto" or "xino=on", with the module option "xino_auto=on" or with the kernel config option @@ -604,7 +607,7 @@ a crash or deadlock. Offline changes, when the overlay is not mounted, are allowed to the upper tree. Offline changes to the lower tree are only allowed if the -"metadata only copy up", "inode index", "xino" and "redirect_dir" features +"metacopy", "index", "xino" and "redirect_dir" features have not been used. If the lower tree is modified and any of these features has been used, the behavior of the overlay is undefined, though it will not result in a crash or deadlock. @@ -644,12 +647,13 @@ directory inode. When encoding a file handle from an overlay filesystem object, the following rules apply: -1. For a non-upper object, encode a lower file handle from lower inode -2. For an indexed object, encode a lower file handle from copy_up origin -3. For a pure-upper object and for an existing non-indexed upper object, - encode an upper file handle from upper inode + 1. For a non-upper object, encode a lower file handle from lower inode + 2. For an indexed object, encode a lower file handle from copy_up origin + 3. For a pure-upper object and for an existing non-indexed upper object, + encode an upper file handle from upper inode The encoded overlay file handle includes: + - Header including path type information (e.g. lower/upper) - UUID of the underlying filesystem - Underlying filesystem encoding of underlying inode @@ -659,15 +663,15 @@ are stored in extended attribute "trusted.overlay.origin". When decoding an overlay file handle, the following steps are followed: -1. Find underlying layer by UUID and path type information. -2. Decode the underlying filesystem file handle to underlying dentry. -3. For a lower file handle, lookup the handle in index directory by name. -4. If a whiteout is found in index, return ESTALE. This represents an - overlay object that was deleted after its file handle was encoded. -5. For a non-directory, instantiate a disconnected overlay dentry from the - decoded underlying dentry, the path type and index inode, if found. -6. For a directory, use the connected underlying decoded dentry, path type - and index, to lookup a connected overlay dentry. + 1. Find underlying layer by UUID and path type information. + 2. Decode the underlying filesystem file handle to underlying dentry. + 3. For a lower file handle, lookup the handle in index directory by name. + 4. If a whiteout is found in index, return ESTALE. This represents an + overlay object that was deleted after its file handle was encoded. + 5. For a non-directory, instantiate a disconnected overlay dentry from the + decoded underlying dentry, the path type and index inode, if found. + 6. For a directory, use the connected underlying decoded dentry, path type + and index, to lookup a connected overlay dentry. Decoding a non-directory file handle may return a disconnected dentry. copy_up of that disconnected dentry will create an upper index entry with @@ -770,9 +774,9 @@ Testsuite There's a testsuite originally developed by David Howells and currently maintained by Amir Goldstein at: - https://github.com/amir73il/unionmount-testsuite.git +https://github.com/amir73il/unionmount-testsuite.git -Run as root: +Run as root:: # cd unionmount-testsuite # ./run --ov --verify diff --git a/Documentation/filesystems/porting.rst b/Documentation/filesystems/porting.rst index 878e72b2f8b7..ced3a6761329 100644 --- a/Documentation/filesystems/porting.rst +++ b/Documentation/filesystems/porting.rst @@ -1061,3 +1061,15 @@ export_operations ->encode_fh() no longer has a default implementation to encode FILEID_INO32_GEN* file handles. Filesystems that used the default implementation may use the generic helper generic_encode_ino32_fh() explicitly. + +--- + +**recommended** + +Block device freezing and thawing have been moved to holder operations. + +Before this change, get_active_super() would only be able to find the +superblock of the main block device, i.e., the one stored in sb->s_bdev. Block +device freezing now works for any block device owned by a given superblock, not +just the main block device. The get_active_super() helper and bd_fsfreeze_sb +pointer are gone. diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst index 49ef12df631b..104c6d047d9b 100644 --- a/Documentation/filesystems/proc.rst +++ b/Documentation/filesystems/proc.rst @@ -528,9 +528,9 @@ replaced by copy-on-write) part of the underlying shmem object out on swap. does not take into account swapped out page of underlying shmem objects. "Locked" indicates whether the mapping is locked in memory or not. -"THPeligible" indicates whether the mapping is eligible for allocating THP -pages as well as the THP is PMD mappable or not - 1 if true, 0 otherwise. -It just shows the current status. +"THPeligible" indicates whether the mapping is eligible for allocating +naturally aligned THP pages of any currently enabled size. 1 if true, 0 +otherwise. "VmFlags" field deserves a separate description. This member represents the kernel flags associated with the particular virtual memory area in two letter diff --git a/Documentation/filesystems/squashfs.rst b/Documentation/filesystems/squashfs.rst index df42106bae71..4af8d6207509 100644 --- a/Documentation/filesystems/squashfs.rst +++ b/Documentation/filesystems/squashfs.rst @@ -64,6 +64,66 @@ obtained from this site also. The squashfs-tools development tree is now located on kernel.org git://git.kernel.org/pub/scm/fs/squashfs/squashfs-tools.git +2.1 Mount options +----------------- +=================== ========================================================= +errors=%s Specify whether squashfs errors trigger a kernel panic + or not + + ========== ============================================= + continue errors don't trigger a panic (default) + panic trigger a panic when errors are encountered, + similar to several other filesystems (e.g. + btrfs, ext4, f2fs, GFS2, jfs, ntfs, ubifs) + + This allows a kernel dump to be saved, + useful for analyzing and debugging the + corruption. + ========== ============================================= +threads=%s Select the decompression mode or the number of threads + + If SQUASHFS_CHOICE_DECOMP_BY_MOUNT is set: + + ========== ============================================= + single use single-threaded decompression (default) + + Only one block (data or metadata) can be + decompressed at any one time. This limits + CPU and memory usage to a minimum, but it + also gives poor performance on parallel I/O + workloads when using multiple CPU machines + due to waiting on decompressor availability. + multi use up to two parallel decompressors per core + + If you have a parallel I/O workload and your + system has enough memory, using this option + may improve overall I/O performance. It + dynamically allocates decompressors on a + demand basis. + percpu use a maximum of one decompressor per core + + It uses percpu variables to ensure + decompression is load-balanced across the + cores. + 1|2|3|... configure the number of threads used for + decompression + + The upper limit is num_online_cpus() * 2. + ========== ============================================= + + If SQUASHFS_CHOICE_DECOMP_BY_MOUNT is **not** set and + SQUASHFS_DECOMP_MULTI, SQUASHFS_MOUNT_DECOMP_THREADS are + both set: + + ========== ============================================= + 2|3|... configure the number of threads used for + decompression + + The upper limit is num_online_cpus() * 2. + ========== ============================================= + +=================== ========================================================= + 3. Squashfs Filesystem Design ----------------------------- diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst index 99acc2e98673..dd99ce5912d8 100644 --- a/Documentation/filesystems/vfs.rst +++ b/Documentation/filesystems/vfs.rst @@ -823,7 +823,7 @@ cache in your filesystem. The following members are defined: bool (*is_partially_uptodate) (struct folio *, size_t from, size_t count); void (*is_dirty_writeback)(struct folio *, bool *, bool *); - int (*error_remove_page) (struct mapping *mapping, struct page *page); + int (*error_remove_folio)(struct mapping *mapping, struct folio *); int (*swap_activate)(struct swap_info_struct *sis, struct file *f, sector_t *span) int (*swap_deactivate)(struct file *); int (*swap_rw)(struct kiocb *iocb, struct iov_iter *iter); @@ -1034,8 +1034,8 @@ cache in your filesystem. The following members are defined: VM if a folio should be treated as dirty or writeback for the purposes of stalling. -``error_remove_page`` - normally set to generic_error_remove_page if truncation is ok +``error_remove_folio`` + normally set to generic_error_remove_folio if truncation is ok for this address space. Used for memory failure handling. Setting this implies you deal with pages going away under you, unless you have them locked or reference counts increased. diff --git a/Documentation/filesystems/xfs/index.rst b/Documentation/filesystems/xfs/index.rst new file mode 100644 index 000000000000..ab66c57a5d18 --- /dev/null +++ b/Documentation/filesystems/xfs/index.rst @@ -0,0 +1,14 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============================ +XFS Filesystem Documentation +============================ + +.. toctree:: + :maxdepth: 2 + :numbered: + + xfs-delayed-logging-design + xfs-maintainer-entry-profile + xfs-self-describing-metadata + xfs-online-fsck-design diff --git a/Documentation/filesystems/xfs-delayed-logging-design.rst b/Documentation/filesystems/xfs/xfs-delayed-logging-design.rst index 6402ab8e370c..6402ab8e370c 100644 --- a/Documentation/filesystems/xfs-delayed-logging-design.rst +++ b/Documentation/filesystems/xfs/xfs-delayed-logging-design.rst diff --git a/Documentation/filesystems/xfs-maintainer-entry-profile.rst b/Documentation/filesystems/xfs/xfs-maintainer-entry-profile.rst index 32b6ac4ca9d6..32b6ac4ca9d6 100644 --- a/Documentation/filesystems/xfs-maintainer-entry-profile.rst +++ b/Documentation/filesystems/xfs/xfs-maintainer-entry-profile.rst diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs/xfs-online-fsck-design.rst index a0678101a7d0..352516feef6f 100644 --- a/Documentation/filesystems/xfs-online-fsck-design.rst +++ b/Documentation/filesystems/xfs/xfs-online-fsck-design.rst @@ -962,7 +962,7 @@ disk, but these buffer verifiers cannot provide any consistency checking between metadata structures. For more information, please see the documentation for -Documentation/filesystems/xfs-self-describing-metadata.rst +Documentation/filesystems/xfs/xfs-self-describing-metadata.rst Reverse Mapping --------------- diff --git a/Documentation/filesystems/xfs-self-describing-metadata.rst b/Documentation/filesystems/xfs/xfs-self-describing-metadata.rst index a10c4ae6955e..a10c4ae6955e 100644 --- a/Documentation/filesystems/xfs-self-describing-metadata.rst +++ b/Documentation/filesystems/xfs/xfs-self-describing-metadata.rst diff --git a/Documentation/i2c/i2c-address-translators.rst b/Documentation/i2c/i2c-address-translators.rst index b22ce9f41ecf..6845c114e472 100644 --- a/Documentation/i2c/i2c-address-translators.rst +++ b/Documentation/i2c/i2c-address-translators.rst @@ -71,7 +71,7 @@ Transaction: - Physical I2C transaction on bus A, slave address 0x20 - ATR chip detects transaction on address 0x20, finds it in table, propagates transaction on bus B with address translated to 0x10, - keeps clock streched on bus A waiting for reply + keeps clock stretched on bus A waiting for reply - Slave X chip (on bus B) detects transaction at its own physical address 0x10 and replies normally - ATR chip stops clock stretching and forwards reply on bus A, diff --git a/Documentation/index.rst b/Documentation/index.rst index 9dfdc826618c..36e61783437c 100644 --- a/Documentation/index.rst +++ b/Documentation/index.rst @@ -113,6 +113,7 @@ to ReStructured Text format, or are simply too old. :maxdepth: 1 staging/index + RAS/ras Translations diff --git a/Documentation/locking/mutex-design.rst b/Documentation/locking/mutex-design.rst index 78540cd7f54b..7c30b4aa5e28 100644 --- a/Documentation/locking/mutex-design.rst +++ b/Documentation/locking/mutex-design.rst @@ -101,6 +101,24 @@ features that make lock debugging easier and faster: - Detects multi-task circular deadlocks and prints out all affected locks and tasks (and only those tasks). +Mutexes - and most other sleeping locks like rwsems - do not provide an +implicit reference for the memory they occupy, which reference is released +with mutex_unlock(). + +[ This is in contrast with spin_unlock() [or completion_done()], which + APIs can be used to guarantee that the memory is not touched by the + lock implementation after spin_unlock()/completion_done() releases + the lock. ] + +mutex_unlock() may access the mutex structure even after it has internally +released the lock already - so it's not safe for another context to +acquire the mutex and assume that the mutex_unlock() context is not using +the structure anymore. + +The mutex user must ensure that the mutex is not destroyed while a +release operation is still in progress - in other words, callers of +mutex_unlock() must ensure that the mutex stays alive until mutex_unlock() +has returned. Interfaces ---------- diff --git a/Documentation/maintainer/maintainer-entry-profile.rst b/Documentation/maintainer/maintainer-entry-profile.rst index 7ad4bfc2cc03..18cee1edaecb 100644 --- a/Documentation/maintainer/maintainer-entry-profile.rst +++ b/Documentation/maintainer/maintainer-entry-profile.rst @@ -105,4 +105,4 @@ to do something different in the near future. ../driver-api/media/maintainer-entry-profile ../driver-api/vfio-pci-device-specific-driver-acceptance ../nvme/feature-and-quirk-policy - ../filesystems/xfs-maintainer-entry-profile + ../filesystems/xfs/xfs-maintainer-entry-profile diff --git a/Documentation/mm/arch_pgtable_helpers.rst b/Documentation/mm/arch_pgtable_helpers.rst index c82e3ee20e51..2466d3363af7 100644 --- a/Documentation/mm/arch_pgtable_helpers.rst +++ b/Documentation/mm/arch_pgtable_helpers.rst @@ -18,8 +18,6 @@ PTE Page Table Helpers +---------------------------+--------------------------------------------------+ | pte_same | Tests whether both PTE entries are the same | +---------------------------+--------------------------------------------------+ -| pte_bad | Tests a non-table mapped PTE | -+---------------------------+--------------------------------------------------+ | pte_present | Tests a valid mapped PTE | +---------------------------+--------------------------------------------------+ | pte_young | Tests a young PTE | diff --git a/Documentation/mm/damon/design.rst b/Documentation/mm/damon/design.rst index 1f7e0586b5fa..1bb69524a62e 100644 --- a/Documentation/mm/damon/design.rst +++ b/Documentation/mm/damon/design.rst @@ -5,6 +5,18 @@ Design ====== +.. _damon_design_execution_model_and_data_structures: + +Execution Model and Data Structures +=================================== + +The monitoring-related information including the monitoring request +specification and DAMON-based operation schemes are stored in a data structure +called DAMON ``context``. DAMON executes each context with a kernel thread +called ``kdamond``. Multiple kdamonds could run in parallel, for different +types of monitoring. + + Overall Architecture ==================== @@ -346,6 +358,19 @@ the weight will be respected are up to the underlying prioritization mechanism implementation. +.. _damon_design_damos_quotas_auto_tuning: + +Aim-oriented Feedback-driven Auto-tuning +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Automatic feedback-driven quota tuning. Instead of setting the absolute quota +value, users can repeatedly provide numbers representing how much of their goal +for the scheme is achieved as feedback. DAMOS then automatically tunes the +aggressiveness (the quota) of the corresponding scheme. For example, if DAMOS +is under achieving the goal, DAMOS automatically increases the quota. If DAMOS +is over achieving the goal, it decreases the quota. + + .. _damon_design_damos_watermarks: Watermarks @@ -477,15 +502,3 @@ modules for proactive reclamation and LRU lists manipulation are provided. For more detail, please read the usage documents for those (:doc:`/admin-guide/mm/damon/reclaim` and :doc:`/admin-guide/mm/damon/lru_sort`). - - -.. _damon_design_execution_model_and_data_structures: - -Execution Model and Data Structures -=================================== - -The monitoring-related information including the monitoring request -specification and DAMON-based operation schemes are stored in a data structure -called DAMON ``context``. DAMON executes each context with a kernel thread -called ``kdamond``. Multiple kdamonds could run in parallel, for different -types of monitoring. diff --git a/Documentation/mm/transhuge.rst b/Documentation/mm/transhuge.rst index 9a607059ea11..93c9239b9ebe 100644 --- a/Documentation/mm/transhuge.rst +++ b/Documentation/mm/transhuge.rst @@ -117,7 +117,7 @@ pages: - map/unmap of a PMD entry for the whole THP increment/decrement folio->_entire_mapcount and also increment/decrement - folio->_nr_pages_mapped by COMPOUND_MAPPED when _entire_mapcount + folio->_nr_pages_mapped by ENTIRELY_MAPPED when _entire_mapcount goes from -1 to 0 or 0 to -1. - map/unmap of individual pages with PTE entry increment/decrement @@ -156,7 +156,7 @@ Partial unmap and deferred_split_folio() Unmapping part of THP (with munmap() or other way) is not going to free memory immediately. Instead, we detect that a subpage of THP is not in use -in page_remove_rmap() and queue the THP for splitting if memory pressure +in folio_remove_rmap_*() and queue the THP for splitting if memory pressure comes. Splitting will free up unused subpages. Splitting the page right away is not an option due to locking context in diff --git a/Documentation/mm/unevictable-lru.rst b/Documentation/mm/unevictable-lru.rst index 67f1338440a5..b6a07a26b10d 100644 --- a/Documentation/mm/unevictable-lru.rst +++ b/Documentation/mm/unevictable-lru.rst @@ -486,7 +486,7 @@ munlock the pages if we're removing the last VM_LOCKED VMA that maps the pages. Before the unevictable/mlock changes, mlocking did not mark the pages in any way, so unmapping them required no processing. -For each PTE (or PMD) being unmapped from a VMA, page_remove_rmap() calls +For each PTE (or PMD) being unmapped from a VMA, folio_remove_rmap_*() calls munlock_vma_folio(), which calls munlock_folio() when the VMA is VM_LOCKED (unless it was a PTE mapping of a part of a transparent huge page). @@ -511,7 +511,7 @@ userspace; truncation even unmaps and deletes any private anonymous pages which had been Copied-On-Write from the file pages now being truncated. Mlocked pages can be munlocked and deleted in this way: like with munmap(), -for each PTE (or PMD) being unmapped from a VMA, page_remove_rmap() calls +for each PTE (or PMD) being unmapped from a VMA, folio_remove_rmap_*() calls munlock_vma_folio(), which calls munlock_folio() when the VMA is VM_LOCKED (unless it was a PTE mapping of a part of a transparent huge page). diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst index 4dfe0d9a57bb..7afff42612e9 100644 --- a/Documentation/networking/ip-sysctl.rst +++ b/Documentation/networking/ip-sysctl.rst @@ -2511,7 +2511,7 @@ temp_valid_lft - INTEGER temp_prefered_lft - INTEGER Preferred lifetime (in seconds) for temporary addresses. If temp_prefered_lft is less than the minimum required lifetime (typically - 5 seconds), the preferred lifetime is the minimum required. If + 5 seconds), temporary addresses will not be created. If temp_prefered_lft is greater than temp_valid_lft, the preferred lifetime is temp_valid_lft. diff --git a/Documentation/networking/packet_mmap.rst b/Documentation/networking/packet_mmap.rst index 30a3be3c48f3..dca15d15feaf 100644 --- a/Documentation/networking/packet_mmap.rst +++ b/Documentation/networking/packet_mmap.rst @@ -263,20 +263,20 @@ the name indicates, this function allocates pages of memory, and the second argument is "order" or a power of two number of pages, that is (for PAGE_SIZE == 4096) order=0 ==> 4096 bytes, order=1 ==> 8192 bytes, order=2 ==> 16384 bytes, etc. The maximum size of a -region allocated by __get_free_pages is determined by the MAX_ORDER macro. More -precisely the limit can be calculated as:: +region allocated by __get_free_pages is determined by the MAX_PAGE_ORDER macro. +More precisely the limit can be calculated as:: - PAGE_SIZE << MAX_ORDER + PAGE_SIZE << MAX_PAGE_ORDER In a i386 architecture PAGE_SIZE is 4096 bytes - In a 2.4/i386 kernel MAX_ORDER is 10 - In a 2.6/i386 kernel MAX_ORDER is 11 + In a 2.4/i386 kernel MAX_PAGE_ORDER is 10 + In a 2.6/i386 kernel MAX_PAGE_ORDER is 11 So get_free_pages can allocate as much as 4MB or 8MB in a 2.4/2.6 kernel respectively, with an i386 architecture. User space programs can include /usr/include/sys/user.h and -/usr/include/linux/mmzone.h to get PAGE_SIZE MAX_ORDER declarations. +/usr/include/linux/mmzone.h to get PAGE_SIZE MAX_PAGE_ORDER declarations. The pagesize can also be determined dynamically with the getpagesize (2) system call. @@ -324,7 +324,7 @@ Definitions: (see /proc/slabinfo) <pointer size> depends on the architecture -- ``sizeof(void *)`` <page size> depends on the architecture -- PAGE_SIZE or getpagesize (2) -<max-order> is the value defined with MAX_ORDER +<max-order> is the value defined with MAX_PAGE_ORDER <frame size> it's an upper bound of frame's capture size (more on this later) ============== ================================================================ diff --git a/Documentation/power/freezing-of-tasks.rst b/Documentation/power/freezing-of-tasks.rst index 53b6a56c4635..df9755bfbd94 100644 --- a/Documentation/power/freezing-of-tasks.rst +++ b/Documentation/power/freezing-of-tasks.rst @@ -14,27 +14,28 @@ architectures). II. How does it work? ===================== -There are three per-task flags used for that, PF_NOFREEZE, PF_FROZEN -and PF_FREEZER_SKIP (the last one is auxiliary). The tasks that have -PF_NOFREEZE unset (all user space processes and some kernel threads) are -regarded as 'freezable' and treated in a special way before the system enters a -suspend state as well as before a hibernation image is created (in what follows -we only consider hibernation, but the description also applies to suspend). +There is one per-task flag (PF_NOFREEZE) and three per-task states +(TASK_FROZEN, TASK_FREEZABLE and __TASK_FREEZABLE_UNSAFE) used for that. +The tasks that have PF_NOFREEZE unset (all user space tasks and some kernel +threads) are regarded as 'freezable' and treated in a special way before the +system enters a sleep state as well as before a hibernation image is created +(hibernation is directly covered by what follows, but the description applies +to system-wide suspend too). Namely, as the first step of the hibernation procedure the function freeze_processes() (defined in kernel/power/process.c) is called. A system-wide -variable system_freezing_cnt (as opposed to a per-task flag) is used to indicate -whether the system is to undergo a freezing operation. And freeze_processes() -sets this variable. After this, it executes try_to_freeze_tasks() that sends a -fake signal to all user space processes, and wakes up all the kernel threads. -All freezable tasks must react to that by calling try_to_freeze(), which -results in a call to __refrigerator() (defined in kernel/freezer.c), which sets -the task's PF_FROZEN flag, changes its state to TASK_UNINTERRUPTIBLE and makes -it loop until PF_FROZEN is cleared for it. Then, we say that the task is -'frozen' and therefore the set of functions handling this mechanism is referred -to as 'the freezer' (these functions are defined in kernel/power/process.c, -kernel/freezer.c & include/linux/freezer.h). User space processes are generally -frozen before kernel threads. +static key freezer_active (as opposed to a per-task flag or state) is used to +indicate whether the system is to undergo a freezing operation. And +freeze_processes() sets this static key. After this, it executes +try_to_freeze_tasks() that sends a fake signal to all user space processes, and +wakes up all the kernel threads. All freezable tasks must react to that by +calling try_to_freeze(), which results in a call to __refrigerator() (defined +in kernel/freezer.c), which changes the task's state to TASK_FROZEN, and makes +it loop until it is woken by an explicit TASK_FROZEN wakeup. Then, that task +is regarded as 'frozen' and so the set of functions handling this mechanism is +referred to as 'the freezer' (these functions are defined in +kernel/power/process.c, kernel/freezer.c & include/linux/freezer.h). User space +tasks are generally frozen before kernel threads. __refrigerator() must not be called directly. Instead, use the try_to_freeze() function (defined in include/linux/freezer.h), that checks @@ -43,31 +44,40 @@ if the task is to be frozen and makes the task enter __refrigerator(). For user space processes try_to_freeze() is called automatically from the signal-handling code, but the freezable kernel threads need to call it explicitly in suitable places or use the wait_event_freezable() or -wait_event_freezable_timeout() macros (defined in include/linux/freezer.h) -that combine interruptible sleep with checking if the task is to be frozen and -calling try_to_freeze(). The main loop of a freezable kernel thread may look +wait_event_freezable_timeout() macros (defined in include/linux/wait.h) +that put the task to sleep (TASK_INTERRUPTIBLE) or freeze it (TASK_FROZEN) if +freezer_active is set. The main loop of a freezable kernel thread may look like the following one:: set_freezable(); - do { - hub_events(); - wait_event_freezable(khubd_wait, - !list_empty(&hub_event_list) || - kthread_should_stop()); - } while (!kthread_should_stop() || !list_empty(&hub_event_list)); - -(from drivers/usb/core/hub.c::hub_thread()). - -If a freezable kernel thread fails to call try_to_freeze() after the freezer has -initiated a freezing operation, the freezing of tasks will fail and the entire -hibernation operation will be cancelled. For this reason, freezable kernel -threads must call try_to_freeze() somewhere or use one of the + + while (true) { + struct task_struct *tsk = NULL; + + wait_event_freezable(oom_reaper_wait, oom_reaper_list != NULL); + spin_lock_irq(&oom_reaper_lock); + if (oom_reaper_list != NULL) { + tsk = oom_reaper_list; + oom_reaper_list = tsk->oom_reaper_list; + } + spin_unlock_irq(&oom_reaper_lock); + + if (tsk) + oom_reap_task(tsk); + } + +(from mm/oom_kill.c::oom_reaper()). + +If a freezable kernel thread is not put to the frozen state after the freezer +has initiated a freezing operation, the freezing of tasks will fail and the +entire system-wide transition will be cancelled. For this reason, freezable +kernel threads must call try_to_freeze() somewhere or use one of the wait_event_freezable() and wait_event_freezable_timeout() macros. After the system memory state has been restored from a hibernation image and devices have been reinitialized, the function thaw_processes() is called in -order to clear the PF_FROZEN flag for each frozen task. Then, the tasks that -have been frozen leave __refrigerator() and continue running. +order to wake up each frozen task. Then, the tasks that have been frozen leave +__refrigerator() and continue running. Rationale behind the functions dealing with freezing and thawing of tasks @@ -96,7 +106,8 @@ III. Which kernel threads are freezable? Kernel threads are not freezable by default. However, a kernel thread may clear PF_NOFREEZE for itself by calling set_freezable() (the resetting of PF_NOFREEZE directly is not allowed). From this point it is regarded as freezable -and must call try_to_freeze() in a suitable place. +and must call try_to_freeze() or variants of wait_event_freezable() in a +suitable place. IV. Why do we do that? ====================== diff --git a/Documentation/scheduler/sched-design-CFS.rst b/Documentation/scheduler/sched-design-CFS.rst index f68919800f05..6cffffe26500 100644 --- a/Documentation/scheduler/sched-design-CFS.rst +++ b/Documentation/scheduler/sched-design-CFS.rst @@ -180,7 +180,7 @@ This is the (partial) list of the hooks: compat_yield sysctl is turned on; in that case, it places the scheduling entity at the right-most end of the red-black tree. - - check_preempt_curr(...) + - wakeup_preempt(...) This function checks if a task that entered the runnable state should preempt the currently running task. @@ -189,10 +189,10 @@ This is the (partial) list of the hooks: This function chooses the most appropriate task eligible to run next. - - set_curr_task(...) + - set_next_task(...) - This function is called when a task changes its scheduling class or changes - its task group. + This function is called when a task changes its scheduling class, changes + its task group or is scheduled. - task_tick(...) diff --git a/Documentation/scheduler/schedutil.rst b/Documentation/scheduler/schedutil.rst index 32c7d69fc86c..803fba8fc714 100644 --- a/Documentation/scheduler/schedutil.rst +++ b/Documentation/scheduler/schedutil.rst @@ -90,8 +90,8 @@ For more detail see: - Documentation/scheduler/sched-capacity.rst:"1. CPU Capacity + 2. Task utilization" -UTIL_EST / UTIL_EST_FASTUP -========================== +UTIL_EST +======== Because periodic tasks have their averages decayed while they sleep, even though when running their expected utilization will be the same, they suffer a @@ -99,8 +99,7 @@ though when running their expected utilization will be the same, they suffer a To alleviate this (a default enabled option) UTIL_EST drives an Infinite Impulse Response (IIR) EWMA with the 'running' value on dequeue -- when it is -highest. A further default enabled option UTIL_EST_FASTUP modifies the IIR -filter to instantly increase and only decay on decrease. +highest. UTIL_EST filters to instantly increase and only decay on decrease. A further runqueue wide sum (of runnable tasks) is maintained of: diff --git a/Documentation/spi/pxa2xx.rst b/Documentation/spi/pxa2xx.rst index 04f2a3856c40..19479b801826 100644 --- a/Documentation/spi/pxa2xx.rst +++ b/Documentation/spi/pxa2xx.rst @@ -3,13 +3,13 @@ PXA2xx SPI on SSP driver HOWTO ============================== This a mini HOWTO on the pxa2xx_spi driver. The driver turns a PXA2xx -synchronous serial port into an SPI master controller +synchronous serial port into an SPI host controller (see Documentation/spi/spi-summary.rst). The driver has the following features - Support for any PXA2xx and compatible SSP. - SSP PIO and SSP DMA data transfers. - External and Internal (SSPFRM) chip selects. -- Per slave device (chip) configuration. +- Per peripheral device (chip) configuration. - Full suspend, freeze, resume support. The driver is built around a &struct spi_message FIFO serviced by kernel @@ -17,10 +17,10 @@ thread. The kernel thread, spi_pump_messages(), drives message FIFO and is responsible for queuing SPI transactions and setting up and launching the DMA or interrupt driven transfers. -Declaring PXA2xx Master Controllers ------------------------------------ -Typically, for a legacy platform, an SPI master is defined in the -arch/.../mach-*/board-*.c as a "platform device". The master configuration +Declaring PXA2xx host controllers +--------------------------------- +Typically, for a legacy platform, an SPI host controller is defined in the +arch/.../mach-*/board-*.c as a "platform device". The host controller configuration is passed to the driver via a table found in include/linux/spi/pxa2xx_spi.h:: struct pxa2xx_spi_controller { @@ -30,7 +30,7 @@ is passed to the driver via a table found in include/linux/spi/pxa2xx_spi.h:: }; The "pxa2xx_spi_controller.num_chipselect" field is used to determine the number of -slave device (chips) attached to this SPI master. +peripheral devices (chips) attached to this SPI host controller. The "pxa2xx_spi_controller.enable_dma" field informs the driver that SSP DMA should be used. This caused the driver to acquire two DMA channels: Rx channel and @@ -40,8 +40,8 @@ See the "PXA2xx Developer Manual" section "DMA Controller". For the new platforms the description of the controller and peripheral devices comes from Device Tree or ACPI. -NSSP MASTER SAMPLE ------------------- +NSSP HOST SAMPLE +---------------- Below is a sample configuration using the PXA255 NSSP for a legacy platform:: static struct resource pxa_spi_nssp_resources[] = { @@ -57,7 +57,7 @@ Below is a sample configuration using the PXA255 NSSP for a legacy platform:: }, }; - static struct pxa2xx_spi_controller pxa_nssp_master_info = { + static struct pxa2xx_spi_controller pxa_nssp_controller_info = { .num_chipselect = 1, /* Matches the number of chips attached to NSSP */ .enable_dma = 1, /* Enables NSSP DMA */ }; @@ -68,7 +68,7 @@ Below is a sample configuration using the PXA255 NSSP for a legacy platform:: .resource = pxa_spi_nssp_resources, .num_resources = ARRAY_SIZE(pxa_spi_nssp_resources), .dev = { - .platform_data = &pxa_nssp_master_info, /* Passed to driver */ + .platform_data = &pxa_nssp_controller_info, /* Passed to driver */ }, }; @@ -81,17 +81,17 @@ Below is a sample configuration using the PXA255 NSSP for a legacy platform:: (void)platform_add_device(devices, ARRAY_SIZE(devices)); } -Declaring Slave Devices ------------------------ -Typically, for a legacy platform, each SPI slave (chip) is defined in the +Declaring peripheral devices +---------------------------- +Typically, for a legacy platform, each SPI peripheral device (chip) is defined in the arch/.../mach-*/board-*.c using the "spi_board_info" structure found in "linux/spi/spi.h". See "Documentation/spi/spi-summary.rst" for additional information. -Each slave device attached to the PXA must provide slave specific configuration +Each peripheral device (chip) attached to the PXA2xx must provide specific chip configuration information via the structure "pxa2xx_spi_chip" found in -"include/linux/spi/pxa2xx_spi.h". The pxa2xx_spi master controller driver -will uses the configuration whenever the driver communicates with the slave +"include/linux/spi/pxa2xx_spi.h". The PXA2xx host controller driver will use +the configuration whenever the driver communicates with the peripheral device. All fields are optional. :: @@ -123,7 +123,7 @@ dma_burst_size == 0. The "pxa2xx_spi_chip.timeout" fields is used to efficiently handle trailing bytes in the SSP receiver FIFO. The correct value for this field is dependent on the SPI bus speed ("spi_board_info.max_speed_hz") and the specific -slave device. Please note that the PXA2xx SSP 1 does not support trailing byte +peripheral device. Please note that the PXA2xx SSP 1 does not support trailing byte timeouts and must busy-wait any trailing bytes. NOTE: the SPI driver cannot control the chip select if SSPFRM is used, so the @@ -132,8 +132,8 @@ asserted around the complete message. Use SSPFRM as a GPIO (through a descriptor to accommodate these chips. -NSSP SLAVE SAMPLE ------------------ +NSSP PERIPHERAL SAMPLE +---------------------- For a legacy platform or in some other cases, the pxa2xx_spi_chip structure is passed to the pxa2xx_spi driver in the "spi_board_info.controller_data" field. Below is a sample configuration using the PXA255 NSSP. @@ -161,16 +161,16 @@ field. Below is a sample configuration using the PXA255 NSSP. .bus_num = 2, /* Framework bus number */ .chip_select = 0, /* Framework chip select */ .platform_data = NULL; /* No spi_driver specific config */ - .controller_data = &cs8415a_chip_info, /* Master chip config */ - .irq = STREETRACER_APCI_IRQ, /* Slave device interrupt */ + .controller_data = &cs8415a_chip_info, /* Host controller config */ + .irq = STREETRACER_APCI_IRQ, /* Peripheral device interrupt */ }, { .modalias = "cs8405a", /* Name of spi_driver for this device */ .max_speed_hz = 3686400, /* Run SSP as fast a possible */ .bus_num = 2, /* Framework bus number */ .chip_select = 1, /* Framework chip select */ - .controller_data = &cs8405a_chip_info, /* Master chip config */ - .irq = STREETRACER_APCI_IRQ, /* Slave device interrupt */ + .controller_data = &cs8405a_chip_info, /* Host controller config */ + .irq = STREETRACER_APCI_IRQ, /* Peripheral device interrupt */ }, }; @@ -193,17 +193,14 @@ mode supports both coherent and stream based DMA mappings. The following logic is used to determine the type of I/O to be used on a per "spi_transfer" basis:: - if !enable_dma then - always use PIO transfers + if spi_message.len > 65536 then + if spi_message.is_dma_mapped or rx_dma_buf != 0 or tx_dma_buf != 0 then + reject premapped transfers - if spi_message.len > 8191 then print "rate limited" warning use PIO transfers - if spi_message.is_dma_mapped and rx_dma_buf != 0 and tx_dma_buf != 0 then - use coherent DMA mode - - if rx_buf and tx_buf are aligned on 8 byte boundary then + if enable_dma and the size is in the range [DMA burst size..65536] then use streaming DMA mode otherwise diff --git a/Documentation/translations/ja_JP/SubmitChecklist b/Documentation/translations/ja_JP/SubmitChecklist index 4429447b0965..1759c6b452d6 100644 --- a/Documentation/translations/ja_JP/SubmitChecklist +++ b/Documentation/translations/ja_JP/SubmitChecklist @@ -56,8 +56,8 @@ Linux カーネルパッチ投稿者向けチェックリスト 9: sparseを利用してちゃんとしたコードチェックをしてください。 -10: 'make checkstack' と 'make namespacecheck' を利用し、問題が発見されたら - 修正してください。'make checkstack' は明示的に問題を示しませんが、どれか +10: 'make checkstack' を利用し、問題が発見されたら修正してください。 + 'make checkstack' は明示的に問題を示しませんが、どれか 1つの関数が512バイトより大きいスタックを使っていれば、修正すべき候補と なります。 diff --git a/Documentation/translations/zh_CN/process/submit-checklist.rst b/Documentation/translations/zh_CN/process/submit-checklist.rst index 3d6ee21c74ae..10536b74aeec 100644 --- a/Documentation/translations/zh_CN/process/submit-checklist.rst +++ b/Documentation/translations/zh_CN/process/submit-checklist.rst @@ -53,8 +53,7 @@ Linux内核补丁提交检查单 9) 通过 sparse 清查。 (参见 Documentation/translations/zh_CN/dev-tools/sparse.rst ) -10) 使用 ``make checkstack`` 和 ``make namespacecheck`` 并修复他们发现的任何 - 问题。 +10) 使用 ``make checkstack`` 并修复他们发现的任何问题。 .. note:: diff --git a/Documentation/translations/zh_CN/scheduler/sched-design-CFS.rst b/Documentation/translations/zh_CN/scheduler/sched-design-CFS.rst index 3076402406c4..abc6709ec3b2 100644 --- a/Documentation/translations/zh_CN/scheduler/sched-design-CFS.rst +++ b/Documentation/translations/zh_CN/scheduler/sched-design-CFS.rst @@ -80,7 +80,7 @@ p->se.vruntime。一旦p->se.vruntime变得足够大,其它的任务将成为 CFS使用纳秒粒度的计时,不依赖于任何jiffies或HZ的细节。因此CFS并不像之前的调度器那样 有“时间片”的概念,也没有任何启发式的设计。唯一可调的参数(你需要打开CONFIG_SCHED_DEBUG)是: - /sys/kernel/debug/sched/min_granularity_ns + /sys/kernel/debug/sched/base_slice_ns 它可以用来将调度器从“桌面”模式(也就是低时延)调节为“服务器”(也就是高批处理)模式。 它的默认设置是适合桌面的工作负载。SCHED_BATCH也被CFS调度器模块处理。 @@ -147,7 +147,7 @@ array)。 这个函数的行为基本上是出队,紧接着入队,除非compat_yield sysctl被开启。在那种情况下, 它将调度实体放在红黑树的最右端。 - - check_preempt_curr(...) + - wakeup_preempt(...) 这个函数检查进入可运行状态的任务能否抢占当前正在运行的任务。 @@ -155,9 +155,9 @@ array)。 这个函数选择接下来最适合运行的任务。 - - set_curr_task(...) + - set_next_task(...) - 这个函数在任务改变调度类或改变任务组时被调用。 + 这个函数在任务改变调度类,改变任务组时,或者任务被调度时被调用。 - task_tick(...) diff --git a/Documentation/translations/zh_CN/scheduler/schedutil.rst b/Documentation/translations/zh_CN/scheduler/schedutil.rst index d1ea68007520..7c8d87f21c42 100644 --- a/Documentation/translations/zh_CN/scheduler/schedutil.rst +++ b/Documentation/translations/zh_CN/scheduler/schedutil.rst @@ -89,16 +89,15 @@ r_cpu被定义为当前CPU的最高性能水平与系统中任何其它CPU的最 - Documentation/translations/zh_CN/scheduler/sched-capacity.rst:"1. CPU Capacity + 2. Task utilization" -UTIL_EST / UTIL_EST_FASTUP -========================== +UTIL_EST +======== 由于周期性任务的平均数在睡眠时会衰减,而在运行时其预期利用率会和睡眠前相同, 因此它们在再次运行后会面临(DVFS)的上涨。 为了缓解这个问题,(一个默认使能的编译选项)UTIL_EST驱动一个无限脉冲响应 (Infinite Impulse Response,IIR)的EWMA,“运行”值在出队时是最高的。 -另一个默认使能的编译选项UTIL_EST_FASTUP修改了IIR滤波器,使其允许立即增加, -仅在利用率下降时衰减。 +UTIL_EST滤波使其在遇到更高值时立刻增加,而遇到低值时会缓慢衰减。 进一步,运行队列的(可运行任务的)利用率之和由下式计算: diff --git a/Documentation/translations/zh_TW/process/submit-checklist.rst b/Documentation/translations/zh_TW/process/submit-checklist.rst index 942962d1e2f4..dda456a73147 100644 --- a/Documentation/translations/zh_TW/process/submit-checklist.rst +++ b/Documentation/translations/zh_TW/process/submit-checklist.rst @@ -56,8 +56,7 @@ Linux內核補丁提交檢查單 9) 通過 sparse 清查。 (參見 Documentation/translations/zh_CN/dev-tools/sparse.rst ) -10) 使用 ``make checkstack`` 和 ``make namespacecheck`` 並修復他們發現的任何 - 問題。 +10) 使用 ``make checkstack`` 並修復他們發現的任何問題。 .. note:: diff --git a/Documentation/userspace-api/index.rst b/Documentation/userspace-api/index.rst index 031df47a7c19..8be8b1979194 100644 --- a/Documentation/userspace-api/index.rst +++ b/Documentation/userspace-api/index.rst @@ -33,6 +33,7 @@ place where this information is gathered. sysfs-platform_profile vduse futex2 + lsm .. only:: subproject and html diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst index 4ea5b837399a..d8b6cb1a3636 100644 --- a/Documentation/userspace-api/ioctl/ioctl-number.rst +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst @@ -349,6 +349,10 @@ Code Seq# Include File Comments <mailto:vgo@ratio.de> 0xB1 00-1F PPPoX <mailto:mostrows@styx.uwaterloo.ca> +0xB2 00 arch/powerpc/include/uapi/asm/papr-vpd.h powerpc/pseries VPD API + <mailto:linuxppc-dev> +0xB2 01-02 arch/powerpc/include/uapi/asm/papr-sysparm.h powerpc/pseries system parameter API + <mailto:linuxppc-dev> 0xB3 00 linux/mmc/ioctl.h 0xB4 00-0F linux/gpio.h <mailto:linux-gpio@vger.kernel.org> 0xB5 00-0F uapi/linux/rpmsg.h <mailto:linux-remoteproc@vger.kernel.org> diff --git a/Documentation/userspace-api/lsm.rst b/Documentation/userspace-api/lsm.rst new file mode 100644 index 000000000000..a76da373841b --- /dev/null +++ b/Documentation/userspace-api/lsm.rst @@ -0,0 +1,73 @@ +.. SPDX-License-Identifier: GPL-2.0 +.. Copyright (C) 2022 Casey Schaufler <casey@schaufler-ca.com> +.. Copyright (C) 2022 Intel Corporation + +===================================== +Linux Security Modules +===================================== + +:Author: Casey Schaufler +:Date: July 2023 + +Linux security modules (LSM) provide a mechanism to implement +additional access controls to the Linux security policies. + +The various security modules may support any of these attributes: + +``LSM_ATTR_CURRENT`` is the current, active security context of the +process. +The proc filesystem provides this value in ``/proc/self/attr/current``. +This is supported by the SELinux, Smack and AppArmor security modules. +Smack also provides this value in ``/proc/self/attr/smack/current``. +AppArmor also provides this value in ``/proc/self/attr/apparmor/current``. + +``LSM_ATTR_EXEC`` is the security context of the process at the time the +current image was executed. +The proc filesystem provides this value in ``/proc/self/attr/exec``. +This is supported by the SELinux and AppArmor security modules. +AppArmor also provides this value in ``/proc/self/attr/apparmor/exec``. + +``LSM_ATTR_FSCREATE`` is the security context of the process used when +creating file system objects. +The proc filesystem provides this value in ``/proc/self/attr/fscreate``. +This is supported by the SELinux security module. + +``LSM_ATTR_KEYCREATE`` is the security context of the process used when +creating key objects. +The proc filesystem provides this value in ``/proc/self/attr/keycreate``. +This is supported by the SELinux security module. + +``LSM_ATTR_PREV`` is the security context of the process at the time the +current security context was set. +The proc filesystem provides this value in ``/proc/self/attr/prev``. +This is supported by the SELinux and AppArmor security modules. +AppArmor also provides this value in ``/proc/self/attr/apparmor/prev``. + +``LSM_ATTR_SOCKCREATE`` is the security context of the process used when +creating socket objects. +The proc filesystem provides this value in ``/proc/self/attr/sockcreate``. +This is supported by the SELinux security module. + +Kernel interface +================ + +Set a security attribute of the current process +----------------------------------------------- + +.. kernel-doc:: security/lsm_syscalls.c + :identifiers: sys_lsm_set_self_attr + +Get the specified security attributes of the current process +------------------------------------------------------------ + +.. kernel-doc:: security/lsm_syscalls.c + :identifiers: sys_lsm_get_self_attr + +.. kernel-doc:: security/lsm_syscalls.c + :identifiers: sys_lsm_list_modules + +Additional documentation +======================== + +* Documentation/security/lsm.rst +* Documentation/security/lsm-development.rst |