Diffstat (limited to 'Documentation/driver-api/thermal')
-rw-r--r-- Documentation/driver-api/thermal/cpu-idle-cooling.rst 199
-rw-r--r-- Documentation/driver-api/thermal/exynos_thermal.rst 8
-rw-r--r-- Documentation/driver-api/thermal/index.rst 3
-rw-r--r-- Documentation/driver-api/thermal/intel_dptf.rst 381
-rw-r--r-- Documentation/driver-api/thermal/intel_powerclamp.rst 320
-rw-r--r-- Documentation/driver-api/thermal/nouveau_thermal.rst 4
-rw-r--r-- Documentation/driver-api/thermal/power_allocator.rst 12
-rw-r--r-- Documentation/driver-api/thermal/sysfs-api.rst 319
8 files changed, 614 insertions, 632 deletions
diff --git a/Documentation/driver-api/thermal/cpu-idle-cooling.rst b/Documentation/driver-api/thermal/cpu-idle-cooling.rst
new file mode 100644
index 000000000000..c2a7ca676853
--- /dev/null
+++ b/Documentation/driver-api/thermal/cpu-idle-cooling.rst
@@ -0,0 +1,199 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+================
+CPU Idle Cooling
+================
+
+Situation:
+----------
+
+Under certain circumstances a SoC can reach a critical temperature
+limit and is unable to stabilize the temperature around a temperature
+control. When the SoC has to stabilize the temperature, the kernel can
+act on a cooling device to mitigate the dissipated power. When the
+critical temperature is reached, a decision must be taken to reduce
+the temperature, which in turn impacts performance.
+
+Another situation is when the silicon temperature continues to
+increase even after the dynamic leakage is reduced to its minimum by
+clock gating the component. This runaway phenomenon can continue due
+to the static leakage. The only solution is to power down the
+component, thus dropping the dynamic and static leakage that will
+allow the component to cool down.
+
+Last but not least, the system can ask for a specific power budget but
+because of the OPP density, we can only choose an OPP with a power
+budget lower than the requested one and under-utilize the CPU, thus
+losing performance. In other words, one OPP under-utilizes the CPU
+with a power less than the requested power budget and the next OPP
+exceeds the power budget. An intermediate OPP could have been used if
+it were present.
+
+Solutions:
+----------
+
+If we can remove the static and the dynamic leakage for a specific
+duration in a controlled period, the SoC temperature will
+decrease. Acting on the idle state duration or the idle cycle
+injection period, we can mitigate the temperature by modulating the
+power budget.
+
+The Operating Performance Point (OPP) density has a great influence on
+the control precision of cpufreq. However, OPP density varies widely
+between vendors, and some have a large power gap between OPPs, which
+results in a loss of performance during thermal control and a loss of
+power in other scenarios.
+
+At a specific OPP, we can assume that injecting idle cycles on all CPUs
+belonging to the same cluster, with a duration greater than the cluster
+idle state target residency, drops the static and the dynamic leakage
+for this period (modulo the energy needed to enter this state). So the
+sustainable power with idle cycles has a linear relation with the
+OPP’s sustainable power and can be computed with a coefficient
+similar to::
+
+ Power(IdleCycle) = Coef x Power(OPP)
+
+Idle Injection:
+---------------
+
+The base concept of the idle injection is to force the CPU to go to an
+idle state for a specified time each control cycle. It provides
+another way to control CPU power and heat in addition to
+cpufreq. Ideally, if all CPUs belonging to the same cluster inject
+their idle cycles synchronously, the cluster can reach its power down
+state with a minimum power consumption and reduce the static leakage
+to almost zero. However, these idle cycle injections add extra
+latencies as the CPUs have to wake up from a deep sleep state.
+
+We use a fixed duration of idle injection that gives an acceptable
+performance penalty and a fixed latency. Mitigation can be increased
+or decreased by modulating the duty cycle of the idle injection.
+
+::
+
+ ^
+ |
+ |
+ |------- -------
+ |_______|_______________________|_______|___________
+
+ <------>
+ idle <---------------------->
+ running
+
+ <----------------------------->
+ duty cycle 25%
+
+
+The implementation of the cooling device bases the number of states on
+the duty cycle percentage. When no mitigation is happening the cooling
+device state is zero, meaning the duty cycle is 0%.
+
+When the mitigation begins, depending on the governor's policy, a
+starting state is selected. With a fixed idle duration and the duty
+cycle (aka the cooling device state), the running duration can be
+computed.
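
As a rough illustration of the computation described above, the sketch below derives the running duration from a fixed idle duration and a duty cycle. It is not the kernel implementation: the function name and the microsecond units are assumptions made for this example, and the duty cycle is taken to be idle / (idle + running), as in the diagrams.

```python
def running_duration_us(idle_duration_us: int, duty_cycle_pct: int) -> int:
    """Return the running time per period for a fixed idle time and duty cycle.

    The duty cycle is idle / (idle + running), expressed as a percentage,
    so running = idle * (100 - duty) / duty.
    """
    if not 0 < duty_cycle_pct <= 100:
        # A duty cycle of 0% means no idle injection at all, so there is
        # no finite running duration to compute.
        raise ValueError("duty cycle must be in the range 1..100")
    return idle_duration_us * (100 - duty_cycle_pct) // duty_cycle_pct


# Hypothetical fixed idle injection of 10 ms.
for pct in (25, 50):
    print(pct, running_duration_us(10_000, pct))
```

With a 25% duty cycle the CPU runs three times as long as it idles, and with a 50% duty cycle the two durations are equal.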
+
+The governor changes the cooling device state, and thus the duty
+cycle, and this variation modulates the cooling effect.
+
+::
+
+ ^
+ |
+ |
+ |------- -------
+ |_______|_______________|_______|___________
+
+ <------>
+ idle <-------------->
+ running
+
+ <--------------------->
+ duty cycle 33%
+
+
+ ^
+ |
+ |
+ |------- -------
+ |_______|_______|_______|___________
+
+ <------>
+ idle <------>
+ running
+
+ <------------->
+ duty cycle 50%
+
+The idle injection duration value must comply with the constraints:
+
+- It is less than or equal to the latency we tolerate when the
+ mitigation begins. It is platform dependent and will depend on the
+ user experience, reactivity vs performance trade off we want. This
+ value should be specified.
+
+- It is greater than the target residency of the idle state we want
+ to enter for thermal mitigation, otherwise we end up consuming more
+ energy.
+
+Power considerations
+--------------------
+
+When we reach the thermal trip point, we have to sustain a specified
+power for a specific temperature but at this time we consume::
+
+ Power = Capacitance x Voltage^2 x Frequency x Utilisation
+
+... which is more than the sustainable power (or there is something
+wrong in the system setup). The ‘Capacitance’ and ‘Utilisation’ are
+fixed values, ‘Voltage’ and the ‘Frequency’ are fixed artificially
+because we don’t want to change the OPP. We can group the
+‘Capacitance’ and the ‘Utilisation’ into a single term which is the
+‘Dynamic Power Coefficient (Cdyn)’. Simplifying the above, we have::
+
+ Pdyn = Cdyn x Voltage^2 x Frequency
+
+The power allocator governor will ask us somehow to reduce our power
+in order to target the sustainable power defined in the device
+tree. So with the idle injection mechanism, we want an average power
+(Ptarget) resulting in an amount of time running at full power on a
+specific OPP and idle another amount of time. That could be put in an
+equation::
+
+ P(opp)target = ((Trunning x P(opp)running) + (Tidle x P(opp)idle)) /
+ (Trunning + Tidle)
+
+ ...
+
+ Tidle = Trunning x ((P(opp)running / P(opp)target) - 1)
+
+At this point if we know the running period for the CPU, that gives us
+the idle injection we need. Alternatively if we have the idle
+injection duration, we can compute the running duration with::
+
+ Trunning = Tidle / ((P(opp)running / P(opp)target) - 1)
+
+Practically, if the running power is less than the targeted power, we
+end up with a negative time value, so obviously the equation usage is
+bound to a power reduction, hence a higher OPP is needed to have the
+running power greater than the targeted power.
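
The relation between the running and idle durations can be checked numerically. The following is a minimal sketch under the same simplification as the equations above (the idle power is taken as zero); the function name and the milliwatt and millisecond units are assumptions chosen for the example.

```python
def running_time(t_idle: float, p_running: float, p_target: float) -> float:
    """Trunning = Tidle / ((P(opp)running / P(opp)target) - 1).

    The idle power is assumed to be zero, as in the simplified equations.
    """
    if p_target <= 0:
        raise ValueError("target power must be positive")
    if p_running <= p_target:
        # The formula would yield a negative or undefined duration: idle
        # injection can only reduce the average power.
        raise ValueError("running power must exceed the target power")
    return t_idle / (p_running / p_target - 1.0)


# Hypothetical numbers: 10 ms idle, 1500 mW at the current OPP,
# 1000 mW requested on average.
t_run = running_time(10.0, 1500.0, 1000.0)
average = (t_run * 1500.0) / (t_run + 10.0)
print(t_run, average)
```

Feeding the result back into the average power equation returns the target power, which is a convenient sanity check for any chosen values.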
+
+However, in this demonstration we ignore three aspects:
+
+ * The static leakage is not defined here, we can introduce it in the
+ equation but assuming it will be zero most of the time as it is
+ difficult to get the values from the SoC vendors
+
+ * The idle state wake up latency (or entry + exit latency) is not
+ taken into account, it must be added in the equation in order to
+ rigorously compute the idle injection
+
+ * The injected idle duration must be greater than the idle state
+ target residency, otherwise we end up consuming more energy and
+ potentially invert the mitigation effect
+
+So the final equation is::
+
+ Trunning = (Tidle - Twakeup ) x
+ (P(opp)target / ((P(opp)dyn + P(opp)static ) - P(opp)target))
diff --git a/Documentation/driver-api/thermal/exynos_thermal.rst b/Documentation/driver-api/thermal/exynos_thermal.rst
index 5bd556566c70..764df4ab584d 100644
--- a/Documentation/driver-api/thermal/exynos_thermal.rst
+++ b/Documentation/driver-api/thermal/exynos_thermal.rst
@@ -4,7 +4,7 @@ Kernel driver exynos_tmu
Supported chips:
-* ARM SAMSUNG EXYNOS4, EXYNOS5 series of SoC
+* ARM Samsung Exynos4, Exynos5 series of SoC
Datasheet: Not publicly available
@@ -14,7 +14,7 @@ Authors: Amit Daniel <amit.daniel@samsung.com>
TMU controller Description:
---------------------------
-This driver allows to read temperature inside SAMSUNG EXYNOS4/5 series of SoC.
+This driver allows to read temperature inside Samsung Exynos4/5 series of SoC.
The chip only exposes the measured 8-bit temperature code value
through a register.
@@ -43,7 +43,7 @@ The three equations are:
Trimming info for 85 degree Celsius (stored at TRIMINFO register)
Temperature code measured at 85 degree Celsius which is unchanged
-TMU(Thermal Management Unit) in EXYNOS4/5 generates interrupt
+TMU(Thermal Management Unit) in Exynos4/5 generates interrupt
when temperature exceeds pre-defined levels.
The maximum number of configurable threshold is five.
The threshold levels are defined as follows::
@@ -67,7 +67,7 @@ TMU driver description:
The exynos thermal driver is structured as::
Kernel Core thermal framework
- (thermal_core.c, step_wise.c, cpu_cooling.c)
+ (thermal_core.c, step_wise.c, cpufreq_cooling.c)
^
|
|
diff --git a/Documentation/driver-api/thermal/index.rst b/Documentation/driver-api/thermal/index.rst
index 5ba61d19c6ae..a886028014ab 100644
--- a/Documentation/driver-api/thermal/index.rst
+++ b/Documentation/driver-api/thermal/index.rst
@@ -8,11 +8,12 @@ Thermal
:maxdepth: 1
cpu-cooling-api
+ cpu-idle-cooling
sysfs-api
power_allocator
exynos_thermal
exynos_thermal_emulation
- intel_powerclamp
nouveau_thermal
x86_pkg_temperature_thermal
+ intel_dptf
diff --git a/Documentation/driver-api/thermal/intel_dptf.rst b/Documentation/driver-api/thermal/intel_dptf.rst
new file mode 100644
index 000000000000..8fb8c5b2d685
--- /dev/null
+++ b/Documentation/driver-api/thermal/intel_dptf.rst
@@ -0,0 +1,381 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===============================================================
+Intel(R) Dynamic Platform and Thermal Framework Sysfs Interface
+===============================================================
+
+:Copyright: © 2022 Intel Corporation
+
+:Author: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
+
+Introduction
+------------
+
+Intel(R) Dynamic Platform and Thermal Framework (DPTF) is a platform
+level hardware/software solution for power and thermal management.
+
+As a container for multiple power/thermal technologies, DPTF provides
+a coordinated approach for different policies to affect the hardware
+state of a system.
+
+Since it is a platform level framework, it has several components.
+Some parts of the technology are implemented in the firmware and use
+ACPI and PCI devices to expose various features for monitoring and
+control. Linux has a set of kernel drivers exposing the hardware interface
+to user space. This allows user space thermal solutions like
+"Linux Thermal Daemon" to read platform specific thermal and power
+tables to deliver adequate performance while keeping the system under
+thermal limits.
+
+DPTF ACPI Drivers interface
+----------------------------
+
+:file:`/sys/bus/platform/devices/<N>/uuids`, where <N>
+=INT3400|INTC1040|INTC1041|INTC10A0
+
+``available_uuids`` (RO)
+ A set of UUID strings representing the available policies,
+ which should be notified to the firmware when
+ user space can support those policies.
+
+ UUID strings:
+
+ "42A441D6-AE6A-462b-A84B-4A8CE79027D3" : Passive 1
+
+ "3A95C389-E4B8-4629-A526-C52C88626BAE" : Active
+
+ "97C68AE7-15FA-499c-B8C9-5DA81D606E0A" : Critical
+
+ "63BE270F-1C11-48FD-A6F7-3AF253FF3E2D" : Adaptive performance
+
+ "5349962F-71E6-431D-9AE8-0A635B710AEE" : Emergency call
+
+ "9E04115A-AE87-4D1C-9500-0F3E340BFE75" : Passive 2
+
+ "F5A35014-C209-46A4-993A-EB56DE7530A1" : Power Boss
+
+ "6ED722A7-9240-48A5-B479-31EEF723D7CF" : Virtual Sensor
+
+ "16CAF1B7-DD38-40ED-B1C1-1B8A1913D531" : Cooling mode
+
+ "BE84BABF-C4D4-403D-B495-3128FD44dAC1" : HDC
+
+``current_uuid`` (RW)
+ User space can write strings from available UUIDs, one at a
+ time.
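
To make the UUID table above easier to use from a script, the sketch below maps the contents of ``available_uuids`` to policy names. It assumes, for illustration only, that the attribute lists one UUID per line; the function name is hypothetical, and the comparison is made case-insensitive because the table mixes upper and lower case.

```python
# Policy names keyed by UUID, copied from the table above.
_POLICIES = {
    "42A441D6-AE6A-462b-A84B-4A8CE79027D3": "Passive 1",
    "3A95C389-E4B8-4629-A526-C52C88626BAE": "Active",
    "97C68AE7-15FA-499c-B8C9-5DA81D606E0A": "Critical",
    "63BE270F-1C11-48FD-A6F7-3AF253FF3E2D": "Adaptive performance",
    "5349962F-71E6-431D-9AE8-0A635B710AEE": "Emergency call",
    "9E04115A-AE87-4D1C-9500-0F3E340BFE75": "Passive 2",
    "F5A35014-C209-46A4-993A-EB56DE7530A1": "Power Boss",
    "6ED722A7-9240-48A5-B479-31EEF723D7CF": "Virtual Sensor",
    "16CAF1B7-DD38-40ED-B1C1-1B8A1913D531": "Cooling mode",
    "BE84BABF-C4D4-403D-B495-3128FD44dAC1": "HDC",
}
# Normalize the keys so that lookups do not depend on letter case.
POLICY_NAMES = {uuid.upper(): name for uuid, name in _POLICIES.items()}


def describe_uuids(text: str) -> list[tuple[str, str]]:
    """Pair each UUID found in `text` (assumed one per line) with its policy name."""
    result = []
    for line in text.splitlines():
        uuid = line.strip()
        if uuid:
            result.append((uuid, POLICY_NAMES.get(uuid.upper(), "unknown")))
    return result


# Sample text standing in for the contents of available_uuids.
sample = "3A95C389-E4B8-4629-A526-C52C88626BAE\n42A441D6-AE6A-462b-A84B-4A8CE79027D3\n"
for uuid, name in describe_uuids(sample):
    print(uuid, name)
```

In practice the text would be read from the ``available_uuids`` attribute of the platform device, and one of the listed strings would then be written to ``current_uuid``.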
+
+:file:`/sys/bus/platform/devices/<N>/`, where <N>
+=INT3400|INTC1040|INTC1041|INTC10A0
+
+``imok`` (WO)
+ A user space daemon writes 1 to respond to a firmware event
+ requesting a keep-alive notification. User space receives a
+ THERMAL_EVENT_KEEP_ALIVE kobject uevent notification when
+ firmware calls for user space to respond with the imok ACPI
+ method.
+
+``odvp*`` (RO)
+ Firmware thermal status variable values. Thermal tables
+ call for different processing based on these variable
+ values.
+
+``data_vault`` (RO)
+ Binary thermal table. Refer to
+ https://github.com/intel/thermal_daemon for decoding
+ thermal table.
+
+``production_mode`` (RO)
+ When different from zero, the manufacturer has locked the thermal
+ configuration against further changes.
+
+ACPI Thermal Relationship table interface
+------------------------------------------
+
+:file:`/dev/acpi_thermal_rel`
+
+ This device provides IOCTL interface to read standard ACPI
+ thermal relationship tables via ACPI methods _TRT and _ART.
+ These IOCTLs are defined in
+ drivers/thermal/intel/int340x_thermal/acpi_thermal_rel.h
+
+ IOCTLs:
+
+ ACPI_THERMAL_GET_TRT_LEN: Get length of TRT table
+
+ ACPI_THERMAL_GET_ART_LEN: Get length of ART table
+
+ ACPI_THERMAL_GET_TRT_COUNT: Number of records in TRT table
+
+ ACPI_THERMAL_GET_ART_COUNT: Number of records in ART table
+
+ ACPI_THERMAL_GET_TRT: Read binary TRT table, length to read is
+ provided via argument to ioctl().
+
+ ACPI_THERMAL_GET_ART: Read binary ART table, length to read is
+ provided via argument to ioctl().
+
+DPTF ACPI Sensor drivers
+-------------------------
+
+DPTF Sensor drivers are presented as standard thermal sysfs thermal_zone.
+
+
+DPTF ACPI Cooling drivers
+--------------------------
+
+DPTF cooling drivers are presented as standard thermal sysfs cooling_device.
+
+
+DPTF Processor thermal PCI Driver interface
+--------------------------------------------
+
+:file:`/sys/bus/pci/devices/0000\:00\:04.0/power_limits/`
+
+Refer to Documentation/power/powercap/powercap.rst for powercap
+ABI.
+
+``power_limit_0_max_uw`` (RO)
+ Maximum powercap sysfs constraint_0_power_limit_uw for Intel RAPL
+
+``power_limit_0_step_uw`` (RO)
+ Power limit increment/decrement step for Intel RAPL constraint 0 power limit
+
+``power_limit_0_min_uw`` (RO)
+ Minimum powercap sysfs constraint_0_power_limit_uw for Intel RAPL
+
+``power_limit_0_tmin_us`` (RO)
+ Minimum powercap sysfs constraint_0_time_window_us for Intel RAPL
+
+``power_limit_0_tmax_us`` (RO)
+ Maximum powercap sysfs constraint_0_time_window_us for Intel RAPL
+
+``power_limit_1_max_uw`` (RO)
+ Maximum powercap sysfs constraint_1_power_limit_uw for Intel RAPL
+
+``power_limit_1_step_uw`` (RO)
+ Power limit increment/decrement step for Intel RAPL constraint 1 power limit
+
+``power_limit_1_min_uw`` (RO)
+ Minimum powercap sysfs constraint_1_power_limit_uw for Intel RAPL
+
+``power_limit_1_tmin_us`` (RO)
+ Minimum powercap sysfs constraint_1_time_window_us for Intel RAPL
+
+``power_limit_1_tmax_us`` (RO)
+ Maximum powercap sysfs constraint_1_time_window_us for Intel RAPL
+
+``power_floor_status`` (RO)
+ When set to 1, the power floor of the system in the current
+ configuration has been reached. It needs to be reconfigured to allow
+ power to be reduced any further.
+
+``power_floor_enable`` (RW)
+ When set to 1, enable reading and notification of the power floor
+ status. Notifications are triggered for the power_floor_status
+ attribute value changes.
+
+:file:`/sys/bus/pci/devices/0000\:00\:04.0/`
+
+``tcc_offset_degree_celsius`` (RW)
+ TCC offset from the critical temperature where hardware will throttle
+ CPU.
+
+:file:`/sys/bus/pci/devices/0000\:00\:04.0/workload_request`
+
+``workload_available_types`` (RO)
+ Available workload types. User space can specify which of the workload
+ types it is currently executing via workload_type. For example: idle,
+ bursty, sustained etc.
+
+``workload_type`` (RW)
+ User space can specify any one of the available workload types using
+ this interface.
+
+DPTF Processor thermal RFIM interface
+--------------------------------------------
+
+RFIM interface allows adjustment of FIVR (Fully Integrated Voltage Regulator),
+DDR (Double Data Rate) and DLVR (Digital Linear Voltage Regulator)
+frequencies to avoid RF interference with WiFi and 5G.
+
+Switching voltage regulators (VR) generate radiated EMI or RFI at the
+fundamental frequency and its harmonics. Some harmonics may interfere
+with very sensitive wireless receivers such as Wi-Fi and cellular that
+are integrated into host systems like notebook PCs. One mitigation
+method is to shift the SoC integrated VR (IVR) switching frequency by a
+small percentage, moving the switching noise harmonics away from
+radio channels. OEMs or ODMs can use the driver to control SoC IVR
+operation within the range where it does not impact IVR performance.
+
+Some products use DLVR instead of FIVR as switching voltage regulator.
+In this case attributes of DLVR must be adjusted instead of FIVR.
+
+While shifting the frequencies additional clock noise can be introduced,
+which is compensated by adjusting Spread spectrum percent. This helps
+to reduce the clock noise to meet regulatory compliance. This spreading
+% increases bandwidth of signal transmission and hence reduces the
+effects of interference, noise and signal fading.
+
+DRAM devices of DDR IO interface and their power plane can generate EMI
+at the data rates. Similar to IVR control mechanism, Intel offers a
+mechanism by which DDR data rates can be changed if several conditions
+are met: there is strong RFI interference because of DDR; CPU power
+management has no other restriction in changing DDR data rates;
+PC ODMs enable this feature (real time DDR RFI Mitigation referred to as
+DDR-RFIM) for Wi-Fi from BIOS.
+
+
+FIVR attributes
+
+:file:`/sys/bus/pci/devices/0000\:00\:04.0/fivr/`
+
+``vco_ref_code_lo`` (RW)
+ The VCO reference code is an 11-bit field and controls the FIVR
+ switching frequency. This is the 3-bit LSB field.
+
+``vco_ref_code_hi`` (RW)
+ The VCO reference code is an 11-bit field and controls the FIVR
+ switching frequency. This is the 8-bit MSB field.
+
+``spread_spectrum_pct`` (RW)
+ Set the FIVR spread spectrum clocking percentage
+
+``spread_spectrum_clk_enable`` (RW)
+ Enable/disable of the FIVR spread spectrum clocking feature
+
+``rfi_vco_ref_code`` (RO)
+ This field is a read only status register which reflects the
+ current FIVR switching frequency
+
+``fivr_fffc_rev`` (RO)
+ This field indicates the revision of the FIVR HW.
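
Since the VCO reference code is split across two attributes, a short sketch may help show how the 8-bit MSB field and the 3-bit LSB field combine into the 11-bit value. The function names are hypothetical and the code only illustrates the bit layout described above; it does not touch sysfs.

```python
def vco_ref_code(hi: int, lo: int) -> int:
    """Combine vco_ref_code_hi (8-bit MSB) and vco_ref_code_lo (3-bit LSB)."""
    if not 0 <= hi <= 0xFF:
        raise ValueError("hi must fit in 8 bits")
    if not 0 <= lo <= 0x7:
        raise ValueError("lo must fit in 3 bits")
    return (hi << 3) | lo


def split_vco_ref_code(code: int) -> tuple[int, int]:
    """Split an 11-bit VCO reference code into its (hi, lo) fields."""
    if not 0 <= code <= 0x7FF:
        raise ValueError("code must fit in 11 bits")
    return code >> 3, code & 0x7


print(hex(vco_ref_code(0xFF, 0x7)))
print(split_vco_ref_code(0x123))
```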
+
+
+DVFS attributes
+
+:file:`/sys/bus/pci/devices/0000\:00\:04.0/dvfs/`
+
+``rfi_restriction_run_busy`` (RW)
+ Request the restriction of a specific DDR data rate by setting this
+ value to 1. Self-resets to 0 after the operation.
+
+``rfi_restriction_err_code`` (RW)
+ 0: Request is accepted, 1: Feature disabled,
+ 2: The request restricts more points than is allowed
+
+``rfi_restriction_data_rate_Delta`` (RW)
+ Restricted DDR data rate for RFI protection: Lower Limit
+
+``rfi_restriction_data_rate_Base`` (RW)
+ Restricted DDR data rate for RFI protection: Upper Limit
+
+``ddr_data_rate_point_0`` (RO)
+ DDR data rate selection 1st point
+
+``ddr_data_rate_point_1`` (RO)
+ DDR data rate selection 2nd point
+
+``ddr_data_rate_point_2`` (RO)
+ DDR data rate selection 3rd point
+
+``ddr_data_rate_point_3`` (RO)
+ DDR data rate selection 4th point
+
+``rfi_disable`` (RW)
+ Disable DDR rate change feature
+
+DLVR attributes
+
+:file:`/sys/bus/pci/devices/0000\:00\:04.0/dlvr/`
+
+``dlvr_hardware_rev`` (RO)
+ DLVR hardware revision.
+
+``dlvr_freq_mhz`` (RO)
+ Current DLVR PLL frequency in MHz.
+
+``dlvr_freq_select`` (RW)
+ Sets DLVR PLL clock frequency. Once set, and enabled via
+ dlvr_rfim_enable, the dlvr_freq_mhz will show the current
+ DLVR PLL frequency.
+
+``dlvr_pll_busy`` (RO)
+ PLL can't accept frequency change when set.
+
+``dlvr_rfim_enable`` (RW)
+ 0: Disable RF frequency hopping, 1: Enable RF frequency hopping.
+
+``dlvr_spread_spectrum_pct`` (RW)
+ Sets DLVR spread spectrum percent value.
+
+``dlvr_control_mode`` (RW)
+ Specifies how frequencies are spread using spread spectrum.
+ 0: Down spread,
+ 1: Spread in the Center.
+
+``dlvr_control_lock`` (RW)
+ 1: future writes are ignored.
+
+DPTF Power supply and Battery Interface
+----------------------------------------
+
+Refer to Documentation/ABI/testing/sysfs-platform-dptf
+
+DPTF Fan Control
+----------------------------------------
+
+Refer to Documentation/admin-guide/acpi/fan_performance_states.rst
+
+Workload Type Hints
+----------------------------------------
+
+The firmware in Meteor Lake processor generation is capable of identifying
+workload type and passing hints regarding it to the OS. A special sysfs
+interface is provided to allow user space to obtain workload type hints from
+the firmware and control the rate at which they are provided.
+
+User space can poll attribute "workload_type_index" for the current hint or
+can receive a notification whenever the value of this attribute is updated.
+
+:file:`/sys/bus/pci/devices/0000:00:04.0/workload_hint/`
+Segment 0, bus 0, device 4, function 0 is reserved for the processor thermal
+device on all Intel client processors. So, the above path doesn't change
+based on the processor generation.
+
+``workload_hint_enable`` (RW)
+ Enable firmware to send workload type hints to user space.
+
+``notification_delay_ms`` (RW)
+ Minimum delay in milliseconds before firmware will notify OS. This is
+ for the rate control of notifications. This delay is between changing
+ the workload type prediction in the firmware and notifying the OS about
+ the change. The default delay is 1024 ms. The delay of 0 is invalid.
+ The delay is rounded up to the nearest power of 2 to simplify firmware
+ programming of the delay value. The read of notification_delay_ms
+ attribute shows the effective value used.
+
+``workload_type_index`` (RO)
+ Predicted workload type index. User space can get notification of
+ change via existing sysfs attribute change notification mechanism.
+
+ The supported index values and their meaning for the Meteor Lake
+ processor generation are as follows:
+
+ 0 - Idle: System performs no tasks, power and idle residency are
+ consistently low for long periods of time.
+
+ 1 - Battery Life: Power is relatively low, but the processor may
+ still be actively performing a task, such as video playback for
+ a long period of time.
+
+ 2 - Sustained: Power level that is relatively high for a long period
+ of time, with very few to no periods of idleness, which will
+ eventually exhaust RAPL Power Limit 1 and 2.
+
+ 3 - Bursty: Consumes a relatively constant average amount of power, but
+ periods of relative idleness are interrupted by bursts of
+ activity. The bursts are relatively short and the periods of
+ relative idleness between them typically prevent RAPL Power
+ Limit 1 from being exhausted.
+
+ 4 - Unknown: Can't classify.
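
The two numeric conventions of the workload hint interface can be sketched as follows: the index-to-name mapping listed above for Meteor Lake, and the rounding of the notification delay up to a power of 2. This is an illustration only. The function name is hypothetical, the rounding is a plain reading of the description rather than the firmware's actual logic, and the value read back from ``notification_delay_ms`` remains the authoritative one.

```python
# Index values documented above for the Meteor Lake processor generation.
WORKLOAD_TYPES = {
    0: "Idle",
    1: "Battery Life",
    2: "Sustained",
    3: "Bursty",
    4: "Unknown",
}


def effective_delay_ms(requested_ms: int) -> int:
    """Round a requested delay up to a power of 2 (a delay of 0 is invalid)."""
    if requested_ms <= 0:
        raise ValueError("the delay must be greater than 0")
    # (n - 1).bit_length() is the exponent of the smallest power of 2 >= n.
    return 1 << (requested_ms - 1).bit_length()


print(effective_delay_ms(1000))
print(WORKLOAD_TYPES.get(3, "Unknown"))
```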
diff --git a/Documentation/driver-api/thermal/intel_powerclamp.rst b/Documentation/driver-api/thermal/intel_powerclamp.rst
deleted file mode 100644
index 3f6dfb0b3ea6..000000000000
--- a/Documentation/driver-api/thermal/intel_powerclamp.rst
+++ /dev/null
@@ -1,320 +0,0 @@
-=======================
-Intel Powerclamp Driver
-=======================
-
-By:
- - Arjan van de Ven <arjan@linux.intel.com>
- - Jacob Pan <jacob.jun.pan@linux.intel.com>
-
-.. Contents:
-
- (*) Introduction
- - Goals and Objectives
-
- (*) Theory of Operation
- - Idle Injection
- - Calibration
-
- (*) Performance Analysis
- - Effectiveness and Limitations
- - Power vs Performance
- - Scalability
- - Calibration
- - Comparison with Alternative Techniques
-
- (*) Usage and Interfaces
- - Generic Thermal Layer (sysfs)
- - Kernel APIs (TBD)
-
-INTRODUCTION
-============
-
-Consider the situation where a system’s power consumption must be
-reduced at runtime, due to power budget, thermal constraint, or noise
-level, and where active cooling is not preferred. Software managed
-passive power reduction must be performed to prevent the hardware
-actions that are designed for catastrophic scenarios.
-
-Currently, P-states, T-states (clock modulation), and CPU offlining
-are used for CPU throttling.
-
-On Intel CPUs, C-states provide effective power reduction, but so far
-they’re only used opportunistically, based on workload. With the
-development of intel_powerclamp driver, the method of synchronizing
-idle injection across all online CPU threads was introduced. The goal
-is to achieve forced and controllable C-state residency.
-
-Test/Analysis has been made in the areas of power, performance,
-scalability, and user experience. In many cases, clear advantage is
-shown over taking the CPU offline or modulating the CPU clock.
-
-
-THEORY OF OPERATION
-===================
-
-Idle Injection
---------------
-
-On modern Intel processors (Nehalem or later), package level C-state
-residency is available in MSRs, thus also available to the kernel.
-
-These MSRs are::
-
- #define MSR_PKG_C2_RESIDENCY 0x60D
- #define MSR_PKG_C3_RESIDENCY 0x3F8
- #define MSR_PKG_C6_RESIDENCY 0x3F9
- #define MSR_PKG_C7_RESIDENCY 0x3FA
-
-If the kernel can also inject idle time to the system, then a
-closed-loop control system can be established that manages package
-level C-state. The intel_powerclamp driver is conceived as such a
-control system, where the target set point is a user-selected idle
-ratio (based on power reduction), and the error is the difference
-between the actual package level C-state residency ratio and the target idle
-ratio.
-
-Injection is controlled by high priority kernel threads, spawned for
-each online CPU.
-
-These kernel threads, with SCHED_FIFO class, are created to perform
-clamping actions of controlled duty ratio and duration. Each per-CPU
-thread synchronizes its idle time and duration, based on the rounding
-of jiffies, so accumulated errors can be prevented to avoid a jittery
-effect. Threads are also bound to the CPU such that they cannot be
-migrated, unless the CPU is taken offline. In this case, threads
-belong to the offlined CPUs will be terminated immediately.
-
-Running as SCHED_FIFO and relatively high priority, also allows such
-scheme to work for both preemptable and non-preemptable kernels.
-Alignment of idle time around jiffies ensures scalability for HZ
-values. This effect can be better visualized using a Perf timechart.
-The following diagram shows the behavior of kernel thread
-kidle_inject/cpu. During idle injection, it runs monitor/mwait idle
-for a given "duration", then relinquishes the CPU to other tasks,
-until the next time interval.
-
-The NOHZ schedule tick is disabled during idle time, but interrupts
-are not masked. Tests show that the extra wakeups from scheduler tick
-have a dramatic impact on the effectiveness of the powerclamp driver
-on large scale systems (Westmere system with 80 processors).
-
-::
-
- CPU0
- ____________ ____________
- kidle_inject/0 | sleep | mwait | sleep |
- _________| |________| |_______
- duration
- CPU1
- ____________ ____________
- kidle_inject/1 | sleep | mwait | sleep |
- _________| |________| |_______
- ^
- |
- |
- roundup(jiffies, interval)
-
-Only one CPU is allowed to collect statistics and update global
-control parameters. This CPU is referred to as the controlling CPU in
-this document. The controlling CPU is elected at runtime, with a
-policy that favors BSP, taking into account the possibility of a CPU
-hot-plug.
-
-In terms of dynamics of the idle control system, package level idle
-time is considered largely as a non-causal system where its behavior
-cannot be based on the past or current input. Therefore, the
-intel_powerclamp driver attempts to enforce the desired idle time
-instantly as given input (target idle ratio). After injection,
-powerclamp monitors the actual idle for a given time window and adjust
-the next injection accordingly to avoid over/under correction.
-
-When used in a causal control system, such as a temperature control,
-it is up to the user of this driver to implement algorithms where
-past samples and outputs are included in the feedback. For example, a
-PID-based thermal controller can use the powerclamp driver to
-maintain a desired target temperature, based on integral and
-derivative gains of the past samples.
-
-
-
-Calibration
------------
-During scalability testing, it is observed that synchronized actions
-among CPUs become challenging as the number of cores grows. This is
-also true for the ability of a system to enter package level C-states.
-
-To make sure the intel_powerclamp driver scales well, online
-calibration is implemented. The goals for doing such a calibration
-are:
-
-a) determine the effective range of idle injection ratio
-b) determine the amount of compensation needed at each target ratio
-
-Compensation to each target ratio consists of two parts:
-
- a) steady state error compensation
- This is to offset the error occurring when the system can
- enter idle without extra wakeups (such as external interrupts).
-
- b) dynamic error compensation
- When an excessive amount of wakeups occurs during idle, an
- additional idle ratio can be added to quiet interrupts, by
- slowing down CPU activities.
-
-A debugfs file is provided for the user to examine compensation
-progress and results, such as on a Westmere system::
-
- [jacob@nex01 ~]$ cat
- /sys/kernel/debug/intel_powerclamp/powerclamp_calib
- controlling cpu: 0
- pct confidence steady dynamic (compensation)
- 0 0 0 0
- 1 1 0 0
- 2 1 1 0
- 3 3 1 0
- 4 3 1 0
- 5 3 1 0
- 6 3 1 0
- 7 3 1 0
- 8 3 1 0
- ...
- 30 3 2 0
- 31 3 2 0
- 32 3 1 0
- 33 3 2 0
- 34 3 1 0
- 35 3 2 0
- 36 3 1 0
- 37 3 2 0
- 38 3 1 0
- 39 3 2 0
- 40 3 3 0
- 41 3 1 0
- 42 3 2 0
- 43 3 1 0
- 44 3 1 0
- 45 3 2 0
- 46 3 3 0
- 47 3 0 0
- 48 3 2 0
- 49 3 3 0
-
-Calibration occurs during runtime. No offline method is available.
-Steady state compensation is used only when confidence levels of all
-adjacent ratios have reached satisfactory level. A confidence level
-is accumulated based on clean data collected at runtime. Data
-collected during a period without extra interrupts is considered
-clean.
-
-To compensate for excessive amounts of wakeup during idle, additional
-idle time is injected when such a condition is detected. Currently,
-we have a simple algorithm to double the injection ratio. A possible
-enhancement might be to throttle the offending IRQ, such as delaying
-EOI for level triggered interrupts. But it is a challenge to be
-non-intrusive to the scheduler or the IRQ core code.
-
-
-CPU Online/Offline
-------------------
-Per-CPU kernel threads are started/stopped upon receiving
-notifications of CPU hotplug activities. The intel_powerclamp driver
-keeps track of clamping kernel threads, even after they are migrated
-to other CPUs, after a CPU offline event.
-
-
-Performance Analysis
-====================
-This section describes the general performance data collected on
-multiple systems, including Westmere (80P) and Ivy Bridge (4P, 8P).
-
-Effectiveness and Limitations
------------------------------
-The maximum range that idle injection is allowed is capped at 50
-percent. As mentioned earlier, since interrupts are allowed during
-forced idle time, excessive interrupts could result in less
-effectiveness. The extreme case would be doing a ping -f to generated
-flooded network interrupts without much CPU acknowledgement. In this
-case, little can be done from the idle injection threads. In most
-normal cases, such as scp a large file, applications can be throttled
-by the powerclamp driver, since slowing down the CPU also slows down
-network protocol processing, which in turn reduces interrupts.
-
-When control parameters change at runtime by the controlling CPU, it
-may take an additional period for the rest of the CPUs to catch up
-with the changes. During this time, idle injection is out of sync,
-thus not able to enter package C- states at the expected ratio. But
-this effect is minor, in that in most cases change to the target
-ratio is updated much less frequently than the idle injection
-frequency.
-
-Scalability
------------
-Tests also show a minor, but measurable, difference between the 4P/8P
-Ivy Bridge system and the 80P Westmere server under 50% idle ratio.
-More compensation is needed on Westmere for the same amount of
-target idle ratio. The compensation also increases as the idle ratio
-gets larger. The above reason constitutes the need for the
-calibration code.
-
-On the IVB 8P system, compared to an offline CPU, powerclamp can
-achieve up to 40% better performance per watt. (measured by a spin
-counter summed over per CPU counting threads spawned for all running
-CPUs).
-
-Usage and Interfaces
-====================
-The powerclamp driver is registered to the generic thermal layer as a
-cooling device. Currently, it’s not bound to any thermal zones::
-
- jacob@chromoly:/sys/class/thermal/cooling_device14$ grep . *
- cur_state:0
- max_state:50
- type:intel_powerclamp
-
-cur_state allows user to set the desired idle percentage. Writing 0 to
-cur_state will stop idle injection. Writing a value between 1 and
-max_state will start the idle injection. Reading cur_state returns the
-actual and current idle percentage. This may not be the same value
-set by the user in that current idle percentage depends on workload
-and includes natural idle. When idle injection is disabled, reading
-cur_state returns value -1 instead of 0 which is to avoid confusing
-100% busy state with the disabled state.
-
-Example usage:
-- To inject 25% idle time::
-
- $ sudo sh -c "echo 25 > /sys/class/thermal/cooling_device80/cur_state
-
-If the system is not busy and has more than 25% idle time already,
-then the powerclamp driver will not start idle injection. Using Top
-will not show idle injection kernel threads.
-
-If the system is busy (spin test below) and has less than 25% natural
-idle time, powerclamp kernel threads will do idle injection. Forced
-idle time is accounted as normal idle in that common code path is
-taken as the idle task.
-
-In this example, 24.1% idle is shown. This helps the system admin or
-user determine the cause of slowdown, when a powerclamp driver is in action::
-
-
- Tasks: 197 total, 1 running, 196 sleeping, 0 stopped, 0 zombie
- Cpu(s): 71.2%us, 4.7%sy, 0.0%ni, 24.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
- Mem: 3943228k total, 1689632k used, 2253596k free, 74960k buffers
- Swap: 4087804k total, 0k used, 4087804k free, 945336k cached
-
- PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
- 3352 jacob 20 0 262m 644 428 S 286 0.0 0:17.16 spin
- 3341 root -51 0 0 0 0 D 25 0.0 0:01.62 kidle_inject/0
- 3344 root -51 0 0 0 0 D 25 0.0 0:01.60 kidle_inject/3
- 3342 root -51 0 0 0 0 D 25 0.0 0:01.61 kidle_inject/1
- 3343 root -51 0 0 0 0 D 25 0.0 0:01.60 kidle_inject/2
- 2935 jacob 20 0 696m 125m 35m S 5 3.3 0:31.11 firefox
- 1546 root 20 0 158m 20m 6640 S 3 0.5 0:26.97 Xorg
- 2100 jacob 20 0 1223m 88m 30m S 3 2.3 0:23.68 compiz
-
-Tests have shown that by using the powerclamp driver as a cooling
-device, a PID based userspace thermal controller can manage to
-control CPU temperature effectively, when no other thermal influence
-is added. For example, a UltraBook user can compile the kernel under
-certain temperature (below most active trip points).
diff --git a/Documentation/driver-api/thermal/nouveau_thermal.rst b/Documentation/driver-api/thermal/nouveau_thermal.rst
index 37255fd6735d..aa10db6df309 100644
--- a/Documentation/driver-api/thermal/nouveau_thermal.rst
+++ b/Documentation/driver-api/thermal/nouveau_thermal.rst
@@ -90,7 +90,7 @@ Bug reports
-----------
Thermal management on Nouveau is new and may not work on all cards. If you have
-inquiries, please ping mupuf on IRC (#nouveau, freenode).
+inquiries, please ping mupuf on IRC (#nouveau, OFTC).
Bug reports should be filled on Freedesktop's bug tracker. Please follow
-http://nouveau.freedesktop.org/wiki/Bugs
+https://nouveau.freedesktop.org/wiki/Bugs
diff --git a/Documentation/driver-api/thermal/power_allocator.rst b/Documentation/driver-api/thermal/power_allocator.rst
index 67b6a3297238..aa5f66552d6f 100644
--- a/Documentation/driver-api/thermal/power_allocator.rst
+++ b/Documentation/driver-api/thermal/power_allocator.rst
@@ -71,7 +71,9 @@ to the speed-grade of the silicon. `sustainable_power` is therefore
simply an estimate, and may be tuned to affect the aggressiveness of
the thermal ramp. For reference, the sustainable power of a 4" phone
is typically 2000mW, while on a 10" tablet is around 4500mW (may vary
-depending on screen size).
+depending on screen size). It is possible to have the power value
+expressed in an abstract scale. The sustained power should be aligned
+to the scale used by the related cooling devices.
If you are using device tree, do add it as a property of the
thermal-zone. For example::
@@ -269,3 +271,11 @@ won't be very good. Note that this is not particular to this
governor, step-wise will also misbehave if you call its throttle()
faster than the normal thermal framework tick (due to interrupts for
example) as it will overreact.
+
+Energy Model requirements
+=========================
+
+Another important thing is the consistent scale of the power values
+provided by the cooling devices. All of the cooling devices in a single
+thermal zone should have power values reported either in milli-Watts
+or scaled to the same 'abstract scale'.
diff --git a/Documentation/driver-api/thermal/sysfs-api.rst b/Documentation/driver-api/thermal/sysfs-api.rst
index b40b1f839148..6c1175c6afba 100644
--- a/Documentation/driver-api/thermal/sysfs-api.rst
+++ b/Documentation/driver-api/thermal/sysfs-api.rst
@@ -54,7 +54,7 @@ temperature) and throttle appropriate devices.
trips:
the total number of trip points this thermal zone supports.
mask:
- Bit string: If 'n'th bit is set, then trip point 'n' is writeable.
+ Bit string: If 'n'th bit is set, then trip point 'n' is writable.
devdata:
device private data
ops:
@@ -306,42 +306,6 @@ temperature) and throttle appropriate devices.
::
- struct thermal_bind_params
-
- This structure defines the following parameters that are used to bind
- a zone with a cooling device for a particular trip point.
-
- .cdev:
- The cooling device pointer
- .weight:
- The 'influence' of a particular cooling device on this
- zone. This is relative to the rest of the cooling
- devices. For example, if all cooling devices have a
- weight of 1, then they all contribute the same. You can
- use percentages if you want, but it's not mandatory. A
- weight of 0 means that this cooling device doesn't
- contribute to the cooling of this zone unless all cooling
- devices have a weight of 0. If all weights are 0, then
- they all contribute the same.
- .trip_mask:
- This is a bit mask that gives the binding relation between
- this thermal zone and cdev, for a particular trip point.
- If nth bit is set, then the cdev and thermal zone are bound
- for trip point n.
- .binding_limits:
- This is an array of cooling state limits. Must have
- exactly 2 * thermal_zone.number_of_trip_points. It is an
- array consisting of tuples <lower-state upper-state> of
- state limits. Each trip will be associated with one state
- limit tuple when binding. A NULL pointer means
- <THERMAL_NO_LIMITS THERMAL_NO_LIMITS> on all trips.
- These limits are used when binding a cdev to a trip point.
- .match:
- This call back returns success(0) if the 'tz and cdev' need to
- be bound, as per platform data.
-
- ::
-
struct thermal_zone_params
This structure defines the platform level parameters for a thermal zone.
@@ -357,10 +321,6 @@ temperature) and throttle appropriate devices.
will be created. when no_hwmon == true, nothing will be done.
In case the thermal_zone_params is NULL, the hwmon interface
will be created (for backward compatibility).
- .num_tbps:
- Number of thermal_bind_params entries for this zone
- .tbp:
- thermal_bind_params entries
2. sysfs attributes structure
=============================
@@ -406,7 +366,7 @@ Thermal cooling device sys I/F, created once it's registered::
|---stats/reset: Writing any value resets the statistics
|---stats/time_in_state_ms: Time (msec) spent in various cooling states
|---stats/total_trans: Total number of times cooling state is changed
- |---stats/trans_table: Cooing state transition table
+ |---stats/trans_table: Cooling state transition table
Then next two dynamic attributes are created/removed in pairs. They represent
@@ -428,6 +388,9 @@ of thermal zone device. E.g. the generic thermal driver registers one hwmon
class device and build the associated hwmon sysfs I/F for all the registered
ACPI thermal zones.
+Please read Documentation/ABI/testing/sysfs-class-thermal for thermal
+zone and cooling device attribute details.
+
::
/sys/class/hwmon/hwmon[0-*]:
@@ -437,242 +400,6 @@ ACPI thermal zones.
Please read Documentation/hwmon/sysfs-interface.rst for additional information.
-Thermal zone attributes
------------------------
-
-type
- Strings which represent the thermal zone type.
- This is given by thermal zone driver as part of registration.
- E.g: "acpitz" indicates it's an ACPI thermal device.
- In order to keep it consistent with hwmon sys attribute; this should
- be a short, lowercase string, not containing spaces nor dashes.
- RO, Required
-
-temp
- Current temperature as reported by thermal zone (sensor).
- Unit: millidegree Celsius
- RO, Required
-
-mode
- One of the predefined values in [enabled, disabled].
- This file gives information about the algorithm that is currently
- managing the thermal zone. It can be either default kernel based
- algorithm or user space application.
-
- enabled
- enable Kernel Thermal management.
- disabled
- Preventing kernel thermal zone driver actions upon
- trip points so that user application can take full
- charge of the thermal management.
-
- RW, Optional
-
-policy
- One of the various thermal governors used for a particular zone.
-
- RW, Required
-
-available_policies
- Available thermal governors which can be used for a particular zone.
-
- RO, Required
-
-`trip_point_[0-*]_temp`
- The temperature above which trip point will be fired.
-
- Unit: millidegree Celsius
-
- RO, Optional
-
-`trip_point_[0-*]_type`
- Strings which indicate the type of the trip point.
-
- E.g. it can be one of critical, hot, passive, `active[0-*]` for ACPI
- thermal zone.
-
- RO, Optional
-
-`trip_point_[0-*]_hyst`
- The hysteresis value for a trip point, represented as an integer
- Unit: Celsius
- RW, Optional
-
-`cdev[0-*]`
- Sysfs link to the thermal cooling device node where the sys I/F
- for cooling device throttling control represents.
-
- RO, Optional
-
-`cdev[0-*]_trip_point`
- The trip point in this thermal zone which `cdev[0-*]` is associated
- with; -1 means the cooling device is not associated with any trip
- point.
-
- RO, Optional
-
-`cdev[0-*]_weight`
- The influence of `cdev[0-*]` in this thermal zone. This value
- is relative to the rest of cooling devices in the thermal
- zone. For example, if a cooling device has a weight double
- than that of other, it's twice as effective in cooling the
- thermal zone.
-
- RW, Optional
-
-passive
- Attribute is only present for zones in which the passive cooling
- policy is not supported by native thermal driver. Default is zero
- and can be set to a temperature (in millidegrees) to enable a
- passive trip point for the zone. Activation is done by polling with
- an interval of 1 second.
-
- Unit: millidegrees Celsius
-
- Valid values: 0 (disabled) or greater than 1000
-
- RW, Optional
-
-emul_temp
- Interface to set the emulated temperature method in thermal zone
- (sensor). After setting this temperature, the thermal zone may pass
- this temperature to platform emulation function if registered or
- cache it locally. This is useful in debugging different temperature
- threshold and its associated cooling action. This is write only node
- and writing 0 on this node should disable emulation.
- Unit: millidegree Celsius
-
- WO, Optional
-
- WARNING:
- Be careful while enabling this option on production systems,
- because userland can easily disable the thermal policy by simply
- flooding this sysfs node with low temperature values.
-
-sustainable_power
- An estimate of the sustained power that can be dissipated by
- the thermal zone. Used by the power allocator governor. For
- more information see Documentation/driver-api/thermal/power_allocator.rst
-
- Unit: milliwatts
-
- RW, Optional
-
-k_po
- The proportional term of the power allocator governor's PID
- controller during temperature overshoot. Temperature overshoot
- is when the current temperature is above the "desired
- temperature" trip point. For more information see
- Documentation/driver-api/thermal/power_allocator.rst
-
- RW, Optional
-
-k_pu
- The proportional term of the power allocator governor's PID
- controller during temperature undershoot. Temperature undershoot
- is when the current temperature is below the "desired
- temperature" trip point. For more information see
- Documentation/driver-api/thermal/power_allocator.rst
-
- RW, Optional
-
-k_i
- The integral term of the power allocator governor's PID
- controller. This term allows the PID controller to compensate
- for long term drift. For more information see
- Documentation/driver-api/thermal/power_allocator.rst
-
- RW, Optional
-
-k_d
- The derivative term of the power allocator governor's PID
- controller. For more information see
- Documentation/driver-api/thermal/power_allocator.rst
-
- RW, Optional
-
-integral_cutoff
- Temperature offset from the desired temperature trip point
- above which the integral term of the power allocator
- governor's PID controller starts accumulating errors. For
- example, if integral_cutoff is 0, then the integral term only
- accumulates error when temperature is above the desired
- temperature trip point. For more information see
- Documentation/driver-api/thermal/power_allocator.rst
-
- Unit: millidegree Celsius
-
- RW, Optional
-
-slope
- The slope constant used in a linear extrapolation model
- to determine a hotspot temperature based off the sensor's
- raw readings. It is up to the device driver to determine
- the usage of these values.
-
- RW, Optional
-
-offset
- The offset constant used in a linear extrapolation model
- to determine a hotspot temperature based off the sensor's
- raw readings. It is up to the device driver to determine
- the usage of these values.
-
- RW, Optional
-
-Cooling device attributes
--------------------------
-
-type
- String which represents the type of device, e.g:
-
- - for generic ACPI: should be "Fan", "Processor" or "LCD"
- - for memory controller device on intel_menlow platform:
- should be "Memory controller".
-
- RO, Required
-
-max_state
- The maximum permissible cooling state of this cooling device.
-
- RO, Required
-
-cur_state
- The current cooling state of this cooling device.
- The value can any integer numbers between 0 and max_state:
-
- - cur_state == 0 means no cooling
- - cur_state == max_state means the maximum cooling.
-
- RW, Required
-
-stats/reset
- Writing any value resets the cooling device's statistics.
- WO, Required
-
-stats/time_in_state_ms:
- The amount of time spent by the cooling device in various cooling
- states. The output will have "<state> <time>" pair in each line, which
- will mean this cooling device spent <time> msec of time at <state>.
- Output will have one line for each of the supported states. usertime
- units here is 10mS (similar to other time exported in /proc).
- RO, Required
-
-
-stats/total_trans:
- A single positive value showing the total number of times the state of a
- cooling device is changed.
-
- RO, Required
-
-stats/trans_table:
- This gives fine grained information about all the cooling state
- transitions. The cat output here is a two dimensional matrix, where an
- entry <i,j> (row i, column j) represents the number of transitions from
- State_i to State_j. If the transition table is bigger than PAGE_SIZE,
- reading this will return an -EFBIG error.
- RO, Required
-
3. A simple implementation
==========================
@@ -744,17 +471,7 @@ This function returns the thermal_instance corresponding to a given
{thermal_zone, cooling_device, trip_point} combination. Returns NULL
if such an instance does not exist.
-4.3. thermal_notify_framework
------------------------------
-
-This function handles the trip events from sensor drivers. It starts
-throttling the cooling devices according to the policy configured.
-For CRITICAL and HOT trip points, this notifies the respective drivers,
-and does actual throttling for other trip points i.e ACTIVE and PASSIVE.
-The throttling policy is based on the configured platform data; if no
-platform data is provided, this uses the step_wise throttling policy.
-
-4.4. thermal_cdev_update
+4.3. thermal_cdev_update
------------------------
This function serves as an arbitrator to set the state of a cooling
@@ -764,21 +481,15 @@ possible.
5. thermal_emergency_poweroff
=============================
-On an event of critical trip temperature crossing. Thermal framework
-allows the system to shutdown gracefully by calling orderly_poweroff().
-In the event of a failure of orderly_poweroff() to shut down the system
-we are in danger of keeping the system alive at undesirably high
-temperatures. To mitigate this high risk scenario we program a work
-queue to fire after a pre-determined number of seconds to start
-an emergency shutdown of the device using the kernel_power_off()
-function. In case kernel_power_off() fails then finally
-emergency_restart() is called in the worst case.
+On an event of critical trip temperature crossing the thermal framework
+shuts down the system by calling hw_protection_shutdown(). The
+hw_protection_shutdown() first attempts to perform an orderly shutdown
+but accepts a delay after which it proceeds doing a forced power-off
+or as last resort an emergency_restart.
The delay should be carefully profiled so as to give adequate time for
-orderly_poweroff(). In case of failure of an orderly_poweroff() the
-emergency poweroff kicks in after the delay has elapsed and shuts down
-the system.
+orderly poweroff.
-If set to 0 emergency poweroff will not be supported. So a carefully
-profiled non-zero positive value is a must for emergerncy poweroff to be
-triggered.
+If the delay is set to 0 emergency poweroff will not be supported. So a
+carefully profiled non-zero positive value is a must for emergency
+poweroff to be triggered.