Diffstat (limited to 'Documentation/driver-api/thermal')
-rw-r--r-- | Documentation/driver-api/thermal/cpu-idle-cooling.rst | 199
-rw-r--r-- | Documentation/driver-api/thermal/exynos_thermal.rst | 8
-rw-r--r-- | Documentation/driver-api/thermal/index.rst | 3
-rw-r--r-- | Documentation/driver-api/thermal/intel_dptf.rst | 381
-rw-r--r-- | Documentation/driver-api/thermal/intel_powerclamp.rst | 320
-rw-r--r-- | Documentation/driver-api/thermal/nouveau_thermal.rst | 4
-rw-r--r-- | Documentation/driver-api/thermal/power_allocator.rst | 12
-rw-r--r-- | Documentation/driver-api/thermal/sysfs-api.rst | 319 |
8 files changed, 614 insertions, 632 deletions
diff --git a/Documentation/driver-api/thermal/cpu-idle-cooling.rst b/Documentation/driver-api/thermal/cpu-idle-cooling.rst new file mode 100644 index 000000000000..c2a7ca676853 --- /dev/null +++ b/Documentation/driver-api/thermal/cpu-idle-cooling.rst @@ -0,0 +1,199 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================ +CPU Idle Cooling +================ + +Situation: +---------- + +Under certain circumstances a SoC can reach a critical temperature +limit and is unable to stabilize the temperature around a temperature +control. When the SoC has to stabilize the temperature, the kernel can +act on a cooling device to mitigate the dissipated power. When the +critical temperature is reached, a decision must be taken to reduce +the temperature, that, in turn impacts performance. + +Another situation is when the silicon temperature continues to +increase even after the dynamic leakage is reduced to its minimum by +clock gating the component. This runaway phenomenon can continue due +to the static leakage. The only solution is to power down the +component, thus dropping the dynamic and static leakage that will +allow the component to cool down. + +Last but not least, the system can ask for a specific power budget but +because of the OPP density, we can only choose an OPP with a power +budget lower than the requested one and under-utilize the CPU, thus +losing performance. In other words, one OPP under-utilizes the CPU +with a power less than the requested power budget and the next OPP +exceeds the power budget. An intermediate OPP could have been used if +it were present. + +Solutions: +---------- + +If we can remove the static and the dynamic leakage for a specific +duration in a controlled period, the SoC temperature will +decrease. Acting on the idle state duration or the idle cycle +injection period, we can mitigate the temperature by modulating the +power budget. 
+ +The Operating Performance Point (OPP) density has a great influence on +the control precision of cpufreq, however different vendors have a +plethora of OPP density, and some have large power gap between OPPs, +that will result in loss of performance during thermal control and +loss of power in other scenarios. + +At a specific OPP, we can assume that injecting idle cycle on all CPUs +belong to the same cluster, with a duration greater than the cluster +idle state target residency, we lead to dropping the static and the +dynamic leakage for this period (modulo the energy needed to enter +this state). So the sustainable power with idle cycles has a linear +relation with the OPP’s sustainable power and can be computed with a +coefficient similar to:: + + Power(IdleCycle) = Coef x Power(OPP) + +Idle Injection: +--------------- + +The base concept of the idle injection is to force the CPU to go to an +idle state for a specified time each control cycle, it provides +another way to control CPU power and heat in addition to +cpufreq. Ideally, if all CPUs belonging to the same cluster, inject +their idle cycles synchronously, the cluster can reach its power down +state with a minimum power consumption and reduce the static leakage +to almost zero. However, these idle cycles injection will add extra +latencies as the CPUs will have to wakeup from a deep sleep state. + +We use a fixed duration of idle injection that gives an acceptable +performance penalty and a fixed latency. Mitigation can be increased +or decreased by modulating the duty cycle of the idle injection. + +:: + + ^ + | + | + |------- ------- + |_______|_______________________|_______|___________ + + <------> + idle <----------------------> + running + + <-----------------------------> + duty cycle 25% + + +The implementation of the cooling device bases the number of states on +the duty cycle percentage. When no mitigation is happening the cooling +device state is zero, meaning the duty cycle is 0%. 
+ +When the mitigation begins, depending on the governor's policy, a +starting state is selected. With a fixed idle duration and the duty +cycle (aka the cooling device state), the running duration can be +computed. + +The governor will change the cooling device state thus the duty cycle +and this variation will modulate the cooling effect. + +:: + + ^ + | + | + |------- ------- + |_______|_______________|_______|___________ + + <------> + idle <--------------> + running + + <---------------------> + duty cycle 33% + + + ^ + | + | + |------- ------- + |_______|_______|_______|___________ + + <------> + idle <------> + running + + <-------------> + duty cycle 50% + +The idle injection duration value must comply with the constraints: + +- It is less than or equal to the latency we tolerate when the + mitigation begins. It is platform dependent and will depend on the + user experience, reactivity vs performance trade off we want. This + value should be specified. + +- It is greater than the idle state’s target residency we want to go + for thermal mitigation, otherwise we end up consuming more energy. + +Power considerations +-------------------- + +When we reach the thermal trip point, we have to sustain a specified +power for a specific temperature but at this time we consume:: + + Power = Capacitance x Voltage^2 x Frequency x Utilisation + +... which is more than the sustainable power (or there is something +wrong in the system setup). The ‘Capacitance’ and ‘Utilisation’ are a +fixed value, ‘Voltage’ and the ‘Frequency’ are fixed artificially +because we don’t want to change the OPP. We can group the +‘Capacitance’ and the ‘Utilisation’ into a single term which is the +‘Dynamic Power Coefficient (Cdyn)’ Simplifying the above, we have:: + + Pdyn = Cdyn x Voltage^2 x Frequency + +The power allocator governor will ask us somehow to reduce our power +in order to target the sustainable power defined in the device +tree. 
So with the idle injection mechanism, we want an average power +(Ptarget) resulting in an amount of time running at full power on a +specific OPP and idle another amount of time. That could be put in a +equation:: + + P(opp)target = ((Trunning x (P(opp)running) + (Tidle x P(opp)idle)) / + (Trunning + Tidle) + + ... + + Tidle = Trunning x ((P(opp)running / P(opp)target) - 1) + +At this point if we know the running period for the CPU, that gives us +the idle injection we need. Alternatively if we have the idle +injection duration, we can compute the running duration with:: + + Trunning = Tidle / ((P(opp)running / P(opp)target) - 1) + +Practically, if the running power is less than the targeted power, we +end up with a negative time value, so obviously the equation usage is +bound to a power reduction, hence a higher OPP is needed to have the +running power greater than the targeted power. + +However, in this demonstration we ignore three aspects: + + * The static leakage is not defined here, we can introduce it in the + equation but assuming it will be zero most of the time as it is + difficult to get the values from the SoC vendors + + * The idle state wake up latency (or entry + exit latency) is not + taken into account, it must be added in the equation in order to + rigorously compute the idle injection + + * The injected idle duration must be greater than the idle state + target residency, otherwise we end up consuming more energy and + potentially invert the mitigation effect + +So the final equation is:: + + Trunning = (Tidle - Twakeup ) x + (((P(opp)dyn + P(opp)static ) - P(opp)target) / P(opp)target ) diff --git a/Documentation/driver-api/thermal/exynos_thermal.rst b/Documentation/driver-api/thermal/exynos_thermal.rst index 5bd556566c70..764df4ab584d 100644 --- a/Documentation/driver-api/thermal/exynos_thermal.rst +++ b/Documentation/driver-api/thermal/exynos_thermal.rst @@ -4,7 +4,7 @@ Kernel driver exynos_tmu Supported chips: -* ARM SAMSUNG EXYNOS4, EXYNOS5 
series of SoC +* ARM Samsung Exynos4, Exynos5 series of SoC Datasheet: Not publicly available @@ -14,7 +14,7 @@ Authors: Amit Daniel <amit.daniel@samsung.com> TMU controller Description: --------------------------- -This driver allows to read temperature inside SAMSUNG EXYNOS4/5 series of SoC. +This driver allows to read temperature inside Samsung Exynos4/5 series of SoC. The chip only exposes the measured 8-bit temperature code value through a register. @@ -43,7 +43,7 @@ The three equations are: Trimming info for 85 degree Celsius (stored at TRIMINFO register) Temperature code measured at 85 degree Celsius which is unchanged -TMU(Thermal Management Unit) in EXYNOS4/5 generates interrupt +TMU(Thermal Management Unit) in Exynos4/5 generates interrupt when temperature exceeds pre-defined levels. The maximum number of configurable threshold is five. The threshold levels are defined as follows:: @@ -67,7 +67,7 @@ TMU driver description: The exynos thermal driver is structured as:: Kernel Core thermal framework - (thermal_core.c, step_wise.c, cpu_cooling.c) + (thermal_core.c, step_wise.c, cpufreq_cooling.c) ^ | | diff --git a/Documentation/driver-api/thermal/index.rst b/Documentation/driver-api/thermal/index.rst index 5ba61d19c6ae..a886028014ab 100644 --- a/Documentation/driver-api/thermal/index.rst +++ b/Documentation/driver-api/thermal/index.rst @@ -8,11 +8,12 @@ Thermal :maxdepth: 1 cpu-cooling-api + cpu-idle-cooling sysfs-api power_allocator exynos_thermal exynos_thermal_emulation - intel_powerclamp nouveau_thermal x86_pkg_temperature_thermal + intel_dptf diff --git a/Documentation/driver-api/thermal/intel_dptf.rst b/Documentation/driver-api/thermal/intel_dptf.rst new file mode 100644 index 000000000000..8fb8c5b2d685 --- /dev/null +++ b/Documentation/driver-api/thermal/intel_dptf.rst @@ -0,0 +1,381 @@ +.. 
SPDX-License-Identifier: GPL-2.0 + +=============================================================== +Intel(R) Dynamic Platform and Thermal Framework Sysfs Interface +=============================================================== + +:Copyright: © 2022 Intel Corporation + +:Author: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> + +Introduction +------------ + +Intel(R) Dynamic Platform and Thermal Framework (DPTF) is a platform +level hardware/software solution for power and thermal management. + +As a container for multiple power/thermal technologies, DPTF provides +a coordinated approach for different policies to effect the hardware +state of a system. + +Since it is a platform level framework, this has several components. +Some parts of the technology is implemented in the firmware and uses +ACPI and PCI devices to expose various features for monitoring and +control. Linux has a set of kernel drivers exposing hardware interface +to user space. This allows user space thermal solutions like +"Linux Thermal Daemon" to read platform specific thermal and power +tables to deliver adequate performance while keeping the system under +thermal limits. + +DPTF ACPI Drivers interface +---------------------------- + +:file:`/sys/bus/platform/devices/<N>/uuids`, where <N> +=INT3400|INTC1040|INTC1041|INTC10A0 + +``available_uuids`` (RO) + A set of UUIDs strings presenting available policies + which should be notified to the firmware when the + user space can support those policies. 
+ + UUID strings: + + "42A441D6-AE6A-462b-A84B-4A8CE79027D3" : Passive 1 + + "3A95C389-E4B8-4629-A526-C52C88626BAE" : Active + + "97C68AE7-15FA-499c-B8C9-5DA81D606E0A" : Critical + + "63BE270F-1C11-48FD-A6F7-3AF253FF3E2D" : Adaptive performance + + "5349962F-71E6-431D-9AE8-0A635B710AEE" : Emergency call + + "9E04115A-AE87-4D1C-9500-0F3E340BFE75" : Passive 2 + + "F5A35014-C209-46A4-993A-EB56DE7530A1" : Power Boss + + "6ED722A7-9240-48A5-B479-31EEF723D7CF" : Virtual Sensor + + "16CAF1B7-DD38-40ED-B1C1-1B8A1913D531" : Cooling mode + + "BE84BABF-C4D4-403D-B495-3128FD44dAC1" : HDC + +``current_uuid`` (RW) + User space can write strings from available UUIDs, one at a + time. + +:file:`/sys/bus/platform/devices/<N>/`, where <N> +=INT3400|INTC1040|INTC1041|INTC10A0 + +``imok`` (WO) + User space daemon write 1 to respond to firmware event + for sending keep alive notification. User space receives + THERMAL_EVENT_KEEP_ALIVE kobject uevent notification when + firmware calls for user space to respond with imok ACPI + method. + +``odvp*`` (RO) + Firmware thermal status variable values. Thermal tables + calls for different processing based on these variable + values. + +``data_vault`` (RO) + Binary thermal table. Refer to + https:/github.com/intel/thermal_daemon for decoding + thermal table. + +``production_mode`` (RO) + When different from zero, manufacturer locked thermal configuration + from further changes. + +ACPI Thermal Relationship table interface +------------------------------------------ + +:file:`/dev/acpi_thermal_rel` + + This device provides IOCTL interface to read standard ACPI + thermal relationship tables via ACPI methods _TRT and _ART. 
+ These IOCTLs are defined in + drivers/thermal/intel/int340x_thermal/acpi_thermal_rel.h + + IOCTLs: + + ACPI_THERMAL_GET_TRT_LEN: Get length of TRT table + + ACPI_THERMAL_GET_ART_LEN: Get length of ART table + + ACPI_THERMAL_GET_TRT_COUNT: Number of records in TRT table + + ACPI_THERMAL_GET_ART_COUNT: Number of records in ART table + + ACPI_THERMAL_GET_TRT: Read binary TRT table, length to read is + provided via argument to ioctl(). + + ACPI_THERMAL_GET_ART: Read binary ART table, length to read is + provided via argument to ioctl(). + +DPTF ACPI Sensor drivers +------------------------- + +DPTF Sensor drivers are presented as standard thermal sysfs thermal_zone. + + +DPTF ACPI Cooling drivers +-------------------------- + +DPTF cooling drivers are presented as standard thermal sysfs cooling_device. + + +DPTF Processor thermal PCI Driver interface +-------------------------------------------- + +:file:`/sys/bus/pci/devices/0000\:00\:04.0/power_limits/` + +Refer to Documentation/power/powercap/powercap.rst for powercap +ABI. 
+ +``power_limit_0_max_uw`` (RO) + Maximum powercap sysfs constraint_0_power_limit_uw for Intel RAPL + +``power_limit_0_step_uw`` (RO) + Power limit increment/decrements for Intel RAPL constraint 0 power limit + +``power_limit_0_min_uw`` (RO) + Minimum powercap sysfs constraint_0_power_limit_uw for Intel RAPL + +``power_limit_0_tmin_us`` (RO) + Minimum powercap sysfs constraint_0_time_window_us for Intel RAPL + +``power_limit_0_tmax_us`` (RO) + Maximum powercap sysfs constraint_0_time_window_us for Intel RAPL + +``power_limit_1_max_uw`` (RO) + Maximum powercap sysfs constraint_1_power_limit_uw for Intel RAPL + +``power_limit_1_step_uw`` (RO) + Power limit increment/decrements for Intel RAPL constraint 1 power limit + +``power_limit_1_min_uw`` (RO) + Minimum powercap sysfs constraint_1_power_limit_uw for Intel RAPL + +``power_limit_1_tmin_us`` (RO) + Minimum powercap sysfs constraint_1_time_window_us for Intel RAPL + +``power_limit_1_tmax_us`` (RO) + Maximum powercap sysfs constraint_1_time_window_us for Intel RAPL + +``power_floor_status`` (RO) + When set to 1, the power floor of the system in the current + configuration has been reached. It needs to be reconfigured to allow + power to be reduced any further. + +``power_floor_enable`` (RW) + When set to 1, enable reading and notification of the power floor + status. Notifications are triggered for the power_floor_status + attribute value changes. + +:file:`/sys/bus/pci/devices/0000\:00\:04.0/` + +``tcc_offset_degree_celsius`` (RW) + TCC offset from the critical temperature where hardware will throttle + CPU. + +:file:`/sys/bus/pci/devices/0000\:00\:04.0/workload_request` + +``workload_available_types`` (RO) + Available workload types. User space can specify one of the workload type + it is currently executing via workload_type. For example: idle, bursty, + sustained etc. + +``workload_type`` (RW) + User space can specify any one of the available workload type using + this interface. 
+ +DPTF Processor thermal RFIM interface +-------------------------------------------- + +RFIM interface allows adjustment of FIVR (Fully Integrated Voltage Regulator), +DDR (Double Data Rate) and DLVR (Digital Linear Voltage Regulator) +frequencies to avoid RF interference with WiFi and 5G. + +Switching voltage regulators (VR) generate radiated EMI or RFI at the +fundamental frequency and its harmonics. Some harmonics may interfere +with very sensitive wireless receivers such as Wi-Fi and cellular that +are integrated into host systems like notebook PCs. One of mitigation +methods is requesting SOC integrated VR (IVR) switching frequency to a +small % and shift away the switching noise harmonic interference from +radio channels. OEM or ODMs can use the driver to control SOC IVR +operation within the range where it does not impact IVR performance. + +Some products use DLVR instead of FIVR as switching voltage regulator. +In this case attributes of DLVR must be adjusted instead of FIVR. + +While shifting the frequencies additional clock noise can be introduced, +which is compensated by adjusting Spread spectrum percent. This helps +to reduce the clock noise to meet regulatory compliance. This spreading +% increases bandwidth of signal transmission and hence reduces the +effects of interference, noise and signal fading. + +DRAM devices of DDR IO interface and their power plane can generate EMI +at the data rates. Similar to IVR control mechanism, Intel offers a +mechanism by which DDR data rates can be changed if several conditions +are met: there is strong RFI interference because of DDR; CPU power +management has no other restriction in changing DDR data rates; +PC ODMs enable this feature (real time DDR RFI Mitigation referred to as +DDR-RFIM) for Wi-Fi from BIOS. + + +FIVR attributes + +:file:`/sys/bus/pci/devices/0000\:00\:04.0/fivr/` + +``vco_ref_code_lo`` (RW) + The VCO reference code is an 11-bit field and controls the FIVR + switching frequency. 
This is the 3-bit LSB field. + +``vco_ref_code_hi`` (RW) + The VCO reference code is an 11-bit field and controls the FIVR + switching frequency. This is the 8-bit MSB field. + +``spread_spectrum_pct`` (RW) + Set the FIVR spread spectrum clocking percentage + +``spread_spectrum_clk_enable`` (RW) + Enable/disable of the FIVR spread spectrum clocking feature + +``rfi_vco_ref_code`` (RW) + This field is a read only status register which reflects the + current FIVR switching frequency + +``fivr_fffc_rev`` (RW) + This field indicated the revision of the FIVR HW. + + +DVFS attributes + +:file:`/sys/bus/pci/devices/0000\:00\:04.0/dvfs/` + +``rfi_restriction_run_busy`` (RW) + Request the restriction of specific DDR data rate and set this + value 1. Self reset to 0 after operation. + +``rfi_restriction_err_code`` (RW) + 0 :Request is accepted, 1:Feature disabled, + 2: the request restricts more points than it is allowed + +``rfi_restriction_data_rate_Delta`` (RW) + Restricted DDR data rate for RFI protection: Lower Limit + +``rfi_restriction_data_rate_Base`` (RW) + Restricted DDR data rate for RFI protection: Upper Limit + +``ddr_data_rate_point_0`` (RO) + DDR data rate selection 1st point + +``ddr_data_rate_point_1`` (RO) + DDR data rate selection 2nd point + +``ddr_data_rate_point_2`` (RO) + DDR data rate selection 3rd point + +``ddr_data_rate_point_3`` (RO) + DDR data rate selection 4th point + +``rfi_disable (RW)`` + Disable DDR rate change feature + +DLVR attributes + +:file:`/sys/bus/pci/devices/0000\:00\:04.0/dlvr/` + +``dlvr_hardware_rev`` (RO) + DLVR hardware revision. + +``dlvr_freq_mhz`` (RO) + Current DLVR PLL frequency in MHz. + +``dlvr_freq_select`` (RW) + Sets DLVR PLL clock frequency. Once set, and enabled via + dlvr_rfim_enable, the dlvr_freq_mhz will show the current + DLVR PLL frequency. + +``dlvr_pll_busy`` (RO) + PLL can't accept frequency change when set. + +``dlvr_rfim_enable`` (RW) + 0: Disable RF frequency hopping, 1: Enable RF frequency hopping. 
+ +``dlvr_spread_spectrum_pct`` (RW) + Sets DLVR spread spectrum percent value. + +``dlvr_control_mode`` (RW) + Specifies how frequencies are spread using spread spectrum. + 0: Down spread, + 1: Spread in the Center. + +``dlvr_control_lock`` (RW) + 1: future writes are ignored. + +DPTF Power supply and Battery Interface +---------------------------------------- + +Refer to Documentation/ABI/testing/sysfs-platform-dptf + +DPTF Fan Control +---------------------------------------- + +Refer to Documentation/admin-guide/acpi/fan_performance_states.rst + +Workload Type Hints +---------------------------------------- + +The firmware in Meteor Lake processor generation is capable of identifying +workload type and passing hints regarding it to the OS. A special sysfs +interface is provided to allow user space to obtain workload type hints from +the firmware and control the rate at which they are provided. + +User space can poll attribute "workload_type_index" for the current hint or +can receive a notification whenever the value of this attribute is updated. + +file:`/sys/bus/pci/devices/0000:00:04.0/workload_hint/` +Segment 0, bus 0, device 4, function 0 is reserved for the processor thermal +device on all Intel client processors. So, the above path doesn't change +based on the processor generation. + +``workload_hint_enable`` (RW) + Enable firmware to send workload type hints to user space. + +``notification_delay_ms`` (RW) + Minimum delay in milliseconds before firmware will notify OS. This is + for the rate control of notifications. This delay is between changing + the workload type prediction in the firmware and notifying the OS about + the change. The default delay is 1024 ms. The delay of 0 is invalid. + The delay is rounded up to the nearest power of 2 to simplify firmware + programming of the delay value. The read of notification_delay_ms + attribute shows the effective value used. + +``workload_type_index`` (RO) + Predicted workload type index. 
User space can get notification of + change via existing sysfs attribute change notification mechanism. + + The supported index values and their meaning for the Meteor Lake + processor generation are as follows: + + 0 - Idle: System performs no tasks, power and idle residency are + consistently low for long periods of time. + + 1 – Battery Life: Power is relatively low, but the processor may + still be actively performing a task, such as video playback for + a long period of time. + + 2 – Sustained: Power level that is relatively high for a long period + of time, with very few to no periods of idleness, which will + eventually exhaust RAPL Power Limit 1 and 2. + + 3 – Bursty: Consumes a relatively constant average amount of power, but + periods of relative idleness are interrupted by bursts of + activity. The bursts are relatively short and the periods of + relative idleness between them typically prevent RAPL Power + Limit 1 from being exhausted. + + 4 – Unknown: Can't classify. diff --git a/Documentation/driver-api/thermal/intel_powerclamp.rst b/Documentation/driver-api/thermal/intel_powerclamp.rst deleted file mode 100644 index 3f6dfb0b3ea6..000000000000 --- a/Documentation/driver-api/thermal/intel_powerclamp.rst +++ /dev/null @@ -1,320 +0,0 @@ -======================= -Intel Powerclamp Driver -======================= - -By: - - Arjan van de Ven <arjan@linux.intel.com> - - Jacob Pan <jacob.jun.pan@linux.intel.com> - -.. 
Contents: - - (*) Introduction - - Goals and Objectives - - (*) Theory of Operation - - Idle Injection - - Calibration - - (*) Performance Analysis - - Effectiveness and Limitations - - Power vs Performance - - Scalability - - Calibration - - Comparison with Alternative Techniques - - (*) Usage and Interfaces - - Generic Thermal Layer (sysfs) - - Kernel APIs (TBD) - -INTRODUCTION -============ - -Consider the situation where a system’s power consumption must be -reduced at runtime, due to power budget, thermal constraint, or noise -level, and where active cooling is not preferred. Software managed -passive power reduction must be performed to prevent the hardware -actions that are designed for catastrophic scenarios. - -Currently, P-states, T-states (clock modulation), and CPU offlining -are used for CPU throttling. - -On Intel CPUs, C-states provide effective power reduction, but so far -they’re only used opportunistically, based on workload. With the -development of intel_powerclamp driver, the method of synchronizing -idle injection across all online CPU threads was introduced. The goal -is to achieve forced and controllable C-state residency. - -Test/Analysis has been made in the areas of power, performance, -scalability, and user experience. In many cases, clear advantage is -shown over taking the CPU offline or modulating the CPU clock. - - -THEORY OF OPERATION -=================== - -Idle Injection --------------- - -On modern Intel processors (Nehalem or later), package level C-state -residency is available in MSRs, thus also available to the kernel. - -These MSRs are:: - - #define MSR_PKG_C2_RESIDENCY 0x60D - #define MSR_PKG_C3_RESIDENCY 0x3F8 - #define MSR_PKG_C6_RESIDENCY 0x3F9 - #define MSR_PKG_C7_RESIDENCY 0x3FA - -If the kernel can also inject idle time to the system, then a -closed-loop control system can be established that manages package -level C-state. 
The intel_powerclamp driver is conceived as such a -control system, where the target set point is a user-selected idle -ratio (based on power reduction), and the error is the difference -between the actual package level C-state residency ratio and the target idle -ratio. - -Injection is controlled by high priority kernel threads, spawned for -each online CPU. - -These kernel threads, with SCHED_FIFO class, are created to perform -clamping actions of controlled duty ratio and duration. Each per-CPU -thread synchronizes its idle time and duration, based on the rounding -of jiffies, so accumulated errors can be prevented to avoid a jittery -effect. Threads are also bound to the CPU such that they cannot be -migrated, unless the CPU is taken offline. In this case, threads -belong to the offlined CPUs will be terminated immediately. - -Running as SCHED_FIFO and relatively high priority, also allows such -scheme to work for both preemptable and non-preemptable kernels. -Alignment of idle time around jiffies ensures scalability for HZ -values. This effect can be better visualized using a Perf timechart. -The following diagram shows the behavior of kernel thread -kidle_inject/cpu. During idle injection, it runs monitor/mwait idle -for a given "duration", then relinquishes the CPU to other tasks, -until the next time interval. - -The NOHZ schedule tick is disabled during idle time, but interrupts -are not masked. Tests show that the extra wakeups from scheduler tick -have a dramatic impact on the effectiveness of the powerclamp driver -on large scale systems (Westmere system with 80 processors). - -:: - - CPU0 - ____________ ____________ - kidle_inject/0 | sleep | mwait | sleep | - _________| |________| |_______ - duration - CPU1 - ____________ ____________ - kidle_inject/1 | sleep | mwait | sleep | - _________| |________| |_______ - ^ - | - | - roundup(jiffies, interval) - -Only one CPU is allowed to collect statistics and update global -control parameters. 
This CPU is referred to as the controlling CPU in -this document. The controlling CPU is elected at runtime, with a -policy that favors BSP, taking into account the possibility of a CPU -hot-plug. - -In terms of dynamics of the idle control system, package level idle -time is considered largely as a non-causal system where its behavior -cannot be based on the past or current input. Therefore, the -intel_powerclamp driver attempts to enforce the desired idle time -instantly as given input (target idle ratio). After injection, -powerclamp monitors the actual idle for a given time window and adjust -the next injection accordingly to avoid over/under correction. - -When used in a causal control system, such as a temperature control, -it is up to the user of this driver to implement algorithms where -past samples and outputs are included in the feedback. For example, a -PID-based thermal controller can use the powerclamp driver to -maintain a desired target temperature, based on integral and -derivative gains of the past samples. - - - -Calibration ------------ -During scalability testing, it is observed that synchronized actions -among CPUs become challenging as the number of cores grows. This is -also true for the ability of a system to enter package level C-states. - -To make sure the intel_powerclamp driver scales well, online -calibration is implemented. The goals for doing such a calibration -are: - -a) determine the effective range of idle injection ratio -b) determine the amount of compensation needed at each target ratio - -Compensation to each target ratio consists of two parts: - - a) steady state error compensation - This is to offset the error occurring when the system can - enter idle without extra wakeups (such as external interrupts). - - b) dynamic error compensation - When an excessive amount of wakeups occurs during idle, an - additional idle ratio can be added to quiet interrupts, by - slowing down CPU activities. 
- -A debugfs file is provided for the user to examine compensation -progress and results, such as on a Westmere system:: - - [jacob@nex01 ~]$ cat - /sys/kernel/debug/intel_powerclamp/powerclamp_calib - controlling cpu: 0 - pct confidence steady dynamic (compensation) - 0 0 0 0 - 1 1 0 0 - 2 1 1 0 - 3 3 1 0 - 4 3 1 0 - 5 3 1 0 - 6 3 1 0 - 7 3 1 0 - 8 3 1 0 - ... - 30 3 2 0 - 31 3 2 0 - 32 3 1 0 - 33 3 2 0 - 34 3 1 0 - 35 3 2 0 - 36 3 1 0 - 37 3 2 0 - 38 3 1 0 - 39 3 2 0 - 40 3 3 0 - 41 3 1 0 - 42 3 2 0 - 43 3 1 0 - 44 3 1 0 - 45 3 2 0 - 46 3 3 0 - 47 3 0 0 - 48 3 2 0 - 49 3 3 0 - -Calibration occurs during runtime. No offline method is available. -Steady state compensation is used only when confidence levels of all -adjacent ratios have reached satisfactory level. A confidence level -is accumulated based on clean data collected at runtime. Data -collected during a period without extra interrupts is considered -clean. - -To compensate for excessive amounts of wakeup during idle, additional -idle time is injected when such a condition is detected. Currently, -we have a simple algorithm to double the injection ratio. A possible -enhancement might be to throttle the offending IRQ, such as delaying -EOI for level triggered interrupts. But it is a challenge to be -non-intrusive to the scheduler or the IRQ core code. - - -CPU Online/Offline ------------------- -Per-CPU kernel threads are started/stopped upon receiving -notifications of CPU hotplug activities. The intel_powerclamp driver -keeps track of clamping kernel threads, even after they are migrated -to other CPUs, after a CPU offline event. - - -Performance Analysis -==================== -This section describes the general performance data collected on -multiple systems, including Westmere (80P) and Ivy Bridge (4P, 8P). - -Effectiveness and Limitations ------------------------------ -The maximum range that idle injection is allowed is capped at 50 -percent. 
As mentioned earlier, since interrupts are allowed during -forced idle time, excessive interrupts could result in less -effectiveness. The extreme case would be doing a ping -f to generated -flooded network interrupts without much CPU acknowledgement. In this -case, little can be done from the idle injection threads. In most -normal cases, such as scp a large file, applications can be throttled -by the powerclamp driver, since slowing down the CPU also slows down -network protocol processing, which in turn reduces interrupts. - -When control parameters change at runtime by the controlling CPU, it -may take an additional period for the rest of the CPUs to catch up -with the changes. During this time, idle injection is out of sync, -thus not able to enter package C- states at the expected ratio. But -this effect is minor, in that in most cases change to the target -ratio is updated much less frequently than the idle injection -frequency. - -Scalability ------------ -Tests also show a minor, but measurable, difference between the 4P/8P -Ivy Bridge system and the 80P Westmere server under 50% idle ratio. -More compensation is needed on Westmere for the same amount of -target idle ratio. The compensation also increases as the idle ratio -gets larger. The above reason constitutes the need for the -calibration code. - -On the IVB 8P system, compared to an offline CPU, powerclamp can -achieve up to 40% better performance per watt. (measured by a spin -counter summed over per CPU counting threads spawned for all running -CPUs). - -Usage and Interfaces -==================== -The powerclamp driver is registered to the generic thermal layer as a -cooling device. Currently, it’s not bound to any thermal zones:: - - jacob@chromoly:/sys/class/thermal/cooling_device14$ grep . * - cur_state:0 - max_state:50 - type:intel_powerclamp - -cur_state allows user to set the desired idle percentage. Writing 0 to -cur_state will stop idle injection. 
Writing a value between 1 and -max_state will start the idle injection. Reading cur_state returns the -actual and current idle percentage. This may not be the same value -set by the user in that current idle percentage depends on workload -and includes natural idle. When idle injection is disabled, reading -cur_state returns value -1 instead of 0 which is to avoid confusing -100% busy state with the disabled state. - -Example usage: -- To inject 25% idle time:: - - $ sudo sh -c "echo 25 > /sys/class/thermal/cooling_device80/cur_state - -If the system is not busy and has more than 25% idle time already, -then the powerclamp driver will not start idle injection. Using Top -will not show idle injection kernel threads. - -If the system is busy (spin test below) and has less than 25% natural -idle time, powerclamp kernel threads will do idle injection. Forced -idle time is accounted as normal idle in that common code path is -taken as the idle task. - -In this example, 24.1% idle is shown. This helps the system admin or -user determine the cause of slowdown, when a powerclamp driver is in action:: - - - Tasks: 197 total, 1 running, 196 sleeping, 0 stopped, 0 zombie - Cpu(s): 71.2%us, 4.7%sy, 0.0%ni, 24.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st - Mem: 3943228k total, 1689632k used, 2253596k free, 74960k buffers - Swap: 4087804k total, 0k used, 4087804k free, 945336k cached - - PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND - 3352 jacob 20 0 262m 644 428 S 286 0.0 0:17.16 spin - 3341 root -51 0 0 0 0 D 25 0.0 0:01.62 kidle_inject/0 - 3344 root -51 0 0 0 0 D 25 0.0 0:01.60 kidle_inject/3 - 3342 root -51 0 0 0 0 D 25 0.0 0:01.61 kidle_inject/1 - 3343 root -51 0 0 0 0 D 25 0.0 0:01.60 kidle_inject/2 - 2935 jacob 20 0 696m 125m 35m S 5 3.3 0:31.11 firefox - 1546 root 20 0 158m 20m 6640 S 3 0.5 0:26.97 Xorg - 2100 jacob 20 0 1223m 88m 30m S 3 2.3 0:23.68 compiz - -Tests have shown that by using the powerclamp driver as a cooling -device, a PID based userspace thermal 
controller can manage to -control CPU temperature effectively, when no other thermal influence -is added. For example, a UltraBook user can compile the kernel under -certain temperature (below most active trip points). diff --git a/Documentation/driver-api/thermal/nouveau_thermal.rst b/Documentation/driver-api/thermal/nouveau_thermal.rst index 37255fd6735d..aa10db6df309 100644 --- a/Documentation/driver-api/thermal/nouveau_thermal.rst +++ b/Documentation/driver-api/thermal/nouveau_thermal.rst @@ -90,7 +90,7 @@ Bug reports ----------- Thermal management on Nouveau is new and may not work on all cards. If you have -inquiries, please ping mupuf on IRC (#nouveau, freenode). +inquiries, please ping mupuf on IRC (#nouveau, OFTC). Bug reports should be filled on Freedesktop's bug tracker. Please follow -http://nouveau.freedesktop.org/wiki/Bugs +https://nouveau.freedesktop.org/wiki/Bugs diff --git a/Documentation/driver-api/thermal/power_allocator.rst b/Documentation/driver-api/thermal/power_allocator.rst index 67b6a3297238..aa5f66552d6f 100644 --- a/Documentation/driver-api/thermal/power_allocator.rst +++ b/Documentation/driver-api/thermal/power_allocator.rst @@ -71,7 +71,9 @@ to the speed-grade of the silicon. `sustainable_power` is therefore simply an estimate, and may be tuned to affect the aggressiveness of the thermal ramp. For reference, the sustainable power of a 4" phone is typically 2000mW, while on a 10" tablet is around 4500mW (may vary -depending on screen size). +depending on screen size). It is possible to have the power value +expressed in an abstract scale. The sustained power should be aligned +to the scale used by the related cooling devices. If you are using device tree, do add it as a property of the thermal-zone. For example:: @@ -269,3 +271,11 @@ won't be very good. 
Note that this is not particular to this governor, step-wise will also misbehave if you call its throttle() faster than the normal thermal framework tick (due to interrupts for example) as it will overreact. + +Energy Model requirements +========================= + +Another important thing is the consistent scale of the power values +provided by the cooling devices. All of the cooling devices in a single +thermal zone should have power values reported either in milli-Watts +or scaled to the same 'abstract scale'. diff --git a/Documentation/driver-api/thermal/sysfs-api.rst b/Documentation/driver-api/thermal/sysfs-api.rst index b40b1f839148..6c1175c6afba 100644 --- a/Documentation/driver-api/thermal/sysfs-api.rst +++ b/Documentation/driver-api/thermal/sysfs-api.rst @@ -54,7 +54,7 @@ temperature) and throttle appropriate devices. trips: the total number of trip points this thermal zone supports. mask: - Bit string: If 'n'th bit is set, then trip point 'n' is writeable. + Bit string: If 'n'th bit is set, then trip point 'n' is writable. devdata: device private data ops: @@ -306,42 +306,6 @@ temperature) and throttle appropriate devices. :: - struct thermal_bind_params - - This structure defines the following parameters that are used to bind - a zone with a cooling device for a particular trip point. - - .cdev: - The cooling device pointer - .weight: - The 'influence' of a particular cooling device on this - zone. This is relative to the rest of the cooling - devices. For example, if all cooling devices have a - weight of 1, then they all contribute the same. You can - use percentages if you want, but it's not mandatory. A - weight of 0 means that this cooling device doesn't - contribute to the cooling of this zone unless all cooling - devices have a weight of 0. If all weights are 0, then - they all contribute the same. - .trip_mask: - This is a bit mask that gives the binding relation between - this thermal zone and cdev, for a particular trip point. 
- If nth bit is set, then the cdev and thermal zone are bound - for trip point n. - .binding_limits: - This is an array of cooling state limits. Must have - exactly 2 * thermal_zone.number_of_trip_points. It is an - array consisting of tuples <lower-state upper-state> of - state limits. Each trip will be associated with one state - limit tuple when binding. A NULL pointer means - <THERMAL_NO_LIMITS THERMAL_NO_LIMITS> on all trips. - These limits are used when binding a cdev to a trip point. - .match: - This call back returns success(0) if the 'tz and cdev' need to - be bound, as per platform data. - - :: - struct thermal_zone_params This structure defines the platform level parameters for a thermal zone. @@ -357,10 +321,6 @@ temperature) and throttle appropriate devices. will be created. when no_hwmon == true, nothing will be done. In case the thermal_zone_params is NULL, the hwmon interface will be created (for backward compatibility). - .num_tbps: - Number of thermal_bind_params entries for this zone - .tbp: - thermal_bind_params entries 2. sysfs attributes structure ============================= @@ -406,7 +366,7 @@ Thermal cooling device sys I/F, created once it's registered:: |---stats/reset: Writing any value resets the statistics |---stats/time_in_state_ms: Time (msec) spent in various cooling states |---stats/total_trans: Total number of times cooling state is changed - |---stats/trans_table: Cooing state transition table + |---stats/trans_table: Cooling state transition table Then next two dynamic attributes are created/removed in pairs. They represent @@ -428,6 +388,9 @@ of thermal zone device. E.g. the generic thermal driver registers one hwmon class device and build the associated hwmon sysfs I/F for all the registered ACPI thermal zones. +Please read Documentation/ABI/testing/sysfs-class-thermal for thermal +zone and cooling device attribute details. + :: /sys/class/hwmon/hwmon[0-*]: @@ -437,242 +400,6 @@ ACPI thermal zones. 
Please read Documentation/hwmon/sysfs-interface.rst for additional information. -Thermal zone attributes ------------------------ - -type - Strings which represent the thermal zone type. - This is given by thermal zone driver as part of registration. - E.g: "acpitz" indicates it's an ACPI thermal device. - In order to keep it consistent with hwmon sys attribute; this should - be a short, lowercase string, not containing spaces nor dashes. - RO, Required - -temp - Current temperature as reported by thermal zone (sensor). - Unit: millidegree Celsius - RO, Required - -mode - One of the predefined values in [enabled, disabled]. - This file gives information about the algorithm that is currently - managing the thermal zone. It can be either default kernel based - algorithm or user space application. - - enabled - enable Kernel Thermal management. - disabled - Preventing kernel thermal zone driver actions upon - trip points so that user application can take full - charge of the thermal management. - - RW, Optional - -policy - One of the various thermal governors used for a particular zone. - - RW, Required - -available_policies - Available thermal governors which can be used for a particular zone. - - RO, Required - -`trip_point_[0-*]_temp` - The temperature above which trip point will be fired. - - Unit: millidegree Celsius - - RO, Optional - -`trip_point_[0-*]_type` - Strings which indicate the type of the trip point. - - E.g. it can be one of critical, hot, passive, `active[0-*]` for ACPI - thermal zone. - - RO, Optional - -`trip_point_[0-*]_hyst` - The hysteresis value for a trip point, represented as an integer - Unit: Celsius - RW, Optional - -`cdev[0-*]` - Sysfs link to the thermal cooling device node where the sys I/F - for cooling device throttling control represents. - - RO, Optional - -`cdev[0-*]_trip_point` - The trip point in this thermal zone which `cdev[0-*]` is associated - with; -1 means the cooling device is not associated with any trip - point. 
- - RO, Optional - -`cdev[0-*]_weight` - The influence of `cdev[0-*]` in this thermal zone. This value - is relative to the rest of cooling devices in the thermal - zone. For example, if a cooling device has a weight double - than that of other, it's twice as effective in cooling the - thermal zone. - - RW, Optional - -passive - Attribute is only present for zones in which the passive cooling - policy is not supported by native thermal driver. Default is zero - and can be set to a temperature (in millidegrees) to enable a - passive trip point for the zone. Activation is done by polling with - an interval of 1 second. - - Unit: millidegrees Celsius - - Valid values: 0 (disabled) or greater than 1000 - - RW, Optional - -emul_temp - Interface to set the emulated temperature method in thermal zone - (sensor). After setting this temperature, the thermal zone may pass - this temperature to platform emulation function if registered or - cache it locally. This is useful in debugging different temperature - threshold and its associated cooling action. This is write only node - and writing 0 on this node should disable emulation. - Unit: millidegree Celsius - - WO, Optional - - WARNING: - Be careful while enabling this option on production systems, - because userland can easily disable the thermal policy by simply - flooding this sysfs node with low temperature values. - -sustainable_power - An estimate of the sustained power that can be dissipated by - the thermal zone. Used by the power allocator governor. For - more information see Documentation/driver-api/thermal/power_allocator.rst - - Unit: milliwatts - - RW, Optional - -k_po - The proportional term of the power allocator governor's PID - controller during temperature overshoot. Temperature overshoot - is when the current temperature is above the "desired - temperature" trip point. 
For more information see - Documentation/driver-api/thermal/power_allocator.rst - - RW, Optional - -k_pu - The proportional term of the power allocator governor's PID - controller during temperature undershoot. Temperature undershoot - is when the current temperature is below the "desired - temperature" trip point. For more information see - Documentation/driver-api/thermal/power_allocator.rst - - RW, Optional - -k_i - The integral term of the power allocator governor's PID - controller. This term allows the PID controller to compensate - for long term drift. For more information see - Documentation/driver-api/thermal/power_allocator.rst - - RW, Optional - -k_d - The derivative term of the power allocator governor's PID - controller. For more information see - Documentation/driver-api/thermal/power_allocator.rst - - RW, Optional - -integral_cutoff - Temperature offset from the desired temperature trip point - above which the integral term of the power allocator - governor's PID controller starts accumulating errors. For - example, if integral_cutoff is 0, then the integral term only - accumulates error when temperature is above the desired - temperature trip point. For more information see - Documentation/driver-api/thermal/power_allocator.rst - - Unit: millidegree Celsius - - RW, Optional - -slope - The slope constant used in a linear extrapolation model - to determine a hotspot temperature based off the sensor's - raw readings. It is up to the device driver to determine - the usage of these values. - - RW, Optional - -offset - The offset constant used in a linear extrapolation model - to determine a hotspot temperature based off the sensor's - raw readings. It is up to the device driver to determine - the usage of these values. 
- - RW, Optional - -Cooling device attributes -------------------------- - -type - String which represents the type of device, e.g: - - - for generic ACPI: should be "Fan", "Processor" or "LCD" - - for memory controller device on intel_menlow platform: - should be "Memory controller". - - RO, Required - -max_state - The maximum permissible cooling state of this cooling device. - - RO, Required - -cur_state - The current cooling state of this cooling device. - The value can any integer numbers between 0 and max_state: - - - cur_state == 0 means no cooling - - cur_state == max_state means the maximum cooling. - - RW, Required - -stats/reset - Writing any value resets the cooling device's statistics. - WO, Required - -stats/time_in_state_ms: - The amount of time spent by the cooling device in various cooling - states. The output will have "<state> <time>" pair in each line, which - will mean this cooling device spent <time> msec of time at <state>. - Output will have one line for each of the supported states. usertime - units here is 10mS (similar to other time exported in /proc). - RO, Required - - -stats/total_trans: - A single positive value showing the total number of times the state of a - cooling device is changed. - - RO, Required - -stats/trans_table: - This gives fine grained information about all the cooling state - transitions. The cat output here is a two dimensional matrix, where an - entry <i,j> (row i, column j) represents the number of transitions from - State_i to State_j. If the transition table is bigger than PAGE_SIZE, - reading this will return an -EFBIG error. - RO, Required - 3. A simple implementation ========================== @@ -744,17 +471,7 @@ This function returns the thermal_instance corresponding to a given {thermal_zone, cooling_device, trip_point} combination. Returns NULL if such an instance does not exist. -4.3. thermal_notify_framework ------------------------------ - -This function handles the trip events from sensor drivers. 
It starts -throttling the cooling devices according to the policy configured. -For CRITICAL and HOT trip points, this notifies the respective drivers, -and does actual throttling for other trip points i.e ACTIVE and PASSIVE. -The throttling policy is based on the configured platform data; if no -platform data is provided, this uses the step_wise throttling policy. - -4.4. thermal_cdev_update +4.3. thermal_cdev_update ------------------------ This function serves as an arbitrator to set the state of a cooling @@ -764,21 +481,15 @@ possible. 5. thermal_emergency_poweroff ============================= -On an event of critical trip temperature crossing. Thermal framework -allows the system to shutdown gracefully by calling orderly_poweroff(). -In the event of a failure of orderly_poweroff() to shut down the system -we are in danger of keeping the system alive at undesirably high -temperatures. To mitigate this high risk scenario we program a work -queue to fire after a pre-determined number of seconds to start -an emergency shutdown of the device using the kernel_power_off() -function. In case kernel_power_off() fails then finally -emergency_restart() is called in the worst case. +On an event of critical trip temperature crossing the thermal framework +shuts down the system by calling hw_protection_shutdown(). The +hw_protection_shutdown() first attempts to perform an orderly shutdown +but accepts a delay after which it proceeds doing a forced power-off +or as last resort an emergency_restart. The delay should be carefully profiled so as to give adequate time for -orderly_poweroff(). In case of failure of an orderly_poweroff() the -emergency poweroff kicks in after the delay has elapsed and shuts down -the system. +orderly poweroff. -If set to 0 emergency poweroff will not be supported. So a carefully -profiled non-zero positive value is a must for emergerncy poweroff to be -triggered. +If the delay is set to 0 emergency poweroff will not be supported. 
So a +carefully profiled non-zero positive value is a must for emergency +poweroff to be triggered.
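
As an illustration only, the fallback order described in the thermal_emergency_poweroff text above can be modelled in a few lines of Python. This is a toy model under stated assumptions, not kernel code: the function name `simulate_shutdown`, its parameters, and the stage labels are hypothetical, and the sketch neither calls nor reproduces hw_protection_shutdown(). It only encodes what the text says: an orderly shutdown is tried first, a forced power-off follows once the delay has elapsed, an emergency restart is the last resort, and a delay of 0 disables the emergency path.

```python
from typing import List


def simulate_shutdown(delay_ms: int, orderly_ok: bool, forced_ok: bool) -> List[str]:
    """Return the shutdown stages attempted, in order (hypothetical model).

    delay_ms   -- assumed stand-in for the configured emergency poweroff delay
    orderly_ok -- whether the orderly shutdown completes before the delay expires
    forced_ok  -- whether the forced power-off succeeds
    """
    attempted = ["orderly shutdown"]
    if orderly_ok:
        return attempted

    # Per the text: with a delay of 0, emergency poweroff is not supported,
    # so nothing follows a failed orderly shutdown.
    if delay_ms <= 0:
        return attempted

    # The delay has elapsed without a successful orderly shutdown.
    attempted.append("forced power-off")
    if forced_ok:
        return attempted

    # Last resort when the forced power-off also fails.
    attempted.append("emergency restart")
    return attempted


if __name__ == "__main__":
    print(simulate_shutdown(5000, orderly_ok=False, forced_ok=False))
```

The model makes the cost of a delay of 0 visible: a failed orderly shutdown then has no fallback at all, which is why the text asks for a carefully profiled non-zero value.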