8 files changed, 224 insertions, 121 deletions
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 46b26bfee27b..281c85839c17 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -223,7 +223,7 @@
 
 	acpi_sleep=	[HW,ACPI] Sleep options
 			Format: { s3_bios, s3_mode, s3_beep, s4_nohwsig,
-				  old_ordering, nonvs, sci_force_enable }
+				  old_ordering, nonvs, sci_force_enable, nobl }
 			See Documentation/power/video.txt for information on
 			s3_bios and s3_mode.
 			s3_beep is for debugging; it makes the PC's speaker beep
@@ -239,6 +239,9 @@
 			sci_force_enable causes the kernel to set SCI_EN directly
 			on resume from S1/S3 (which is against the ACPI spec,
 			but some broken systems don't work without it).
+			nobl causes the internal blacklist of systems known to
+			behave incorrectly in some ways with respect to system
+			suspend and resume to be ignored (use wisely).
 
 	acpi_use_timer_override [HW,ACPI]
 			Use timer override. For some broken Nvidia NF5 boards
diff --git a/Documentation/devicetree/bindings/arm/marvell/armada-37xx.txt b/Documentation/devicetree/bindings/arm/marvell/armada-37xx.txt
index 51336e5fc761..35c3c3460d17 100644
--- a/Documentation/devicetree/bindings/arm/marvell/armada-37xx.txt
+++ b/Documentation/devicetree/bindings/arm/marvell/armada-37xx.txt
@@ -14,3 +14,22 @@ following property before the previous one:
 Example:
 
 compatible = "marvell,armada-3720-db", "marvell,armada3720", "marvell,armada3710";
+
+
+Power management
+----------------
+
+For power management (particularly DVFS and AVS), the North Bridge
+Power Management component is needed:
+
+Required properties:
+- compatible     : should contain "marvell,armada-3700-nb-pm", "syscon";
+- reg            : the register start and length for the North Bridge
+		    Power Management
+
+Example:
+
+nb_pm: syscon@14000 {
+	compatible = "marvell,armada-3700-nb-pm", "syscon";
+	reg = <0x14000 0x60>;
+}
diff --git a/Documentation/devicetree/bindings/opp/opp.txt b/Documentation/devicetree/bindings/opp/opp.txt
index 9d733af26be7..4e4f30288c8b 100644
--- a/Documentation/devicetree/bindings/opp/opp.txt
+++ b/Documentation/devicetree/bindings/opp/opp.txt
@@ -45,6 +45,11 @@ Devices supporting OPPs must set their "operating-points-v2" property with
 phandle to a OPP table in their DT node. The OPP core will use this phandle to
 find the operating points for the device.
 
+This can contain more than one phandle for power domain providers that provide
+multiple power domains. That is, one phandle for each power domain. If only one
+phandle is available, then the same OPP table will be used for all power domains
+provided by the power domain provider.
+
 If required, this can be extended for SoC vendor specific bindings. Such bindings
 should be documented as Documentation/devicetree/bindings/power/<vendor>-opp.txt
 and should have a compatible description like: "operating-points-v2-<vendor>".
@@ -154,6 +159,14 @@ Optional properties:
 
 - status: Marks the node enabled/disabled.
 
+- required-opp: This contains phandle to an OPP node in another device's OPP
+  table. It may contain an array of phandles, where each phandle points to an
+  OPP of a different device. It should not contain multiple phandles to the OPP
+  nodes in the same OPP table. This specifies the minimum required OPP of the
+  device(s), whose OPP's phandle is present in this property, for the
+  functioning of the current device at the current OPP (where this property is
+  present).
+
 Example 1: Single cluster Dual-core ARM cortex A9, switch DVFS states together.
 
 / {
diff --git a/Documentation/devicetree/bindings/opp/ti-omap5-opp-supply.txt b/Documentation/devicetree/bindings/opp/ti-omap5-opp-supply.txt
new file mode 100644
index 000000000000..832346e489a3
--- /dev/null
+++ b/Documentation/devicetree/bindings/opp/ti-omap5-opp-supply.txt
@@ -0,0 +1,63 @@
+Texas Instruments OMAP compatible OPP supply description
+
+OMAP5, DRA7, and AM57 family of SoCs have Class0 AVS eFuse registers which
+contain data that can be used to adjust voltages programmed for some of their
+supplies for more efficient operation. This binding provides the information
+needed to read these values and use them to program the main regulator during
+an OPP transitions.
+
+Also, some supplies may have an associated vbb-supply which is an Adaptive Body
+Bias regulator which much be transitioned in a specific sequence with regards
+to the vdd-supply and clk when making an OPP transition. By supplying two
+regulators to the device that will undergo OPP transitions we can make use
+of the multi regulator binding that is part of the OPP core described here [1]
+to describe both regulators needed by the platform.
+
+[1] Documentation/devicetree/bindings/opp/opp.txt
+
+Required Properties for Device Node:
+- vdd-supply: phandle to regulator controlling VDD supply
+- vbb-supply: phandle to regulator controlling Body Bias supply
+	      (Usually Adaptive Body Bias regulator)
+
+Required Properties for opp-supply node:
+- compatible: Should be one of:
+	"ti,omap-opp-supply" - basic OPP supply controlling VDD and VBB
+	"ti,omap5-opp-supply" - OMAP5+ optimized voltages in efuse(class0)VDD
+			    along with VBB
+	"ti,omap5-core-opp-supply" - OMAP5+ optimized voltages in efuse(class0) VDD
+			    but no VBB.
+- reg: Address and length of the efuse register set for the device (mandatory
+	only for "ti,omap5-opp-supply")
+- ti,efuse-settings: An array of u32 tuple items providing information about
+	optimized efuse configuration. Each item consists of the following:
+	volt: voltage in uV - reference voltage (OPP voltage)
+	efuse_offseet: efuse offset from reg where the optimized voltage is stored.
+- ti,absolute-max-voltage-uv: absolute maximum voltage for the OPP supply.
+
+Example:
+
+/* Device Node (CPU)  */
+cpus {
+	cpu0: cpu@0 {
+		device_type = "cpu";
+
+		...
+
+		vdd-supply = <&vcc>;
+		vbb-supply = <&abb_mpu>;
+	};
+};
+
+/* OMAP OPP Supply with Class0 registers */
+opp_supply_mpu: opp_supply@4a003b20 {
+	compatible = "ti,omap5-opp-supply";
+	reg = <0x4a003b20 0x8>;
+	ti,efuse-settings = <
+	/* uV   offset */
+	1060000 0x0
+	1160000 0x4
+	1210000 0x8
+	>;
+	ti,absolute-max-voltage-uv = <1500000>;
+};
diff --git a/Documentation/devicetree/bindings/power/power_domain.txt b/Documentation/devicetree/bindings/power/power_domain.txt
index 14bd9e945ff6..f3355313c020 100644
--- a/Documentation/devicetree/bindings/power/power_domain.txt
+++ b/Documentation/devicetree/bindings/power/power_domain.txt
@@ -40,6 +40,12 @@ Optional properties:
   domain's idle states. In the absence of this property, the domain would be
   considered as capable of being powered-on or powered-off.
 
+- operating-points-v2 : Phandles to the OPP tables of power domains provided by
+  a power domain provider. If the provider provides a single power domain only
+  or all the power domains provided by the provider have identical OPP tables,
+  then this shall contain a single phandle. Refer to ../opp/opp.txt for more
+  information.
+
 Example:
 
 	power: power-controller@12340000 {
@@ -120,4 +126,63 @@ The node above defines a typical PM domain consumer device, which is located
 inside a PM domain with index 0 of a power controller represented by a node
 with the label "power".
 
+Optional properties:
+- required-opp: This contains phandle to an OPP node in another device's OPP
+  table. It may contain an array of phandles, where each phandle points to an
+  OPP of a different device. It should not contain multiple phandles to the OPP
+  nodes in the same OPP table. This specifies the minimum required OPP of the
+  device(s), whose OPP's phandle is present in this property, for the
+  functioning of the current device at the current OPP (where this property is
+  present).
+
+Example:
+- OPP table for domain provider that provides two domains.
+
+	domain0_opp_table: opp-table0 {
+		compatible = "operating-points-v2";
+
+		domain0_opp_0: opp-1000000000 {
+			opp-hz = /bits/ 64 <1000000000>;
+			opp-microvolt = <975000 970000 985000>;
+		};
+		domain0_opp_1: opp-1100000000 {
+			opp-hz = /bits/ 64 <1100000000>;
+			opp-microvolt = <1000000 980000 1010000>;
+		};
+	};
+
+	domain1_opp_table: opp-table1 {
+		compatible = "operating-points-v2";
+
+		domain1_opp_0: opp-1200000000 {
+			opp-hz = /bits/ 64 <1200000000>;
+			opp-microvolt = <975000 970000 985000>;
+		};
+		domain1_opp_1: opp-1300000000 {
+			opp-hz = /bits/ 64 <1300000000>;
+			opp-microvolt = <1000000 980000 1010000>;
+		};
+	};
+
+	power: power-controller@12340000 {
+		compatible = "foo,power-controller";
+		reg = <0x12340000 0x1000>;
+		#power-domain-cells = <1>;
+		operating-points-v2 = <&domain0_opp_table>, <&domain1_opp_table>;
+	};
+
+	leaky-device0@12350000 {
+		compatible = "foo,i-leak-current";
+		reg = <0x12350000 0x1000>;
+		power-domains = <&power 0>;
+		required-opp = <&domain0_opp_0>;
+	};
+
+	leaky-device1@12350000 {
+		compatible = "foo,i-leak-current";
+		reg = <0x12350000 0x1000>;
+		power-domains = <&power 1>;
+		required-opp = <&domain1_opp_1>;
+	};
+
 [1]. Documentation/devicetree/bindings/power/domain-idle-state.txt
diff --git a/Documentation/driver-api/pm/devices.rst b/Documentation/driver-api/pm/devices.rst
index 53c1b0b06da5..1128705a5731 100644
--- a/Documentation/driver-api/pm/devices.rst
+++ b/Documentation/driver-api/pm/devices.rst
@@ -777,17 +777,51 @@ The driver can indicate that by setting ``DPM_FLAG_SMART_SUSPEND`` in
 runtime suspend at the beginning of the ``suspend_late`` phase of system-wide
 suspend (or in the ``poweroff_late`` phase of hibernation), when runtime PM
 has been disabled for it, under the assumption that its state should not change
-after that point until the system-wide transition is over.  If that happens, the
-driver's system-wide resume callbacks, if present, may still be invoked during
-the subsequent system-wide resume transition and the device's runtime power
-management status may be set to "active" before enabling runtime PM for it,
-so the driver must be prepared to cope with the invocation of its system-wide
-resume callbacks back-to-back with its ``->runtime_suspend`` one (without the
-intervening ``->runtime_resume`` and so on) and the final state of the device
-must reflect the "active" status for runtime PM in that case.
+after that point until the system-wide transition is over (the PM core itself
+does that for devices whose "noirq", "late" and "early" system-wide PM callbacks
+are executed directly by it).  If that happens, the driver's system-wide resume
+callbacks, if present, may still be invoked during the subsequent system-wide
+resume transition and the device's runtime power management status may be set
+to "active" before enabling runtime PM for it, so the driver must be prepared to
+cope with the invocation of its system-wide resume callbacks back-to-back with
+its ``->runtime_suspend`` one (without the intervening ``->runtime_resume`` and
+so on) and the final state of the device must reflect the "active" runtime PM
+status in that case.
 
 During system-wide resume from a sleep state it's easiest to put devices into
 the full-power state, as explained in :file:`Documentation/power/runtime_pm.txt`.
-Refer to that document for more information regarding this particular issue as
+[Refer to that document for more information regarding this particular issue as
 well as for information on the device runtime power management framework in
-general.
+general.]
+
+However, it often is desirable to leave devices in suspend after system
+transitions to the working state, especially if those devices had been in
+runtime suspend before the preceding system-wide suspend (or analogous)
+transition.  Device drivers can use the ``DPM_FLAG_LEAVE_SUSPENDED`` flag to
+indicate to the PM core (and middle-layer code) that they prefer the specific
+devices handled by them to be left suspended and they have no problems with
+skipping their system-wide resume callbacks for this reason.  Whether or not the
+devices will actually be left in suspend may depend on their state before the
+given system suspend-resume cycle and on the type of the system transition under
+way.  In particular, devices are not left suspended if that transition is a
+restore from hibernation, as device states are not guaranteed to be reflected
+by the information stored in the hibernation image in that case.
+
+The middle-layer code involved in the handling of the device is expected to
+indicate to the PM core if the device may be left in suspend by setting its
+:c:member:`power.may_skip_resume` status bit which is checked by the PM core
+during the "noirq" phase of the preceding system-wide suspend (or analogous)
+transition.  The middle layer is then responsible for handling the device as
+appropriate in its "noirq" resume callback, which is executed regardless of
+whether or not the device is left suspended, but the other resume callbacks
+(except for ``->complete``) will be skipped automatically by the PM core if the
+device really can be left in suspend.
+
+For devices whose "noirq", "late" and "early" driver callbacks are invoked
+directly by the PM core, all of the system-wide resume callbacks are skipped if
+``DPM_FLAG_LEAVE_SUSPENDED`` is set and the device is in runtime suspend during
+the ``suspend_noirq`` (or analogous) phase or the transition under way is a
+proper system suspend (rather than anything related to hibernation) and the
+device's wakeup settings are suitable for runtime PM (that is, it cannot
+generate wakeup signals at all or it is allowed to wake up the system from
+sleep).
diff --git a/Documentation/power/pci.txt b/Documentation/power/pci.txt
index 704cd36079b8..8eaf9ee24d43 100644
--- a/Documentation/power/pci.txt
+++ b/Documentation/power/pci.txt
@@ -994,6 +994,17 @@ into D0 going forward), but if it is in runtime suspend in pci_pm_thaw_noirq(),
 the function will set the power.direct_complete flag for it (to make the PM core
 skip the subsequent "thaw" callbacks for it) and return.
 
+Setting the DPM_FLAG_LEAVE_SUSPENDED flag means that the driver prefers the
+device to be left in suspend after system-wide transitions to the working state.
+This flag is checked by the PM core, but the PCI bus type informs the PM core
+which devices may be left in suspend from its perspective (that happens during
+the "noirq" phase of system-wide suspend and analogous transitions) and next it
+uses the dev_pm_may_skip_resume() helper to decide whether or not to return from
+pci_pm_resume_noirq() early, as the PM core will skip the remaining resume
+callbacks for the device during the transition under way and will set its
+runtime PM status to "suspended" if dev_pm_may_skip_resume() returns "true" for
+it.
+
 3.2. Device Runtime Power Management
 ------------------------------------
 In addition to providing device power management callbacks PCI device drivers
diff --git a/Documentation/thermal/cpu-cooling-api.txt b/Documentation/thermal/cpu-cooling-api.txt
index 71653584cd03..7df567eaea1a 100644
--- a/Documentation/thermal/cpu-cooling-api.txt
+++ b/Documentation/thermal/cpu-cooling-api.txt
@@ -26,39 +26,16 @@ the user. The registration APIs returns the cooling device pointer.
    clip_cpus: cpumask of cpus where the frequency constraints will happen.
 
 1.1.2 struct thermal_cooling_device *of_cpufreq_cooling_register(
-	struct device_node *np, const struct cpumask *clip_cpus)
+					struct cpufreq_policy *policy)
 
     This interface function registers the cpufreq cooling device with
     the name "thermal-cpufreq-%x" linking it with a device tree node, in
     order to bind it via the thermal DT code. This api can support multiple
     instances of cpufreq cooling devices.
 
-    np: pointer to the cooling device device tree node
-    clip_cpus: cpumask of cpus where the frequency constraints will happen.
+    policy: CPUFreq policy.
 
-1.1.3 struct thermal_cooling_device *cpufreq_power_cooling_register(
-    const struct cpumask *clip_cpus, u32 capacitance,
-    get_static_t plat_static_func)
-
-Similar to cpufreq_cooling_register, this function registers a cpufreq
-cooling device.  Using this function, the cooling device will
-implement the power extensions by using a simple cpu power model.  The
-cpus must have registered their OPPs using the OPP library.
-
-The additional parameters are needed for the power model (See 2. Power
-models).  "capacitance" is the dynamic power coefficient (See 2.1
-Dynamic power).  "plat_static_func" is a function to calculate the
-static power consumed by these cpus (See 2.2 Static power).
-
-1.1.4 struct thermal_cooling_device *of_cpufreq_power_cooling_register(
-    struct device_node *np, const struct cpumask *clip_cpus, u32 capacitance,
-    get_static_t plat_static_func)
-
-Similar to cpufreq_power_cooling_register, this function register a
-cpufreq cooling device with power extensions using the device tree
-information supplied by the np parameter.
-
-1.1.5 void cpufreq_cooling_unregister(struct thermal_cooling_device *cdev)
+1.1.3 void cpufreq_cooling_unregister(struct thermal_cooling_device *cdev)
 
     This interface function unregisters the "thermal-cpufreq-%x" cooling device.
 
@@ -67,20 +44,14 @@ information supplied by the np parameter.
 2. Power models
 
 The power API registration functions provide a simple power model for
-CPUs.  The current power is calculated as dynamic + (optionally)
-static power.  This power model requires that the operating-points of
+CPUs.  The current power is calculated as dynamic power (static power isn't
+supported currently).  This power model requires that the operating-points of
 the CPUs are registered using the kernel's opp library and the
 `cpufreq_frequency_table` is assigned to the `struct device` of the
 cpu.  If you are using CONFIG_CPUFREQ_DT then the
 `cpufreq_frequency_table` should already be assigned to the cpu
 device.
 
-The `plat_static_func` parameter of `cpufreq_power_cooling_register()`
-and `of_cpufreq_power_cooling_register()` is optional.  If you don't
-provide it, only dynamic power will be considered.
-
-2.1 Dynamic power
-
 The dynamic power consumption of a processor depends on many factors.
 For a given processor implementation the primary factors are:
 
@@ -119,79 +90,3 @@ mW/MHz/uVolt^2.  Typical values for mobile CPUs might lie in range
 from 100 to 500.  For reference, the approximate values for the SoC in
 ARM's Juno Development Platform are 530 for the Cortex-A57 cluster and
 140 for the Cortex-A53 cluster.
-
-
-2.2 Static power
-
-Static leakage power consumption depends on a number of factors.  For a
-given circuit implementation the primary factors are:
-
-- Time the circuit spends in each 'power state'
-- Temperature
-- Operating voltage
-- Process grade
-
-The time the circuit spends in each 'power state' for a given
-evaluation period at first order means OFF or ON.  However,
-'retention' states can also be supported that reduce power during
-inactive periods without loss of context.
-
-Note: The visibility of state entries to the OS can vary, according to
-platform specifics, and this can then impact the accuracy of a model
-based on OS state information alone.  It might be possible in some
-cases to extract more accurate information from system resources.
-
-The temperature, operating voltage and process 'grade' (slow to fast)
-of the circuit are all significant factors in static leakage power
-consumption.  All of these have complex relationships to static power.
-
-Circuit implementation specific factors include the chosen silicon
-process as well as the type, number and size of transistors in both
-the logic gates and any RAM elements included.
-
-The static power consumption modelling must take into account the
-power managed regions that are implemented.  Taking the example of an
-ARM processor cluster, the modelling would take into account whether
-each CPU can be powered OFF separately or if only a single power
-region is implemented for the complete cluster.
-
-In one view, there are others, a static power consumption model can
-then start from a set of reference values for each power managed
-region (e.g. CPU, Cluster/L2) in each state (e.g. ON, OFF) at an
-arbitrary process grade, voltage and temperature point.  These values
-are then scaled for all of the following: the time in each state, the
-process grade, the current temperature and the operating voltage.
-However, since both implementation specific and complex relationships
-dominate the estimate, the appropriate interface to the model from the
-cpu cooling device is to provide a function callback that calculates
-the static power in this platform.  When registering the cpu cooling
-device pass a function pointer that follows the `get_static_t`
-prototype:
-
-    int plat_get_static(cpumask_t *cpumask, int interval,
-                        unsigned long voltage, u32 &power);
-
-`cpumask` is the cpumask of the cpus involved in the calculation.
-`voltage` is the voltage at which they are operating.  The function
-should calculate the average static power for the last `interval`
-milliseconds.  It returns 0 on success, -E* on error.  If it
-succeeds, it should store the static power in `power`.  Reading the
-temperature of the cpus described by `cpumask` is left for
-plat_get_static() to do as the platform knows best which thermal
-sensor is closest to the cpu.
-
-If `plat_static_func` is NULL, static power is considered to be
-negligible for this platform and only dynamic power is considered.
-
-The platform specific callback can then use any combination of tables
-and/or equations to permute the estimated value.  Process grade
-information is not passed to the model since access to such data, from
-on-chip measurement capability or manufacture time data, is platform
-specific.
-
-Note: the significance of static power for CPUs in comparison to
-dynamic power is highly dependent on implementation.  Given the
-potential complexity in implementation, the importance and accuracy of
-its inclusion when using cpu cooling devices should be assessed on a
-case by case basis.
-